• PCIe MSI-X interrupts

    From MitchAlsup1@21:1/5 to All on Fri Jun 21 20:35:32 2024
    PCIe has an MSI-X interrupt 'capability' which consists of
    a number (n) of interrupt descriptors and an associated Pending
    Bit Array where each bit in PBA has a corresponding 128-bit
    descriptor. A descriptor contains a 64-bit address, a 32-bit
    message, and a 32-bit vector control word.

    There are 2-levels of enablement, one at the MSI-X configura-
    tion control register and one in each interrupt descriptor at
    vector control bit[31].

    As the device raises an interrupt, it sets a bit in PBA.

    When MSI-X is enabled and a bit in PBA is set (1) and the
    vector control bit[31] is enabled, the device sends a
    write of the message to the address in the descriptor,
    and clears the bit in PBA.
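
    For concreteness, a minimal C sketch of the structures just
    described, following the layout in the PCIe base specification;
    the struct and helper names below are illustrative, not taken
    from any particular driver:

    #include <stdint.h>

    /* One MSI-X table entry ("interrupt descriptor" above): a 64-bit
     * message address, 32-bit message data, and a 32-bit vector control
     * word.  The PCIe base spec defines the per-vector Mask bit as bit 0
     * of vector control. */
    struct msix_table_entry {
        uint32_t msg_addr_lo;   /* message address [31:0]  */
        uint32_t msg_addr_hi;   /* message address [63:32] */
        uint32_t msg_data;      /* 32-bit message          */
        uint32_t vector_ctrl;   /* vector control word     */
    };

    #define MSIX_VECTOR_CTRL_MASK  (1u << 0)   /* per-vector mask bit */

    /* The Pending Bit Array: one bit per table entry, packed into 64-bit
     * words.  A bit is set while delivery of that vector is blocked and
     * cleared once the message write is actually sent. */
    static inline int msix_pending(const uint64_t *pba, unsigned vec)
    {
        return (pba[vec / 64] >> (vec % 64)) & 1;
    }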

    I am assuming that the MSI-X enable bit is used to throttle
    a device so that it sends bursts of interrupts to optimize
    the caching behavior of the cores handling the interrupts:
    run applications->handle k interrupts->run applications.
    A home machine would not use this feature as the interrupt
    load is small, but a GB server might want more control over
    when interrupts arrive. But does anybody know ??

    a) device command to interrupt descriptor mapping {
    There is no mention of the mapping of commands to the device
    and to these interrupt descriptors. Can anyone supply input
    or pointers to this mapping?

    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, the device sends the stored interrupt message to the stored
    service port.
    }
    I don't really NEED to know this mapping, but knowing would
    significantly enhance my understanding of what is supposed
    to be going on, and thus avoid making crippling errors.

    b) address space of interrupt service port {
    The address in the interrupt descriptor points at a service
    port (APIC). Since a service port is "not like memory"*, I
    want to mandate this address be in MMI/O space, and since
    My 66000 has a full 64-bit address space for MMI/O there is
    no burden on the size of MMI/O space--it is already as big
    as possible on a 64-bit machine. Plus, MMI/O space has the
    property of being sequentially consistent whereas DRAM is
    only cache consistent.

    Most current architectures just partition a hunk of the
    physical address space as MMI/O address space.

    (*) memory has the property that a read will return the last
    bit pattern written, a service port does not.

    I assume that service port addresses map to different cores
    (or local APICs of a core). I want to directly support the
    notion of a virtual core so while a 'chip' might have a large
    number of physical cores, one would want a pool of thousands+
    of virtual cores. I want said service ports to support raising
    interrupts directly to a physical or virtual core.
    }

    Apparently, the message part of the MSI-X interrupt can be
    interpreted any way that both SW and HW agree. This works
    for already defined architectures, and doing it like one
    or more of them makes an OS port significantly easier.
    However, what these messages contain is difficult to find
    via Google.

    So, it seems to me, that the combination of the 64-bit address
    and the 32-bit message must provide::
    a) which level of the system to interrupt
    {Secure Monitor, HyperVisor, SuperVisor, Application}
    b) which core should handle the interrupt
    {physical[0..k], virtual[l..m]}
    c) what priority level is the interrupt.
    {There are 64 unique priority levels}
    d) something about why the interrupt was raised
    {what remains of the message}

    I suspect that (a) and (b) are parts of the address while (c)
    and (d) are part of the message. Although nothing prevents
    (c) from being part of the address.
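
    To make the field widths concrete, here is a purely hypothetical
    packing along the lines suspected above: (a) and (b) carried in the
    64-bit address, (c) and (d) in the 32-bit message. Nothing in the
    PCIe spec, nor in My 66000, mandates this layout; every shift and
    name below is invented for illustration only.

    #include <stdint.h>

    static inline uint64_t pack_service_port_address(uint64_t mmio_base,
                                                     unsigned level,   /* (a) 2 bits: SM/HV/SV/App   */
                                                     unsigned core_id) /* (b) physical or virtual id */
    {
        return mmio_base | ((uint64_t)(level & 3) << 18)
                         | ((uint64_t)core_id     << 4);
    }

    static inline uint32_t pack_message(unsigned priority,  /* (c) 6 bits, 64 levels */
                                        unsigned reason)    /* (d) whatever remains  */
    {
        return ((uint32_t)(priority & 0x3f) << 26) | (reason & 0x03ffffffu);
    }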

    Once MSI-X is sorted out MSI becomes a subset.

    HostBridge has a service port that provides INT[A,B,C,D] to
    MSI-X translation, so only MSI-X messages are used system-
    wide.

    ------------------------------------------------------------

    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    It seems to me, that the HyperVisor would want to perform ISR
    processing of the interrupt (low latency) and then schedule
    the softIRQs to the <sleeping> core so when it regains control
    the pending I/O stack of "stuff" is properly cleaned up.

    So, should all interrupts simply go to the HyperVisor and let HV
    sort it all out? Or can the <sleeping> virtual core just deal
    with it when it is given a next time slice ??

    Now, if there were a way to cascade interrupts such that if
    an interrupt was routed to a <sleeping virtual core> that
    some kind of "poke in the side" of a HyperVisor would cause
    HV to find a next time slice for the <sleeping> core ex post
    haste, and just let the core deal with the interrupt !!
    Presto, any privilege level can handle its own interrupts.
    }

    Comments ??

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Jun 21 22:00:56 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    PCIe has an MSI-X interrupt 'capability' which consists of
    a number (n) of interrupt descriptors and an associated Pending
    Bit Array where each bit in PBA has a corresponding 128-bit
    descriptor. A descriptor contains a 64-bit address, a 32-bit
    message, and a 32-bit vector control word.

    There are 2-levels of enablement, one at the MSI-X configura-
    tion control register and one in each interrupt descriptor at
    vector control bit[31].

    As the device raises an interrupt, it sets a bit in PBA.

    When MSI-X is enabled and a bit in PBA is set (1) and the
    vector control bit[31] is enabled, the device sends a
    write of the message to the address in the descriptor,
    and clears the bit in PBA.

    Note that if the interrupt condition is asserted after the
    global enable in the MSI-X capability and the vector enable
    have both been set to allow delivery, the message will be sent to
    the root complex and PBA will not be updated. (P is for
    pending, and once the message is sent, it's no longer
    pending). PBA is only updated when the interrupt is masked
    (either function-wide in the capability or per-vector).



    I am assuming that the MSI-X enable bit is used to throttle

    In my experience the MSI-X function enable and vector enables
    are not modified during runtime, rather the device has control
    registers which allow masking of the interrupt (e.g.
    for AHCI, the MSI message will only be sent if the port
    PxIE (Port n Interrupt Enable) bit corresponding to a
    PxIS (Port n Interrupt Status) bit is set).

    Granted, AHCI specifies MSI, not MSI-X, but every MSI-X
    device I've worked with operates the same way, with
    device specific interrupt enables for a particular vector.
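
    As a rough C sketch of that device-level gate (register names from
    the AHCI spec, the helper itself illustrative): a port only generates
    its interrupt message while a set status bit has its corresponding
    enable bit set.

    #include <stdint.h>

    /* Nonzero when the port should send its interrupt message: some PxIS
     * status bit is set whose PxIE enable bit is also set. */
    static inline int ahci_port_irq_pending(uint32_t pxis, uint32_t pxie)
    {
        return (pxis & pxie) != 0;
    }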

    a device so that it sends bursts of interrupts to optimize
    the caching behavior of the cores handling the interrupts:
    run applications->handle k interrupts->run applications.
    A home machine would not use this feature as the interrupt
    load is small, but a GB server might want more control over
    when interrupts arrive. But does anybody know ??

    Yes, we use MSI-X extensively. See above.

    There are a number of mechanisms used for interrupt moderation,
    but all generally are independent of the PCI message delivery.
    (e.g. RSS spreads interrupts across multiple target cores,
    or the Intel 10Ge network adapters interrupt moderation feature).


    a) device command to interrupt descriptor mapping {
    There is no mention of the mapping of commands to the device
    and to these interrupt descriptors. Can anyone supply input
    or pointers to this mapping?

    Once the message leaves the device, is received by the
    root complex port and is forwarded across the host bridge
    to the system fabric, it's completely under control of
    the host. On x86, the TLP for the upstream message is
    received and forwarded to the specified address (which is
    the IOAPIC on Intel and the GIC ITS on Arm64).

    The interrupt controller may further mask the interrupt if
    desired or if the interrupt priority is lower than the
    current running priority.


    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, the device sends the stored interrupt message to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI) controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    Note that support for MSI in AHCI is optional (in which case the
    legacy level sensitive PCI INTA/B/C/D signals are used).

    The AHCI standard specification (ahci_1_3.pdf) is available publicly.

    }
    I don't really NEED to know this mapping, but knowing would
    significantly enhance my understanding of what is supposed
    to be going on, and thus avoid making crippling errors.

    b) address space of interrupt service port {
    The address in the interrupt descriptor points at a service
    port (APIC). Since a service port is "not like memory"*, I
    want to mandate this address be in MMI/O space, and since
    My 66000 has a full 64-bit address space for MMI/O there is
    no burden on the size of MMI/O space--it is already as big
    as possible on a 64-bit machine. Plus, MMI/O space has the
    property of being sequentially consistent whereas DRAM is
    only cache consistent.

    From the standpoint of the PCIexpress root port, the upstream write
    generated by the device to send the MSI message to the host
    looks just like any other inbound DMA from the device to the
    host. It is the responsibility of the host bridge and interconnect to
    route the message to the appropriate destination (which generally
    is an interrupt controller, but just as legally could be a
    DRAM address which software polls periodically).


    Most current architectures just partition a hunk of the
    physical address space as MMI/O address space.

    The address field in the MSI-X vector (or MSI-X capability)
    is opaque to hardware below the PCIe root port.

    Our chips recognize the interrupt controller range of
    addresses in the inbound message at the host bridge
    and route the message to the interrupt translation service;
    the destinations in the interrupt controller are simply
    control and status registers in the MMIO space. The
    ARM64 interrupt controller supports multiple destinations
    with different semantics (SPI and xSPI have one target
    register and LPI has a different target register the address
    of which is programmed into the MSI-X Vector address field).



    (*) memory has the property that a read will return the last
    bit pattern written, a service port does not.

    I assume that service port addresses map to different cores
    (or local APICs of a core).

    The IOAPIC handles the message and has configuration registers
    that determine which lAPIC should be signalled.

    The GIC has configuration tables in memory that can remap
    the interrupt to a different vector (e.g. for a guest VM).


    I want to directly support the
    notion of a virtual core so while a 'chip' might have a large
    number of physical cores, one would want a pool of thousands+
    of virtual cores. I want said service ports to support raising
    interrupt directly to a physical or virtual core.

    Take a look at IHI0069 (https://developer.arm.com/documentation/ihi0069/latest/)

    }

    Apparently, the message part of the MSI-X interrupt can be
    interpreted any way that both SW and HW agree.

    Yes.

    This works
    for already defined architectures, and doing it like one
    or more others, makes an OS port significantly easier.
    However what these messages contain is difficult to find
    via Google.

    The message is a 32-bit field and it is fully interpreted by
    the interrupt controller (the GIC can be configured to support
    from 16 to 32 bits of data payload in an upstream MSI-X write;
    the interpretation of the data is host specific).

    On Intel and ARM systems, the firmware knows the grungy details
    and simply passes the desired payload value to the kernel
    via the device tree (Linux) or ACPI tables (for Windows/Linux).
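
    One concrete, publicly documented example of such an agreement is the
    legacy x86 MSI encoding from the Intel SDM, sketched below; it is
    shown only to illustrate the kind of information an address/data pair
    can carry. The GIC, as noted, instead treats the data as an opaque
    interrupt/event number.

    #include <stdint.h>

    /* Minimal x86 MSI address/data encoding (fixed destination mode,
     * fixed delivery mode, edge triggered).  Per the Intel SDM, address
     * bits 31:20 are 0xFEE and bits 19:12 carry the target local APIC
     * ID; the low byte of the data is the vector number. */
    static inline uint32_t x86_msi_address(uint8_t dest_apic_id)
    {
        return 0xFEE00000u | ((uint32_t)dest_apic_id << 12);
    }

    static inline uint32_t x86_msi_data(uint8_t vector)
    {
        return vector;
    }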

    So, it seems to me, that the combination of the 64-bit address
    and the 32-bit message must provide::
    a) which level of the system to interrupt
    {Secure Monitor, HyperVisor, SuperVisor, Application}

    No. That's completely a function of the interrupt controller
    and how the hardware handles the data payload.

    b) which core should handle the interrupt
    {physical[0..k], virtual[l..m]}

    Again, a function of the interrupt controller.

    c) what priority level is the interrupt.
    {There are 64 unique priority levels}

    Yep, a function of the interrupt controller.

    d) something about why the interrupt was raised

    The interrupt itself causes the operating system
    device driver interrupt function to be invoked. The
    device-specific interrupt handler determines both
    why the interrupt was raised (e.g. via the PxIS
    register in the AHCI/SATA controller) and takes
    the appropriate action.

    On ARM64, it is common for the data field for
    the MSI-X interrupts to number starting at zero
    on every device, and they're mapped to a system-wide
    unique value by the interrupt controller (e.g.
    the GICv4 ITS). If interrupt remapping hardware is
    not available then unique data payloads for each
    device need to be used.

    Note that like any other inbound DMA, the address
    in the MSI-X TLP that gets sent to the host bridge is subject
    to translation by an IOMMU before getting to the
    interrupt controller (or by the device itself if it
    supports PCI-e Address Translation Services (ATS)).



    {what remains of the message}

    I suspect that (a) and (b) are parts of the address while (c)
    and (d) are part of the message. Although nothing prevents
    (c) from being part of the address.

    Once MSI-X is sorted out MSI becomes a subset.

    HostBridge has a service port that provides INT[A,B,C,D] to
    MSI-X translation, so only MSI-X messages are used system-
    wide.

    Note that INTA/B/C/D are level-sensitive. This requires
    TWO MSI-X vectors - one that targets an "interrupt set"
    register and the other targets an "interrupt clear"
    register.


    ------------------------------------------------------------

    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    On Intel the IOMMU translation tables are not shared with the
    AP.

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    This allows each device capable of inbound DMA to identify
    itself uniquely to the interrupt controller and IOMMU.

    Both intel and AMD use this convention.
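
    A one-line helper makes that layout concrete (the function name is
    illustrative):

    #include <stdint.h>

    /* function in <2:0>, device in <7:3>, bus in <15:8>, segment above */
    static inline uint32_t make_stream_id(uint32_t segment, uint32_t bus,
                                          uint32_t dev, uint32_t fn)
    {
        return (segment << 16) | (bus << 8) | (dev << 3) | fn;
    }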


    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    No, you're not allowed to do anything not explicitly allowed
    in the PCI express specification. Remember, an MSI-X write
    generated by the device is indistinguishable from any other
    upstream DMA request initiated by the device.


    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    See again the document referenced above. The interrupt controller
    is aware that the guest is not currently scheduled and maintains
    a virtual pending state (and can optionally signal the hypervisor
    that the guest should be scheduled ASAP).

    Most of this is done completely by the hardware, without any
    intervention by the hypervisor for the vast majority of
    interrupts.



    It seems to me, that the HyperVisor would want to perform ISR
    processing of the interrupt (low latency) and then schedule
    the softIRQs to the <sleeping> core so when it regains control
    the pending I/O stack of "stuff" is properly cleaned up.

    So, should all interrupts simply go to the HyperVisor and let HV
    sort it all out? Or can the <sleeping> virtual core just deal
    with it when it is given a next time slice ??

    The original GIC did something like this (the HV took all
    interrupts and there was a hardware mechanism to inject them
    into a guest as if they were a hardware interrupt). But
    it was too much overhead going through the hypervisor, especially
    when the endpoint device supports the SRIOV capability. So the
    GIC supports handling virtual interrupt delivery completely
    in hardware unless the guest is not currently resident on any
    virtual CPU.

  • From Scott Lurndal@21:1/5 to Scott Lurndal on Fri Jun 21 22:28:19 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:


    When MSI-X is enabled and a bit in PBA is set (1) and the
    vector control bit[31] is enabled, the device sends a
    write of the message to the address in the descriptor,
    and clears the bit in PBA.

    Note that if the interrupt condition is asserted after the
    global enable in the MSI-X capability and the vector enable
    have both been set to allow delivery, the message will be sent to
    the root complex and PBA will not be updated. (P is for
    pending, and once the message is sent, it's no longer
    pending). PBA is only updated when the interrupt is masked
    (either function-wide in the capability or per-vector).

    These are the gates to interrupt delivery on a typical
    ARM-based system, from closest to the device to furthest.

    1) The device interrupt enable register (e.g. AHCI P0IE)
    2) The MSI-X Vector enable (in each vector control register)
    3) The MSI-X PCI-Express Capability enable (MSI-X enable and
    function mask in the MSI-X Capability message control field)
    4) The PCI configuration space COMMAND register [BME]
    (bus master enable) bit must be set

    These first four steps are handled by the PCI endpoint hardware
    before posting the upstream write TLP. If conditions (2), (3)
    or (4) do not hold, then the PBA bit will be set and
    the message will be sent when the conditions allow. If (1)
    does not hold, then the device status register will hold the
    state until the interrupt is unmasked in the device.

    5) The interrupt controller per-interrupt enable bit(s)
    (for GIC: the SPI, eSPI enable registers, indexed by interrupt
    number, or the LPI properties byte enable bit, indexed
    into a DRAM table by LPI number (range 8192 - 2^24)). SPIs
    are generally used for level-sensitive or latency-sensitive
    interrupts and are implemented as wires.
    6) The interrupt group enable. Interrupts are grouped by delivery
    mechanism (there are two CPU interrupt signals, IRQ and FIQ)
    and security state.
    7) The Target processor enable (in interrupt controller)
    8) The interrupt priority is greater than any currently
    being processed.
    9) The processor PSR interrupt mask.
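
    Condensed into a sketch, the nine gates amount to one long AND; each
    field below stands in for the register or table lookup listed above
    and is not a real API.

    #include <stdbool.h>

    struct irq_gates {
        bool dev_enable;     /* 1: device-level enable, e.g. AHCI PxIE        */
        bool vec_enable;     /* 2: MSI-X per-vector enable                    */
        bool fn_enable;      /* 3: MSI-X capability enable, function unmasked */
        bool bus_master;     /* 4: PCI COMMAND register BME bit               */
        bool intc_enable;    /* 5: per-interrupt enable in the controller     */
        bool group_enable;   /* 6: interrupt group enable (IRQ/FIQ, security) */
        bool target_enable;  /* 7: target processor enable                    */
        bool prio_ok;        /* 8: priority beats the current running priority*/
        bool cpu_unmasked;   /* 9: PSR/PSTATE interrupt mask bits clear       */
    };

    static bool interrupt_delivered(const struct irq_gates *g)
    {
        return g->dev_enable && g->vec_enable && g->fn_enable &&
               g->bus_master && g->intc_enable && g->group_enable &&
               g->target_enable && g->prio_ok && g->cpu_unmasked;
    }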

  • From MitchAlsup1@21:1/5 to All on Sat Jun 22 01:51:27 2024
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    At some level in architecture, you have to look both ways
    and amalgamate the meanings, such that the common meaning
    is useful looking in either direction.

    Not a miff--just a statement of what architecture is.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jun 22 01:12:50 2024
    Scott Lurndal wrote:

    First of all, allow me to express my gratitude for such a well
    thought out response, compared to the miscellaneous ramblings
    going on in my head.

    mitchalsup@aol.com (MitchAlsup1) writes:
    PCIe has an MSI-X interrupt 'capability' which consists of
    a number (n) of interrupt descriptors and an associated Pending
    Bit Array where each bit in PBA has a corresponding 128-bit
    descriptor. A descriptor contains a 64-bit address, a 32-bit
    message, and a 32-bit vector control word.

    There are 2-levels of enablement, one at the MSI-X configura-
    tion control register and one in each interrupt descriptor at
    vector control bit[31].

    As the device raises an interrupt, it sets a bit in PBA.

    When MSI-X is enabled and a bit in PBA is set (1) and the
    vector control bit[31] is enabled, the device sends a
    write of the message to the address in the descriptor,
    and clears the bit in PBA.

    Note that if the interrupt condition is asserted after the
    global enable in the MSI-X capability and the vector enable
    have both been set to allow delivery, the message will be sent to
    the root complex and PBA will not be updated. (P is for
    pending, and once the message is sent, it's no longer
    pending). PBA is only updated when the interrupt is masked
    (either function-wide in the capability or per-vector).

    So, the interrupt only becomes pending in the PBA if it cannot be
    sent immediately. Thanks for the clarification.


    I am assuming that the MSI-X enable bit is used to throttle

    In my experience the MSI-X function enable and vector enables
    are not modified during runtime, rather the device has control
    registers which allow masking of the interrupt (e.g.
    for AHCI, the MSI message will only be sent if the port
    PxIE (Port n Interrupt Enable) bit corresponding to a
    PxIS (Port n Interrupt Status) bit is set).

    So, these degenerated into more masking levels that are not
    used very often because other masks can be applied elsewhere.

    Granted, AHCI specifies MSI, not MSI-X, but every MSI-X
    device I've worked with operates the same way, with
    device specific interrupt enables for a particular vector.

    a device so that it sends bursts of interrupts to optimize
    the caching behavior of the cores handling the interrupts:
    run applications->handle k interrupts->run applications.
    A home machine would not use this feature as the interrupt
    load is small, but a GB server might want more control over
    when interrupts arrive. But does anybody know ??

    Yes, we use MSI-X extensively. See above.

    There are a number of mechanisms used for interrupt moderation,
    but all generally are independent of the PCI message delivery.
    (e.g. RSS spreads interrupts across multiple target cores,
    or the Intel 10Ge network adapters interrupt moderation feature).


    a) device command to interrupt descriptor mapping {
    There is no mention of the mapping of commands to the device
    and to these interrupt descriptors. Can anyone supply input
    or pointers to this mapping?

    Once the message leaves the device, is received by the
    root complex port and is forwarded across the host bridge
    to the system fabric, it's completely under control of
    the host. On x86, the TLP for the upstream message is
    received and forwarded to the specified address (which is
    the IOAPIC on Intel and the GIC ITS on Arm64).

    The interrupt controller may further mask the interrupt if
    desired or if the interrupt priority is lower than the
    current running priority.

    {note to self:: that is why it's a local APIC--it has to be close
    enough to see the core's priority.}

    Question:: Down below you talk of the various interrupt control-
    lers routing an interrupt <finally> to a core. What happens if the
    core has changed its priority by the time the interrupt signal
    arrives, but before it can change the state of the tables in the
    interrupt controller that routed said interrupt here ?


    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, the device sends the stored interrupt message to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI)
    controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    I see (below) that you (they) migrated all the stuff I thought might
    be either in the address or data to the "other side" of HostBridge.
    Fair enough.

    For what reason are there multiple addresses, instead of a range
    of addresses providing a more globally-scoped service port?
    Perhaps it is an address at the interrupt descriptor, and an
    address range at the global interrupt controller. Where different
    addresses then mean different things.

    Note that support for MSI in AHCI is optional (in which case the
    legacy level sensitive PCI INTA/B/C/D signals are used).

    The AHCI standard specification (ahci_1_3.pdf) is available publicly.

    }
    I don't really NEED to know this mapping, but knowing would
    significantly enhance my understanding of what is supposed
    to be going on, and thus avoid making crippling errors.

    b) address space of interrupt service port {
    The address in the interrupt descriptor points at a service
    port (APIC). Since a service port is "not like memory"*, I
    want to mandate this address be in MMI/O space, and since
    My 66000 has a full 64-bit address space for MMI/O there is
    no burden on the size of MMI/O space--it is already as big
    as possible on a 64-bit machine. Plus, MMI/O space has the
    property of being sequentially consistent whereas DRAM is
    only cache consistent.

    From the standpoint of the PCIexpress root port, the upstream write
    generated by the device to send the MSI message to the host
    looks just like any other inbound DMA from the device to the
    host. It is the responsibility of the host bridge and interconnect to
    route the message to the appropriate destination (which generally
    is an interrupt controller, but just as legally could be a
    DRAM address which software polls periodically).

    So the message arriving at the top of the PCIe tree is raw; then
    the address gets translated by the I/O MMU, and both the translated
    address and the raw data are passed forward to their fate.


    Most current architectures just partition a hunk of the
    physical address space as MMI/O address space.

    The address field in the MSI-X vector (or MSI-X capability)
    is opaque to hardware below the PCIe root port.

    Our chips recognize the interrupt controller range of
    addresses in the inbound message at the host bridge
    and route the message to the interrupt translation service;
    the destinations in the interrupt controller are simply
    control and status registers in the MMIO space. The
    ARM64 interrupt controller supports multiple destinations
    with different semantics (SPI and xSPI have one target
    register and LPI has a different target register the address
    of which is programmed into the MSI-X Vector address field).

    What I am trying to do is to figure out a means to route the
    message to a virtual core's interrupt table such that:: if that
    virtual core happens to be running on any physical core, that
    the physical core sees the interrupt without delay, and if
    the virtual core is not running, the event is properly logged
    so when the virtual core runs on a physical core that those
    ISRs are performed before any lower priority work is performed.

    {and make this work for any number of physical cores and any
    number of virtual cores; where cores can share interrupt
    tables. For example, Guest OS[k] thinks that it has 13 cores
    and shares its interrupt table across 5 of them, but HyperVisor
    remains free to time slice Guest OS[k] cores any way it likes.}
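
    A purely hypothetical sketch of that routing goal, assuming a
    per-virtual-core structure that records where (if anywhere) the
    virtual core is currently resident and logs interrupts by priority
    when it is not; none of these types exist in My 66000 or in any
    shipping design.

    #include <stdbool.h>
    #include <stdint.h>

    struct vcore {
        int      phys_core;    /* physical core it is running on, or -1    */
        uint64_t pending[64];  /* logged interrupts, one word per priority */
    };

    /* Returns true when a physical core should be poked immediately;
     * otherwise the interrupt is only logged for the next time slice. */
    static bool raise_to_vcore(struct vcore *vc, unsigned prio, unsigned src)
    {
        vc->pending[prio & 63] |= 1ull << (src & 63);
        return vc->phys_core >= 0;
    }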


    (*) memory has the property that a read will return the last
    bit pattern written, a service port does not.

    I assume that service port addresses map to different cores
    (or local APICs of a core).

    The IOAPIC handles the message and has configuration registers
    that determine which lAPIC should be signalled.

    The GIC has configuration tables in memory that can remap
    the interrupt to a different vector (e.g. for a guest VM).

    GIC = Global Interrupt Controller ?

    I want to directly support the
    notion of a virtual core so while a 'chip' might have a large
    number of physical cores, one would want a pool of thousands+
    of virtual cores. I want said service ports to support raising
    interrupt directly to a physical or virtual core.

    Take a look at IHI0069 (https://developer.arm.com/documentation/ihi0069/latest/)

    }

    Apparently, the message part of the MSI-X interrupt can be
    interpreted any way that both SW and HW agree.

    Yes.

    This works
    for already defined architectures, and doing it like one
    or more others, makes an OS port significantly easier.
    However what these messages contain is difficult to find
    via Google.

    The message is a 32-bit field and it is fully interpreted by
    the interrupt controller (The GIC can be configured to support
    from 16 to 32-bits data payload in an upstream MSI-X write;
    the interpretation of the data is host specific).

    On intel and ARM systems, the firmware knows the grungy details
    and simply passes the desired payload value to the kernel
    via the device tree(linux) or ACPI tables (for windows/linux).

    So, it seems to me, that the combination of the 64-bit address
    and the 32-bit message must provide::
    a) which level of the system to interrupt
    {Secure Monitor, HyperVisor, SuperVisor, Application}

    No. That's completely a function of the interrupt controller
    and how the hardware handles the data payload.

    b) which core should handle the interrupt
    {physical[0..k], virtual[l..m]}

    Again, a function of the interrupt controller.

    c) what priority level is the interrupt.
    {There are 64 unique priority levels}

    Yep, a function of the interrupt controller.

    d) something about why the interrupt was raised

    The interrupt itself causes the operating system
    device driver interrupt function to be invoked. The
    device-specific interrupt handler determines both
    why the interrupt was raised (e.g. via the PxIS
    register in the AHCI/SATA controller) and takes
    the appropriate action.

    On ARM64, it is common for the data field for
    the MSI-X interrupts to number starting at zero
    on every device, and they're mapped to a system-wide
    unique value by the interrupt controller (e.g.
    the GICv4 ITS).

    I was expecting that.

    If interrupt remapping hardware is
    not available then unique data payloads for each
    device need to be used.

    Note that like any other inbound DMA, the address
    in the MSI-X TLP that gets sent to the host bridge is subject
    to translation by an IOMMU before getting to the
    interrupt controller (or by the device itself if it
    supports PCI-e Address Translation Services (ATS)).

    Obviously.

    {what remains of the message}

    I suspect that (a) and (b) are parts of the address while (c)
    and (d) are part of the message. Although nothing prevents
    (c) from being part of the address.

    Once MSI-X is sorted out MSI becomes a subset.

    HostBridge has a service port that provides INT[A,B,C,D] to
    MSI-X translation, so only MSI-X messages are used system-
    wide.

    Note that INTA/B/C/D are level-sensitive. This requires
    TWO MSI-X vectors - one that targets an "interrupt set"
    register and the other targets an "interrupt clear"
    register.

    Gotcha.


    ------------------------------------------------------------

    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    On Intel the IOMMU translation tables are not shared with the
    AP.

    I have seen in the past 3 days AP being used to point at a
    random device out on the PCIe tree and at the unprivileged
    application layer. Both ends of the spectrum. Which is your
    usage ?

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    But (now with the new CXL) instead of allocating 200+ pins
    to DRAM those pins can be allocated to PCIe links; making any
    chip much less dependent on which DRAM technology, which chip-
    to-chip repeaters,... So, the thought is all I/O is PCIe + CXL;
    and about the only other pins the chip gets are RESET and ClockIn.

    Bunches of these pins can be 'configured' into standard width
    PCIe links (at least until one runs out of pins.)

    Given that one has a PCIe root complex with around 256-pins
    available, does one need multiple roots of such a wide tree ?

    This allows each device capable of inbound DMA to identify
    themselves uniquely to the interrupt controller and IOMMU.

    Both intel and AMD use this convention.


    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    No, you're not allowed to do anything not explicitly allowed
    in the PCI express specification. Remember, an MSI-X write
    generated by the device is indistinguishable from any other
    upstream DMA request initiated by the device.

    Why did the PCI committee specify a 32-bit container and define the
    use of only 1 bit ?? Or are more bits defined but I just haven't
    run into any literature concerning those ?


    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    See again the document referenced above. The interrupt controller
    is aware that the guest is not currently scheduled and maintains
    a virtual pending state (and can optionally signal the hypervisor
    that the guest should be scheduled ASAP).

    Are you using the word 'signal' as LINUX signal delivery, or as
    a proxy for interrupt of some form, or perhaps as an SVC to HV
    of some form ?

    Most of this is done completely by the hardware, without any
    intervention by the hypervisor for the vast majority of
    interrupts.

    That is the goal.


    It seems to me, that the HyperVisor would want to perform ISR
    processing of the interrupt (low latency) and then schedule
    the softIRQs to the <sleeping> core so when it regains control
    the pending I/O stack of "stuff" is properly cleaned up.

    So, should all interrupts simply go to the HyperVisor and let HV
    sort it all out? Or can the <sleeping> virtual core just deal
    with it when it is given a next time slice ??

    The original GIC did something like this (the HV took all
    interrupts and there was a hardware mechanism to inject them
    into a guest as if they were a hardware interrupt). But
    it was too much overhead going through the hypervisor, especially
    when the endpoint device supports the SRIOV capability. So the
    GIC supports handling virtual interrupt delivery completely
    in hardware unless the guest is not currently resident on any
    virtual CPU.

    Leave HV out of the loop unless something drastic happens.
    I/O completion and I/O aborts are not that drastic.

    Once again, I thank you greatly for your long and informative
    post.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 14:41:28 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    Said routing does not necessarily mean that the bits in
    the addresses are interpreted in any specific way, if the
    routing is done by a set of programmed range registers
    (i.e. range X to X+Y is routed to destination Z, e.g. the
    interrupt controller).

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 14:39:32 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:


    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    On Intel the IOMMU translation tables are not shared with the
    AP.

    I have seen in the past 3 days AP being used to point at a
    random device out on the PCIe tree and at the unprivileged
    application layer. Both ends of the spectrum. Which is your
    usage ?

    Sorry, hit send accidentally on the prior response.

    AP in our context is 'application processor', i.e. ARMv8 core.


    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    Each root complex needs to be a unique segment. A single
    SRIOV endpoint can consume the entire 8-bit bus space and
    the 8-bit dev/function space. In this context, a root complex
    can be considered a PCI express controller with one or more
    root ports. Each root port should be considered a unique
    'segment'.

    This is for device discovery, which uses the PCI express
    "Extended Configuration Access Method" (aka ECAM) to scan
    the PCI configuration spaces of all PCI ports.
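
    For reference, ECAM places each function's 4 KiB configuration space
    at a fixed offset from the segment's ECAM base (the base itself comes
    from the ACPI MCFG table or the device tree); a sketch:

    #include <stdint.h>

    static inline uint64_t ecam_cfg_addr(uint64_t ecam_base, unsigned bus,
                                         unsigned dev, unsigned fn,
                                         unsigned reg)
    {
        return ecam_base + ((uint64_t)bus << 20)   /* 256 buses              */
                         + ((uint64_t)dev << 15)   /* 32 devices per bus     */
                         + ((uint64_t)fn  << 12)   /* 8 functions per device */
                         + reg;                    /* config register offset */
    }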



    But (now with the new CXL) instead of allocating 200+ pins
    to DRAM those pins can be allocated to PCIe links; making any
    chip much less dependent on which DRAM technology, which chip-
    to-chip repeaters,... So, the thought is all I/O is PCIe + CXL;
    and about the only other pins chip gets are RESET and ClockIn.

    Note that bridging to PCI signalling will increase latency
    somewhat, even with PCIe gen 6.


    Bunches of these pins can be 'configured' into standard width
    PCIe links (at least until one runs out of pins.)

    Given that one has a PCIe root complex with around 256-pins
    available, does one need multiple roots of such a wide tree ?

    You basically need a root per device to accommodate SRIOV
    devices (like enterprise grade network adapters, high-end
    NVMe devices, etc).


    This allows each device capable of inbound DMA to identify
    themselves uniquely to the interrupt controller and IOMMU.

    Both intel and AMD use this convention.


    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    No, you're not allowed to do anything not explicitly allowed
    in the PCI express specification. Remember, an MSI-X write
    generated by the device is indistinguishable from any other
    upstream DMA request initiated by the device.

    Why did the PCI committee specify a 32-bit container and define the
    use of only 1 bit ?? Or are more bits defined but I just haven't
    run into any literature concerning those ?

    At the time that MSI and MSI-X were added to the PCI Local Bus
    specification (-before- PCI Express), the devices already had
    local masking - the MSI-X enable bit in the capability is used
    to switch between using legacy INTA/B/C/D and MSI-X so that a
    PCI card could work on systems that didn't support MSI-X.

    The function mask bit in the capability masks the entire function
    (all vectors). I've not seen that used in the real world, myself.

    The vector mask bits mask each individual vector.
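
    For reference, those bits live in the MSI-X capability's Message
    Control word; per the PCIe base spec they sit as follows (the
    constant names are illustrative):

    #define MSIX_MSG_CTRL_ENABLE         (1u << 15)  /* MSI-X enable (vs. legacy INTx) */
    #define MSIX_MSG_CTRL_FUNCTION_MASK  (1u << 14)  /* mask every vector at once      */
    #define MSIX_MSG_CTRL_TABLE_SIZE(mc) (((mc) & 0x7ffu) + 1)  /* number of entries   */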



    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    See again the document referenced above. The interrupt controller
    is aware that the guest is not currently scheduled and maintains
    a virtual pending state (and can optionally signal the hypervisor
    that the guest should be scheduled ASAP).

    Are you using the word 'signal' as LINUX signal delivery, or as
    a proxy for interrupt of some form, or perhaps as an SVC to HV
    of some form ?

    In the case of the ARM GIC, there is a defined processor private
    interrupt that is used to signal the hypervisor - this is what is
    used to 'signal' (not in the unix sense) the condition to the
    hypervisor. PPIs are also used for timer interrupts, statistical
    profiling interrupts, and a few others.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 14:28:58 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:


    Once the message leaves the device, is received by the
    root complex port and is forwarded across the host bridge
    to the system fabric, it's completely under control of
    the host. On x86, the TLP for the upstream message is
    received and forwarded to the specified address (which is
    the IOAPIC on Intel and the GIC ITS on Arm64).

    The interrupt controller may further mask the interrupt if
    desired or if the interrupt priority is lower than the
    current running priority.

    {note to self:: that is why it's a local APIC--it has to be close
    enough to see the core's priority.}

    Question:: Down below you talk of the various interrupt control-
    lers routing an interrupt <finally> to a core. What happens if the
    core has changed its priority by the time the interrupt signal
    arrives, but before it can change the state of the tables in the
    interrupt controller that routed said interrupt here ?

    Speaking for the ARM64 systems that I'm most recently
    familiar with, the concept of priority is associated with
    an interrupt (up to 8-bits worth of priority - an implementation
    of the GIC is allowed to support as few as three bits).

    The interrupt controller is distributed logic; there is a
    component called the 'distributor' and another component called
    the 'redistributor'. The former is global to the system and
    the latter is a per-CPU component. The distributor also contains
    a subsystem called the interrupt translation subsystem (ITS) which
    supports interrupt virtualization.

    The redistributor, being part of the core, handles the delivery
    of an interrupt to the core (specifically asserting either the FIQ
    or IRQ signals that cause entry to the IRQ or FIQ exception
    handlers). The redistributor tracks the current running priority
    (which is directly associated with the priority of the current
    active interrupt; when not processing an interrupt, the current
    running priority is called the IDLE priority and doesn't block
    delivery of any interrupts). The redistributor communicates changes to
    the RPR to the distributor, which will hold any interrupt that
    is not eligible for delivery (for any reason, including lack
    of priority). There is no way for software to change the
    RPR - it only tracks the priority of the currently executing
    interrupt.

    +-------------------------+
    |       PCI Device        |
    +-------------------------+
                |   MSI-X message (address: GITS_TRANSLATER control register)
                |                 (payload: Interrupt number (0 to N))
                v                 (sideband: streamid)
    +-------------------------+
    | Interrupt Translation   |  (DRAM tables: Device, Collection)
    | Service (ITS)           |  Lookup streamid in device table.
    |                         |  DT refers to Interrupt Translation Table.
    |                         |  Translate inbound payload based on ITT to an LPI.
    |                         |  Collection table identifies target core.
    +-------------------------+
                |   Internal message from ITS to redistributor for target
                v
    +-------------------------+
    | Redistributor           |  (DRAM table: LPI properties)
    |                         |  Lookup LPI properties, contains priority and enable bit.
    |                         |  If not enabled or priority too low,
    |                         |  store in LPI pending table (also DRAM) [*]
    |                         |  If enabled, unmasked at the CPU interface,
    |                         |  and priority higher than RPR, assert FIQ
    |                         |  or IRQ signals to core.
    +-------------------------+
                |   IRQ/FIQ signals
                v
    +-------------------------+
    | Core                    |  Check PSTATE IRQ and FIQ mask bits.
    |                         |  IRQ/FIQ can be routed to EL1, EL2 or EL3
    |                         |  by per-core control bits. Update
    |                         |  core state appropriately and enter ISR.
    +-------------------------+

    [*]  as core RPR and signal masks change, the ITS re-evaluates pending
    [**] LPI properties and pending bits are generally cached in the
         redistributor for performance.
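
    A minimal sketch of the redistributor's decision in the flow above,
    assuming the GIC convention that a numerically lower priority value
    is more urgent; the structure and field names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    struct lpi_props { bool enabled; uint8_t priority; };  /* DRAM properties table */

    static bool redistributor_deliver(const struct lpi_props *p,
                                      uint8_t running_priority, /* RPR; IDLE blocks nothing */
                                      bool cpu_if_unmasked)
    {
        if (!p->enabled || !cpu_if_unmasked)
            return false;                       /* stays in the LPI pending table    */
        return p->priority < running_priority;  /* lower value means higher priority */
    }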



    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, the device sends the stored interrupt message to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI)
    controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    I see (below) that you (they) migrated all the stuff I thought might
    be either in the address or data to the "other side" of HostBridge.
    Fair enough.

    For what reason are there multiple addresses?

    A system may have multiple interrupt controllers. In the
    case of the ARM64 systems, there may be a case where some
    interrupts should be considered level sensitive, in which
    case they must use SPI type interrupts which have a different
    target register for the MSI-X address field when compared with
    LPI type interrupts.

    Recall that the PCI spec must accommodate a wide range of system
    implementations (including Z-series).



    From the standpoint of the PCIexpress root port, the upstream write
    generated by the device to send the MSI message to the host
    looks just like any other inbound DMA from the device to the
    host. It is the responsibility of the host bridge and interconnect to
    route the message to the appropriate destination (which generally
    is an interrupt controller, but just as legally could be a
    DRAM address which software polls periodically).

    So the message arriving at the top of the PCIe tree is raw; then
    the address gets translated by the I/O MMU, and both the translated
    address and the raw data are passed forward to their fate.

    Basically, yes. The 'root complex port' is the interface
    between the host bridge and the endpoint device. A system
    may have a configuration option where the upstream message
    from the root complex can bypass the IOMMU as well (e.g.
    for firmware controlled devices - think SMM).



    What I am trying to do is to figure out a means to route the
    message to a virtual core's interrupt table such that:: if that
    virtual core happens to be running on any physical core, that
    the physical core sees the interrupt without delay, and if
    the virtual core is not running, the event is properly logged
    so when the virtual core runs on a physical core that those
    ISRs are performed before any lower priority work is performed.

    That's exactly what the redistributor does in the ARM GIC.

    It's probably worth reading that document - it would take
    a considerable amount of typing for me to summarize the
    GICv4.x virtualization features :-).


    {and make this work for any number of physical cores and any
    number of virtual cores; where cores can share interrupt
    tables. For example, Guest OS[k] thinks that it has 13 cores
    and shares its interrupt table across 5 of them, but HyperVisor
    remains free to time slice Guest OS[k] cores any way it likes.}

    The arm gic supports all this.



    (*) memory has the property that a read will return the last
    bit pattern written, a service port does not.

    I assume that service port addresses map to different cores
    (or local APICs of a core).

    The IOAPIC handles the message and has configuration registers
    that determine which lAPIC should be signalled.

    The GIC has configuration tables in memory that can remap
    the interrupt to a different vector (e.g. for a guest VM).

    GIC = Global Interrupt Controller ?

    Generic, I believe.


    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    On Intel the IOMMU translation tables are not shared with the
    AP.

    I have seen in the past 3 days AP being used to point at a
    random device out on the PCIe tree and at the unprivileged
    application layer. Both ends of the spectrum. Which is your
    usage ?

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    But (now with the new CXL) instead of allocating 200+ pins
    to DRAM those pins can be allocated to PCIe links; making any
    chip much less dependent on which DRAM technology, which chip-
    to-chip repeaters,... So, the thought is all I/O is PCIe + CXL;
    and about the only other pins chip gets are RESET and ClockIn.

    Bunches of these pins can be 'configured' into standard width
    PCIe links (at least until one runs out of pins.)

    Given that one has a PCIe root complex with around 256-pins
    available, does one need multiple roots of such a wide tree ?

    This allows each device capable of inbound DMA to identify
    themselves uniquely to the interrupt controller and IOMMU.

    Both intel and AMD use this convention.


    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    No, you're not allowed to do anything not explicitly allowed
    in the PCI express specification. Remember, an MSI-X write
    generated by the device is indistinguishable from any other
    upstream DMA request initiated by the device.

    Why did the PCI committee specify a 32-bit container and define the
    use of only 1 bit ?? Or are more bits defined but I just haven't
    run into any literature concerning those ?


    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    See again the document referenced above. The interrupt controller
    is aware that the guest is not currently scheduled and maintains
    a virtual pending state (and can optionally signal the hypervisor
    that the guest should be scheduled ASAP).

    Are you using the word 'signal' as LINUX signal delivery, or as
    a proxy for interrupt of some form, or perhaps as an SVC to HV
    of some form ?

    Most of this is done completely by the hardware, without any
    intervention by the hypervisor for the vast majority of
    interrupts.

    That is the goal.


    It seems to me, that the HyperVisor would want to perform ISR
    processing of the interrupt (low latency) and then schedule
    the softIRQs to the <sleeping> core so when it regains control
    the pending I/O stack of "stuff" is properly cleaned up.

    So, should all interrupts simply go to the HyperVisor and let HV
    sort it all out? Or can the <sleeping> virtual core just deal
    with it when it is given a next time slice ??

    The original GIC did something like this (the HV took all
    interrupts and there was a hardware mechanism to inject them
    into a guest as if they were a hardware interrupt). But
    it was too much overhead going through the hypervisor, especially
    when the endpoint device supports the SRIOV capability. So the
    GIC supports handling virtual interrupt delivery completely
    in hardware unless the guest is not currently resident on any
    virtual CPU.

    Leave HV out of the loop unless something drastic happens.
    I/O completion and I/O aborts are not that drastic.

    Once again, I thank you greatly for your long and informative
    post.

  • From MitchAlsup1@21:1/5 to All on Sat Jun 22 19:31:20 2024
    Scott Lurndal wrote:

    Again allow me to express my gratitude for the quality of your posts !

    A couple of dumb questions to illustrate how much more I need to
    learn::

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.
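 
    As a concrete reading of that layout, a minimal C sketch (assuming a
    16-bit segment field, which the list leaves open as <xx:16>) would
    pack and unpack a stream id like this:

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal sketch of the stream id layout listed above; the 16-bit
       segment width is an assumption, the list leaves it as <xx:16>. */
    static uint32_t streamid_pack(uint32_t seg, uint32_t bus,
                                  uint32_t dev, uint32_t fn)
    {
        return (seg << 16) | (bus << 8) | (dev << 3) | (fn & 0x7);
    }

    int main(void)
    {
        uint32_t sid = streamid_pack(1, 3, 0, 2);
        printf("segment %u bus %u dev %u fn %u\n",
               sid >> 16, (sid >> 8) & 0xff, (sid >> 3) & 0x1f, sid & 0x7);
        return 0;
    }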

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    Each root complex needs to be an unique segment. A single
    SRIOV endpoint can consume the entire 8-bit bus space and
    the 8-bit dev/function space. In this context, a root complex
    can be considered a PCI express controller with one or more
    root ports. Each root port should be considered an unique
    'segment'.

    This is for device discovery, which uses the PCI express
    "Extended Configuration Access Method" (aka ECAM) to scan
    the PCI configuration spaces of all PCI ports.

    Within a 'Chip' there are k cores, 1 last level cache, and
    1 HostBridge with (say) 256 pins at its disposal. Said
    pins can be handed out in power-of-2 groups of 4 pins each,
    so multiple PCIe trees of differing widths emanate from
    the 256 PCIe pins.

    I guess you are calling each point of emanation a root.
    I just bundle them under 1 HostBridge, and consider how
    the "handing out" is done to be a HostBridge problem.
    But as seen on the on-chip interconnect there is one
    HostBridge which accesses all devices attached to this
    Chip. Basically, I see on-chip-interconnect with one
    HostBridge knowing that the pins will be allocated
    "efficiently" for the attached devices.

    Thanks for the ECAM pointer, that clears up a raft of
    questions.



    But (now with the new CXL) instead of allocating 200+ pins
    to DRAM those pins can be allocated to PCIe links; making any
    chip much less dependent on which DRAM technology, which chip-
    to-chip repeaters,... So, the thought is all I/O is PCIe + CXL;
    and about the only other pins chip gets are RESET and ClockIn.

    Note that bridging to PCI signalling will increase latency
    somewhat, even with PCIe gen 6.

    Unavoidable.


    Bunches of these pins can be 'configured' into standard width
    PCIe links (at least until one runs out of pins.)

    Given that one has a PCIe root complex with around 256-pins
    available, does one need multiple roots of such a wide tree ?

    You basically need a root per device to accommodate SRIOV
    devices (like enterprise grade network adapters, high-end
    NVMe devices, etc).

    As noted above: I knew more bits than B:D,F were needed,
    but not which and where. And if a single SR-IOV device
    consumes a whole B:D,F space so be it. ECAM alignment
    identifies those bits and the routings.


    I guess reading my post backwards I did not pose any questions.

    My thanks again.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jun 22 20:15:45 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:


    Question:: Down below you talk of the various interrupt control-
    lers routing an interrupt <finally> to a core. What happens if the
    core has changed its priority by the time the interrupt signal
    arrives, but before it can change the state of the tables in the
    interrupt controller that routed said interrupt here ?

    Speaking for the ARM64 systems that I'm most recently
    familiar with, the concept of priority is associated with
    an interrupt (up to 8-bits worth of priority - an implementation
    of the GIC is allowed to support as few as three bits).

    The interrupt controller is distributed logic; there is a
    component called the 'distributor' and another component called
    the 'redistributor'. The former is global to the system and
    the latter is a per-CPU component. The distributor also contains
    a subsystem called the interrupt translation subsystem (ITS) which
    supports interrupt virtualization.

    The redistributor, being part of the core, handles the delivery
    of an interrupt to the core (specifically asserting either the FIQ
    or IRQ signals that cause entry to the IRQ or FIQ exception
    handlers). The redistributor tracks the current running priority
    (which is directly associated with the priority of the current
    active interrupt; when not processing an interrupt, the current
    running priority is called the IDLE priority and doesn't block
    delivery of any interrupts). The redistributor communicates changes to
    the RPR to the distributor, which will hold any interrupt that
    is not eligible for delivery (for any reason, including lack
    of priority). There is no way for software to change the
    RPR - it only tracks the priority of the currently executing
    interrupt.
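 
    A toy C predicate captures the eligibility rule just described; note
    that in the GIC, numerically lower priority values are more urgent,
    and the idle running priority is the numerically lowest (0xff with 8
    priority bits). This is only a sketch of the rule, not of the actual
    redistributor logic.

    #include <stdbool.h>
    #include <stdint.h>

    #define IDLE_PRIORITY 0xff   /* RPR when no interrupt is active */

    /* Sketch of the delivery rule above: enabled, not masked at the CPU
       interface, and of higher (numerically lower) priority than the
       current running priority. */
    bool deliverable(uint8_t irq_priority, bool enabled, bool masked,
                     uint8_t running_priority)
    {
        return enabled && !masked && irq_priority < running_priority;
    }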

    Thank you for the wonderful ASCII art::

    +-------------------------+
    |       PCI Device        |
    +-------------------------+
       | MSI-X message (address: GITS_TRANSLATER control register)
       |               (payload: Interrupt number (0 to N))
       v               (sideband: streamid)
    +-------------------------+
    | Interrupt Translation   |  (DRAM tables: Device, Collection)
    | Service                 |  Lookup streamid in device table.
    |                         |  DT refers to Interrupt Translation Table
    |                         |  Translate inbound payload based on ITT
    |                         |  to an LPI
    |                         |  Collection table identifies target core
    +-------------------------+
       | Internal message from ITS to redistributor for target
       v
    +-------------------------+
    | Redistributor           |  (DRAM table: LPI properties)
    |                         |  Lookup LPI properties, contains priority
    |                         |  and enable bit
    |                         |  If not enabled or priority too low,
    |                         |  store in LPI pending table (also DRAM) [*]
    |                         |  If enabled, unmasked at the CPU interface
    |                         |  and priority higher than RPR, assert FIQ
    |                         |  or IRQ signals to core.
    +-------------------------+
       | IRQ/FIQ signals
       v
    +-------------------------+
    | Core                    |  Check PSTATE IRQ and FIQ mask bits
    |                         |  IRQ/FIQ can be routed to EL1, EL2 or EL3
    |                         |  by per-core control bits. Update
    |                         |  core state appropriately and enter ISR
    +-------------------------+

    [*]  as core RPR and signal masks change, the ITS re-evaluates pending
    [**] LPI properties and pending bits are generally cached in the
         redistributor for performance.

    My concept, based on a new understanding of where things
    want to be due to CXL, looks like::

    I am assuming that the Last Level Cache (LLC) is placed side by side
    with HostBridge on Chip. This facilitates using PCIe for DRAM access
    and for CXL caches. LLC also provides the access services to the
    I/O MMU (with lots of caching) and maintains the interrupt tables.

    +-------------------------+
    |       PCI Device        |
    +-------------------------+
       | MSI-X message (address: GITS_TRANSLATER control register)
       |               (payload: Interrupt number (0 to N))
       v               (sideband: streamid)
    +-------------------------+
    | HostBridge Translation  |  IOMMU tables: B:D,F->Originating Context
    | Service                 |  Originating Context supplies Root pointers
    |                         |  and interrupt table address
    | In LLC                  |  HostBridge DRAM accesses are performed
    |                         |  through LLC
    |                         |  HostBridge MMI/O accesses routed out
    |                         |  into Chip
    +-------------------------+
       | MMI/O message from HTS to virtual context stack DRAM address
       |
       | If core interrupt table matches MMI/O address ? SNARF message
       |    the message contains pending priority interrupt bits.
       v
    +-------------------------+
    | Core                    |  If there is an interrupt at higher priority
    |                         |  than I am currently running ? begin interrupt
    |                         |  negotiation (core continues to run
    |                         |  instructions)
    |                         |  If negotiation is successful ? Claim interrupt
    |                         |  and context switch to Interrupt Dispatcher.
    +-------------------------+

    There is no IRQ-like signal to the core; it is all done by a SNARF of
    data to an address cores are watching. When a virtual core gets a new
    time slice, as the core is fetching instructions, it also fetches its
    pending priority interrupts from its interrupt table (maintained by
    LLC), and will "take" a higher pending interrupt prior to executing any
    instructions at lower priority or lower privilege. Thereafter, the core
    monitors its interrupt table address to SNARF updates.

    Context stack contains pointers to the Thread headers of the 4
    privilege levels, a pointer to the associated interrupt table,
    and some other stuff--it is a cache line in size (8 DoubleWords).

    The pointers to the Thread Headers give access to the Root pointers
    of those levels.

    There is a 2-bit indicator in the context stack indicating which
    Root Pointer is used to translate this I/O request.
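 
    A hypothetical C rendering of that cache line, purely to make the
    description concrete (the field names and packing are invented here,
    not taken from any My 66000 document):

    #include <stdint.h>

    /* Hypothetical layout of one context-stack line (8 DoubleWords);
       names and packing are illustrative only. */
    struct context_stack_line {
        uint64_t thread_header[4];   /* one per privilege level         */
        uint64_t interrupt_table;    /* pointer to the interrupt table  */
        uint64_t flags;              /* bits <1:0>: which Root pointer
                                        translates this I/O request     */
        uint64_t other[2];           /* "some other stuff"              */
    };

    _Static_assert(sizeof(struct context_stack_line) == 64,
                   "one cache line (8 DoubleWords)");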

    All indexed via B:D,F (extended or not) and some rather static tables
    placed in unCacheable DRAM. unCacheable DRAM is actually cached in
    LLC, just not anywhere more local to the cores. LLC, in essence, SNARFs
    the HostBridge MSI-X message, recognizes that this is an update to
    the interrupt tables, inserts the update, and then provides a message
    which cores running that interrupt table will SNARF.

    No wires (IRQ), just std messages flying across MMI/O space
    doing exactly the same things.


    For what reason are there multiple addresses ?

    A system may have multiple interrupt controllers. In the
    case of the ARM64 systems, there may be a case where some
    interrupts should be considered level sensitive, in which
    case they must use SPI type interrupts which have a different
    target register for the MSI-X address field when compared with
    LPI type interrupts.

    Recall that the PCI spec must accommodate a wide range of system
    implementations (including Z-series).

    Would you consider that "multiple interrupt tables all being
    maintained by a single service port inside LLC which then
    spews out updates any/all cores can see" to be multiple
    interrupt controllers ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jun 22 20:18:07 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    Said routing does not necessarily mean that the bits in
    the addresses are interpreted in any specific way, if the
    routing is done by a set of programmed range registers
    (i.e. range X to X+Y is routed to destination Z,e.g. the
    interrupt controller).



    I was asking the contrapositive::

    Is a system architecture allowed to define certain bits of
    the translated address to be used as either routing or
    indexing of a table that provides routing information ?

    Not as seen by request originator or request target, but
    by the middle-men of transport ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 20:21:45 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Again allow me to express my gratitude for the quality of your posts !

    A couple of dumb questions to illustrate how much more I need to
    learn::

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    Each root complex needs to be an unique segment. A single
    SRIOV endpoint can consume the entire 8-bit bus space and
    the 8-bit dev/function space. In this context, a root complex
    can be considered a PCI express controller with one or more
    root ports. Each root port should be considered an unique
    'segment'.

    This is for device discovery, which uses the PCI express
    "Extended Configuration Access Method" (aka ECAM) to scan
    the PCI configuration spaces of all PCI ports.

    Within a 'Chip' there are k cores, 1 last level cache, and
    1 HostBridge with (say) 256 pins at its disposal. Said
    pins can be handed out in power-of-2 groups of 4 pins each,
    so multiple PCIe trees of differing widths emanate from
    the 256 PCIe pins.

    Let us start by considering a PCI express device. Electrically
    it is connected to a PCI Express controller instance. The
    controller is responsible for the transport layer, link
    layer and other portions of the PCI express protocols;
    including translating PCI Express transaction layer
    packets (TLPs) into host bus transactions which are
    bridged to the SoC fabric (xbar, ring, mesh, et alia).

    An instance of a PCI express controller is called a
    Root Complex, and supports one or more Root Ports.
    Each root port is electrically connected to an endpoint
    or to a mainboard slot into which an endpoint can
    be inserted. The controller manages link training
    between the root port and the device (primarily for
    plug-in devices) and provides interfaces to the three
    PCI address spaces:

    * Configuration (4096 bytes per function)
    * Memory (2^64 maximum size)
    * I/O (64KB - legacy for Intel IN/OUT instructions)

    The PCI I/O address space is deprecated and not used on modern PCI
    Express devices. However, a PCI controller is allowed to present
    the endpoint I/O space as a region mapped into the physical
    address space of the host; the PCI controller will convert
    accesses to those physical addresses to IO space TLPs
    when posting downstream transactions to the IO space for
    legacy PCI cards.

    The PCI memory space is a 32 or 64-bit address space decoupled
    from the host address space (although it is often mapped 1:1 with
    the host address space, it isn't required to be if the PCI
    controller instance has the ability to remap the address when
    creating the downstream TLP).

    The PCI configuration space contains control, status and
    discovery registers that define the device. The first
    four bytes of the PCI configuration space contain a 16-bit
    VENDORID and a 16-bit DEVICEID field. These are read
    by the operating system and used to select (and load if
    necessary) the driver that handles that type of device.

    The PCI configuration space also contains base address
    registers (BARs) which describe the amount of address
    space that the function consumes. The host programs
    a base address into the BAR registers during initialization
    (which on intel will be the same physical address range
    in the host physical address space). This dates back to
    the bus-based legacy PCI where all the functions on the
    bus would see the transaction and needed to capture
    that transaction (by matching the BAR register(s)).

    With the point-to-point nature of PCI Express, the
    BARs are primarily used for sizing the aperture and
    the values written may or may not correspond to the
    host physical address mapped to the aperture (this
    mapping is generally implementation specific).

    To size a bar, write all-ones to the bar register(s)
    and read the value back. Unimplemented bits will read as
    zero. Invert the value, add one, and you have the
    required size of the aperture for that BAR.
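 
    That sequence, in a stand-alone C sketch; pci_cfg_read32() and
    pci_cfg_write32() here are fakes backed by a single pretend 16 KB
    memory BAR so the example runs by itself, and the low attribute bits
    of the BAR are masked off before inverting (a detail glossed over
    above).

    #include <stdint.h>
    #include <stdio.h>

    /* Fake 16 KB memory BAR standing in for real config space so the
       sketch runs stand-alone; a real implementation would go through
       CF8/CFC or ECAM. Bits [13:4] are unimplemented and read as zero. */
    static uint32_t fake_bar;

    static uint32_t pci_cfg_read32(uint16_t off)  { (void)off; return fake_bar; }
    static void pci_cfg_write32(uint16_t off, uint32_t v)
    {
        (void)off;
        fake_bar = v & ~0x3fffu;
    }

    /* Size a 32-bit memory BAR: save, write all-ones, read back, mask
       the low attribute bits, invert, add one, restore. */
    static uint32_t pci_bar_size(uint16_t bar_off)
    {
        uint32_t saved = pci_cfg_read32(bar_off);
        pci_cfg_write32(bar_off, 0xffffffffu);
        uint32_t mask = pci_cfg_read32(bar_off) & ~0xfu;  /* memory BAR */
        pci_cfg_write32(bar_off, saved);                  /* restore    */
        return ~mask + 1u;
    }

    int main(void)
    {
        printf("aperture size %u bytes\n", pci_bar_size(0x10)); /* 16384 */
        return 0;
    }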

    The configuration space also contains two lists that
    describe optional capabilities of a function. There
    is a list of legacy PCI capabilities (MSI and MSI-X
    fall into this bucket, as does the PCI Express
    capability which marks the device as PCIe rather than
    legacy PCI). For legacy PCI, the configuration space
    was 256 bytes (and the legacy capabilities all reside
    there). PCIe extended it to 4096 bytes and there is
    a 16-bit pointer at offset 0x100 that is the head of
    the list of PCI express capabilities (which include
    link training stuff, SR-IOV, error reporting capabilities,
    power management, etc).

    The MSI-X capability includes a couple of registers that
    locate the MSI-X Vector and Pending arrays - these have
    a 3-bit BAR indicator that selects which BAR holds
    the MSI-X registers, and an offset value is applied
    to the BAR to get to the first vector for the vector
    array and the first bit for the PB array.
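 
    In code, locating the two arrays looks roughly like the sketch below.
    The register layout inside the MSI-X capability (Table Offset/BIR at
    capability+4, PBA Offset/BIR at capability+8, low 3 bits holding the
    BIR) is per the PCI spec; the config read helper and bar_base[]
    mapping are made-up stand-ins so the example runs on its own.

    #include <stdint.h>
    #include <stdio.h>

    /* Made-up stand-ins: CPU physical addresses the BARs were mapped to,
       and a canned config-space reader for one device. */
    static uint64_t bar_base[6] = { 0xf0000000ull, 0, 0, 0, 0, 0 };

    static uint32_t pci_cfg_read32(uint16_t off)
    {
        if (off == 0x44) return 0x00002000u; /* Table Offset/BIR: BAR0 + 0x2000 */
        if (off == 0x48) return 0x00003000u; /* PBA   Offset/BIR: BAR0 + 0x3000 */
        return 0;
    }

    /* 'cap' is the config-space offset of the MSI-X capability. */
    static void msix_locate(uint16_t cap, uint64_t *table, uint64_t *pba)
    {
        uint32_t t = pci_cfg_read32(cap + 4);
        uint32_t p = pci_cfg_read32(cap + 8);
        *table = bar_base[t & 0x7] + (t & ~0x7u);
        *pba   = bar_base[p & 0x7] + (p & ~0x7u);
    }

    int main(void)
    {
        uint64_t tbl, pba;
        msix_locate(0x44, &tbl, &pba);
        printf("vector table at %#llx, PBA at %#llx\n",
               (unsigned long long)tbl, (unsigned long long)pba);
        return 0;
    }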

    The configuration space is not directly accessible to
    the host - in legacy PCI on Intel systems, the
    southbridge had two registers in the IO space (cf8 and cfc)
    that functioned as a peek/poke mechanism to access
    the configuration space. PCI Express defined a mechanism
    that allows a host to map the PCI configuration space
    into the physical address space directly (called ECAM
    and referred to earlier). The CF8/CFC mechanism
    uses the device address (bus, device, function, aka
    requester id) programmed by software into CF8 along
    with the offset within the 4k space and then reads/writes
    CFC to access the data at that address. For ECAM
    accesses, there is a base address in the host physical
    address space that maps the entire 'root port' configuration
    space, addressed as (bus << 20) | (dev << 15) | (func << 12) |
    config-offset from the base address.
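 
    That ECAM arithmetic is just a shift-and-or, as in this small C
    function; the base address is whatever the platform publishes for
    that segment (on many systems via the ACPI MCFG table), and
    0xe0000000 below is only an example value.

    #include <stdint.h>
    #include <stdio.h>

    /* ECAM: base + (bus << 20) + (dev << 15) + (func << 12) + offset */
    static uint64_t ecam_addr(uint64_t base, uint32_t bus, uint32_t dev,
                              uint32_t fn, uint32_t off)
    {
        return base + ((uint64_t)bus << 20) + (dev << 15) + (fn << 12) + off;
    }

    int main(void)
    {
        /* bus 1, device 0, function 0, VENDORID/DEVICEID at offset 0 */
        printf("%#llx\n",
               (unsigned long long)ecam_addr(0xe0000000ull, 1, 0, 0, 0));
        return 0;
    }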

    The discovery process starts with software reading the
    first 32-bits of each 4K region - on legacy bus-based
    PCI, the read would timeout and the controller would
    abort it (called master abort) and return a value of
    all ones as the result of the read. PCIe requires
    the same behavior.

    If software reads 0xffffffff for the first 32-bits, it
    adds 4k to the address and tries the next function.

    The PCI function field in the RID is 3 bits, so a device
    can support up to 8 functions. While legacy PCI supported
    32 devices on a bus, PCI Express limits the device number
    of an endpoint to zero, so downstream from the root port
    a given bus will generally have no more than 8 functions.

    The discovery process continues until all 256 buses
    below the root port have been scanned. Note that the
    root port contains a PCI-to-PCI bridge, which may have
    integrated endpoints (RCiep) provided by the root complex;
    which show up as devices or functions on bus 0. The
    bridge forwards transactions downstream to bus 1 (usually,
    but the bus numbers are programmable) which contains the
    endpoint device.
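 
    Put together, the scan described above is a triple loop over
    bus/device/function, treating an all-ones read as "nothing there".
    The ecam_read32() stub below just pretends a single function exists
    at 00:00.0 so the sketch runs stand-alone; a real scan would also
    follow capability lists, size BARs, and program bridge bus numbers.

    #include <stdint.h>
    #include <stdio.h>

    /* Stub for the ECAM access described earlier: pretend only 00:00.0
       exists, with made-up vendor/device ids; everything else reads as
       all-ones (master abort / unsupported request). */
    static uint32_t ecam_read32(uint32_t bus, uint32_t dev, uint32_t fn,
                                uint32_t off)
    {
        if (bus == 0 && dev == 0 && fn == 0 && off == 0)
            return (0x9999u << 16) | 0x1234u;
        return 0xffffffffu;
    }

    int main(void)
    {
        for (uint32_t bus = 0; bus < 256; bus++)
            for (uint32_t dev = 0; dev < 32; dev++)
                for (uint32_t fn = 0; fn < 8; fn++) {
                    uint32_t id = ecam_read32(bus, dev, fn, 0);
                    if (id == 0xffffffffu)
                        continue;           /* nothing at this B:D,F */
                    printf("%02x:%02x.%x vendor %04x device %04x\n",
                           bus, dev, fn, id & 0xffff, id >> 16);
                }
        return 0;
    }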

    During discovery, if a device advertises the PCI Express
    SRIOV (Single Root I/O Virtualization) capability, the
    device driver needs to configure the SRIOV functionality
    including the number of virtual functions exposed by
    the device (each being assigned to a guest). SRIOV
    supports up to 65535 virtual functions, which consumes
    the entire 256-bus space on that root port.


    I guess you are calling each point of emanation a root.

    More specifically the Root Complex is the PCI express
    controller (e.g. Synopsys has PCIe controller IP). The
    Root _Port_ is the physical connection to the endpoint
    (or to a PCIe switch, but let's not go there now).

    Often there is only one Root Port per Root Complex,
    but the specification allows for multiple ports.

    I just bundle them under 1 HostBridge, and consider how
    the "handing out" is done to be a HostBridge problem.
    But as seen on the on-chip interconnect there is one
    HostBridge which accesses all devices attached to this
    Chip. Basically, I see on-chip-interconnect with one
    HostBridge knowing that the pins will be allocated
    "efficiently" for the attached devices.


    You basically need a root per device to accommodate SRIOV
    devices (like enterprise grade network adapters, high-end
    NVMe devices, etc).

    As noted above: I knew more bits than B:D,F were needed,
    but not which and where. And if a single SR-IOV device
    consumes a whole B:D,F space so be it. ECAM alignment
    identifies those bits and the routings.

    Ah, I should have read this one backwards :-)



    I guess reading my post backwards I did not pose any questions.

    My thanks again.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 21:29:09 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    Said routing does not necessarily mean that the bits in
    the addresses are interpreted in any specific way, if the
    routing is done by a set of programmed range registers
    (i.e. range X to X+Y is routed to destination Z,e.g. the
    interrupt controller).



    I was asking the contrapositive::

    Is a system architecture allowed to define certain bits of
    the translated address to be used as either routing or
    indexing of a table that provides routing information.

    Not as seen by request originator or request target, but
    by the middle-men of transport ?

    From the standpoint of the PCI specification, the host
    side is completely unspecified. You could, for example,
    use bits <63:60> to specify the socket, or chiplet that
    the address should be routed to. Other bits may encode
    the PCI controller #, interrupt controller, IOMMU, etc.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Sat Jun 22 22:14:17 2024
    Scott Lurndal wrote:

    Snipping a whole lot of information I had a basic knowledge
    of but that was not part of the original tree of questions being asked.

    But thanks for the details--they help a lot.

    You basically need a root per device to accommodate SRIOV
    devices (like enterprise grade network adapters, high-end
    NVMe devices, etc).

    As noted above: I knew more bits than B:D,F were needed,
    but not which and where.

    Or even the name of what I am searching Google for.....
    ECAM for example.

    And if a single SR-IOV device
    consumes a whole B:D,F space so be it. ECAM alignment
    identifies those bits and the routings.

    Ah, I should have read this one backwards :-)

    You know, sometimes when reading and writing these posts,
    what I need to write changes with my knowledge base and
    some of the earlier writings become stale wrt what I now
    grasp.

    My thanks again.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jun 22 22:46:51 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    Said routing does not necessarily mean that the bits in
    the addresses are interpreted in any specific way, if the
    routing is done by a set of programmed range registers
    (i.e. range X to X+Y is routed to destination Z,e.g. the
    interrupt controller).



    I was asking the contrapositive::

    Is a system architecture allowed to define certain bits of
    the translated address to be used as either routing or
    indexing of a table that provides routing information.

    Not as seen by request originator or request target, but
    by the middle-men of transport ?

    From the standpoint of the PCI specification, the host
    side is completely unspecified. You could, for example,
    use bits <63:60> to specify the socket, or chiplet that
    the address should be routed to. Other bits may encode
    the PCI controller #, interrupt controller, IOMMU, etc.

    Yes, I have information contained in PTEs that convert a std
    LD or ST into a KNOWN configuration space access or Memory
    mapped I/O space access--known inside the core*, and understood
    by the on-die-interconnect to route said request to the addressed
    device (or at least HostBridge where it, then, figures out
    where the device is after it has been configured.) During
    this "figuring out" if bits have to be moved about--well that
    is a typical HW problem that HW has various mostly cheap
    solutions for. {{Like concatenating the fields of an x86
    segment register into a linear address.}}

    At the time of this discussion, I am working out how all
    the middlemen between cores and (effectively endpoints)
    use the available bits to send stuff where it needs to be sent.
    I know these will not be like those of other architectures
    a) because I have no legacy to match
    b) because I am trying out "new stuff"
    c) there is no concept of wires (INTx); all of these have
    been mapped into messages in MMI/O space.

    New ways of connecting the dots that should be enough like
    what other guys are doing that Linux porting is no harder
    than necessary, but novel enough to require "even fewer"
    excursions through the HyperVisor to get the dots connected
    and maintain those connections.

    (*) configuration space accesses are known within the core
    because they follow strong ordering, while memory mapped I/O
    accesses are known within the core because they are sequen-
    tially consistent, unlike DRAM accesses which are only
    cache consistent (except when ATOMIC stuff is going on
    where the core drops back to sequentially consistent.)
    This knowledge of which space enables higher performing
    memory systems that automagically drop back to SC when
    it matters and without Fence instructions being needed.

    Since the core knows the space of the access, so does the
    interconnect and orders things appropriately. From the core
    end of looking at things, a configuration request is
    properly ordered and delivered reliably to the endpoint;
    A MMI/O request is properly ordered on the interconnect
    and reliably delivered to endpoint. Likewise, endpoint
    requests are reliably delivered to MMI/O or DRAM address
    spaces. {{given a "special" device as some endpoint, and
    with the already defined facilities, said device could
    go out and read/write core control registers without
    the data ever passing through memory (security stuff).
    Of course one would want said device to be very secure
    indeed in order to trust it that far....but electrically
    is has to be at some "normal place" accessible via normal
    protocol and transports.}}

    It may seem that I have dumped a lot of requirements on
    the last level cache. This may be true--but with the
    advent of CXL, DRAM may migrate farther away from the
    cores to the point where no pins on the chip are dedicated
    to DRAM control, instead a PCIe channel to a DRAM control-
    ler down the PCIe tree(s) allows the Chip to connect to
    any DRAM technology (DDR4,5,6, HBM, RamBus, ...) any size
    of DRAM, any position of DRAM,... just by changing the
    popcorn part at the end of the tree.

    Thus, LLC has to be able to read/write that DRAM and
    CXL caches and then cache its results as is normally
    expected. CXL caches extends this to SRAM in addition
    to DRAM--it's just a different popcorn part.

    {{There is also no need to build chip-repeaters
    like HyperTransport or whatever Intel calls their similar
    chip-to-chip transport. These will migrate to CXL for
    all the right reasons................................}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Mon Jun 24 14:50:34 2024
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, device sends a stored interrupt messages to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI) controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    Note that support for MSI in AHCI is optional (in which case the
    legacy level sensitive PCI INTA/B/C/D signals are used).

    The AHCI standard specification (ahci_1_3.pdf) is available publicly.

    What happens to SATA tagged command queuing with SRIOV?
    The tag mapping would seem to interact with virtual interrupts.

    SATA allows up to I think 8 commands queued at once, each with its own
    tag number, which can be performed in any order. That tag indicates
    which DMA mapping scatter gather set to use and is used to identify
    which IO's are complete. A single interrupt can indicate multiple
    tags are complete.

    In native (non-virtualized) use the device driver assigns a free tag
    number to an IO, sets up the DMA scatter/gather list for that tag,
    and on completion interrupt, for each done tag it tears down the
    DMA scatter/gather list, frees that tag number,
    and completes the associated individual IO.
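 
    In outline, that native flow is just a free-tag bitmap plus a per-tag
    owner table, roughly as below; the names are invented for illustration
    (a real AHCI driver would be talking to PxCI/PxSACT and building
    command tables), and the 32-slot limit is the AHCI command-slot
    maximum rather than anything from the post above.

    #include <stdint.h>
    #include <stddef.h>

    #define NTAGS 32                /* AHCI command-slot maximum */

    struct io_request;              /* opaque: one outstanding I/O */

    /* Illustrative per-port driver state: which tags are in flight
       and which request owns each tag. */
    struct port_state {
        uint32_t           busy;    /* bit n set => tag n in use */
        struct io_request *owner[NTAGS];
    };

    /* Issue: claim a free tag for this request; the caller would then
       build the scatter/gather list for that tag and ring the doorbell. */
    int issue_io(struct port_state *p, struct io_request *req)
    {
        for (int tag = 0; tag < NTAGS; tag++)
            if (!(p->busy & (1u << tag))) {
                p->busy |= 1u << tag;
                p->owner[tag] = req;
                return tag;
            }
        return -1;                  /* queue full; caller must wait */
    }

    /* Completion interrupt: 'done' has one bit per finished tag, and a
       single interrupt may complete several of them. */
    void completion_irq(struct port_state *p, uint32_t done)
    {
        for (int tag = 0; tag < NTAGS; tag++)
            if (done & (1u << tag)) {
                /* tear down the DMA mapping, complete p->owner[tag] ... */
                p->owner[tag] = NULL;
                p->busy &= ~(1u << tag);
            }
    }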

    How would that work for SATA on SRIOV? It has to set up a set of
    virtual tags for each virtual device and multiplex them among
    multiple virtual devices onto the device physical tag set.
    Also each virtual disk device would need to have its own partition base
    and range on the physical disk and the SRIOV port would offset the
    block numbers into the correct partition range.
    On completion interrupt it has to map the physical tag back to the virtual
    one and trigger a virtual interrupt to the initiating virtual device.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Mon Jun 24 20:32:04 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, device sends a stored interrupt messages to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI) controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    Note that support for MSI in AHCI is optional (in which case the
    legacy level sensitive PCI INTA/B/C/D signals are used).

    The AHCI standard specification (ahci_1_3.pdf) is available publicly.

    What happens to SATA tagged command queuing with SRIOV?
    The tag mapping would seem to interact with virtual interrupts.

    The AHCI specification dates back to legacy PCI, and the
    MSI support is optional. Tagged queueing, if I recall
    correctly, came later and required drive support.

    The host interrupt handler would
    need to be prepared to poll all the ports for activity
    when invoked.


    SATA allows up to I think 8 commands queued at once, each with its own
    tag number, which can be performed in any order. That tag indicates
    which DMA mapping scatter gather set to use and is used to identify
    which IO's are complete. A single interrupt can indicate multiple
    tags are complete.

    In native (non-virtualized) use the device driver assigns a free tag
    number to an IO, sets up the DMA scatter/gather list for that tag,
    and on completion interrupt, for each done tag it tears down the
    DMA scatter/gather list, frees that tag number,
    and completes the associated individual IO.

    How would that work for SATA on SRIOV?

    There is no standard for AHCI that supports SR-IOV that I'm
    aware of. NVMe does have SR-IOV support.

    Without SR-IOV, the hypervisor must be the only
    entity that communicates with the AHCI controller
    and a paravirtualization (linux virtio) driver is
    provided to the guest for storage device access.

    The NVMe controller hardware interface was
    designed to fix many of the shortcomings of the
    AHCI implementation, particularly with respect
    to virtualization.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Tue Jun 25 00:50:46 2024
    Scott,

    Can you foresee any problem if I define the positions in the
    physical address space as: <ECAM: My 66000-style>::

    <1:0> must be 00 for word access
    <7:2> standard MMI/O register
    <12:8> extended MMI/O register
    <25:13> growth space for registers or for functions
    <32:26> PCIe Device, Function
    <40:32> PCIe Bus
    <56:41> PCIe Segment
    <63:57> Chip

    This effectively gives each B:D,F a 25-bit address space and
    65K segments and up to 32 chips on a motherboard. Need more
    space for functions? take as many bits as you like from the
    left hand side. Need more register room? take bits from the
    right hand side. Need more bits for Chip? Steal them from
    PCIe segment.

    I wanted to move B:D,F up a bit to separate it from the I/O
    registers which will likely come out of a memory reference
    immediate, and I wanted to position B from D,F across a MMU
    translation level boundary.

    I am expecting the code touching the MMI/O register to have
    a virtual address pointer to B:D,F and use the 16-bit immediate
    field of the LD or ST as the register specifier:

    ST #command7,[Rdevice,#registername]

    --------------------------------------------------------------
    I am expecting to use the Chip field to route requests between
    chips. It is plausible that physical device sends an interrupt
    from its PCIe segment across one-or-more chips before arriving
    at the interrupt service port in a particular chips last level
    cache. Other than latency its all part of a large coherent DRAM
    space.

    Is that plausible ? desirable ? or are there reasons to keep
    interrupt processing "more local" to the chip hosting the PCIe
    root complexes ?? {in any event, that is all under SW control.}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jun 25 13:40:53 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott,

    Can you foresee any problem if I define the positions in the
    physical address space as: <ECAM: My 66000-style>::

    <1:0> must be 00 for word access
    <7:2> standard MMI/O register
    <12:8> extended MMI/O register
    <25:13> growth space for registers or for functions
    <32:26> PCIe Device, Function
    <40:32> PCIe Bus
    <56:41> PCIe Segment
    <63:57> Chip
    Mitch,

    The BDF will never change; it's been the same since PCI was
    introduced in 1992. It's very unlikely that the size of the
    PCIe configuration space (4096 bytes) will ever change; if
    it does, the current ECAM specification and all Operating
    systems will need to change.

    The PCIe ECAM specification requires that the bus/dev/function
    fields occupy bits <27:12> and the register address occupy
    bits <11:0>.

    Anything above bit 27 is outside the PCI specification.

    Most systems that support multiple PCIe controllers assign
    each port to an unique segment and thus bits <xx:28> encode
    the PCIe controller number.

    All current operating systems will expect this:

    (From PCI_Express_Base_r3.0_10Nov10.pdf)

    Table 7-1: Enhanced Configuration Address Mapping

    Memory Address        PCI Express Configuration Space
    A[(20 + n - 1):20]    Bus Number (1 <= n <= 8)
    A[19:15]              Device Number
    A[14:12]              Function Number
    A[11:8]               Extended Register Number
    A[7:2]                Register Number
    A[1:0]                Along with size of the access, used to
                          generate Byte Enables



    I am expecting the code touching the MMI/O register to have
    a virtual address pointer to B;D,F and use the 16-bit immediate
    field of the LD or ST as the register specifier:

    ST #command7,[Rdevice,#registername]

    Note that the ECAM must support 8, 16, 32-bit (optionally 64-bit)
    single-copy atomic accesses to all configuration space registers.


    --------------------------------------------------------------
    I am expecting to use the Chip field to route requests between
    chips. It is plausible that physical device sends an interrupt
    from its PCIe segment across one-or-more chips before arriving
    at the interrupt service port in a particular chips last level
    cache. Other than latency its all part of a large coherent DRAM
    space.

    It's fine to use the chip field to route transactions within
    the chip - if you use a 1:1 mapping between the PCI memory
    space and the host memory space, then you can program the
    chip field directly in the MSI-X vector address.

    Or, your host bridge can hold mapping tables that maps the
    downstream PCI memory address space addresses to host
    addresses, inserting the target chip id based on the
    bridge configuration registers (which are defined by
    the host, not PCI).


    Is that plausible ? desirable ? or are there reasons to keep
    interrupt processing "more local" to the chip hosting the PCIe
    root complexes ?? {in any event, that is all under SW control.}

    That depends on how your interrupt controller is designed. If
    you have a multi-socket/multi-chiplet configuration where all
    chiplets are identical, and each has its own interrupt controller
    (allowing single chiplet implementations), then you'll probably
    want to use your CHIP bits in the address to route to the interrupt
    controller on the closest chiplet just to reduce interrupt
    latency. The interrupt controllers will likely need to cooperate
    at the hardware level to maintain a single OS-visible "interrupt
    space" where each controller handles a subset of the interrupt
    number space.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Tue Jun 25 15:58:51 2024
    Yes, I deserve that.
    I figured out how "bad" an idea it was at the bar last night.
    Sorry to have wasted so much of your time.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jun 27 01:47:49 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott,

    Note that the ECAM must support 8, 16, 32-bit (optionally 64-bit)
    single-copy atomic accesses to all configuration space registers.

    MMI/O is sequentially consistent while Config is Strongly ordered.

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??

    Load Multiple (LM) instruction provides ATOMIC access to a series
    of sequentially ordered bytes in MMI/O (or Config, or DRAM). So,
    if desired, one could read 0x0-0x40 configuration header in a
    single LM. This is completely ATOMIC without SW doing anything
    other than supplying an address and a count.

    Memory to Memory Move (MM) is similar but reads from one place and
    writes to another in a single interconnect transaction--that is
    ATOMICally. {Basically as long as no page boundaries are crossed,
    each memory reference instruction is ATOMIC with respect to
    interested 3rd party observations.}

    Likewise, LDB, LDH, LDW, LDD and their ST counterparts are unit
    ATOMIC.

    But I don't know what you mean by "single-copy atomic accesses" ??

    <snip>

    That depends on how your interrupt controller is designed. If
    you have a multi-socket/multi-chiplet configuration where all
    chiplets are identical, and each has its own interrupt controller
    (allowing single chiplet implementations), then you'll probably
    want to use your CHIP bits in the address to route to the interrupt controller on the closest chiplet just to reduce interrupt
    latency.


    Yes, that was the concern. One might expect that the Guest OS
    would send an I/O request to a Guest OS device driver on the
    chip local to the PCIe tree the device is on, to minimize all
    the latency, not just interrupt delivery. Reading and writing
    of MMI/O space is <as they say> slow.

    The interrupt controllers will likely need to cooperate
    at the hardware level to maintain a single OS-visible "interrupt
    space" where each controller handles a subset of the interrupt
    number space.

    My model has multiple interrupt tables from the get go.

    For a start, I assume that each Guest OS has an interrupt table
    shared across however many virtual or physical cores
    the system manages. A HyperVisor has its own Interrupt Table,
    and the Secure Monitor has its own table.

    A core control register points at this table, and is used when
    negotiating for an interrupt, and used to detect Interrupt
    table priority escalation (when a priority bit is turned on).
    If SW wants a different Interrupt Table (or none) a simple
    write to the control register switches the table.*

    An interrupt table has interrupts raised when an MSI or MSI-X
    interrupt arrives at the Interrupt service provider port.

    I have configured the I/O MMU to translate DMA accesses through
    one set of MMU tables, and translate Interrupt access through
    <a conceptually> different MMU tables, and have access to the
    priority of the interrupt without taking MSI-X message bits.

    So, DMA can read or write directly through application MMU
    tables while the associated Interrupt goes to Guest OS at
    priority of SW's choice using the Interrupt table in charge
    when the I/O was set up.
    -----------------------------------------------------------
    (*) So a virtual machine with 17 virtual cores and accepting
    interrupts on only 5 of them will have 5 with a valid IT and 12
    with IT set invalid.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Jun 27 11:27:20 2024
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??


    It sounds like a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had its
    memory ordering model formalized.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Jun 27 13:52:59 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott,

    Note that the ECAM must support 8, 16, 32-bit (optionally 64-bit)
    single-copy atomic accesses to all configuration space registers.

    MMI/O is sequentially consistent while Config is Strongly ordered.

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??

    It's a term-of-art in the ARM architecture document (DDI0487).

    A memory access instruction that is single-copy atomic has the
    following properties:

    1. For a pair of overlapping single-copy atomic store instructions, all
    of the overlapping writes generated by one of the stores are
    Coherence-after the corresponding overlapping writes generated
    by the other store.

    2. For a single-copy atomic load instruction L1 that overlaps a single-copy
    atomic store instruction S2, if one of the overlapping reads generated
    by L1 Reads-from one of the overlapping writes generated by S2, then none
    of the overlapping writes generated by S2 are Coherence-after the
    corresponding overlapping reads generated
    by L1.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Michael S on Thu Jun 27 09:25:34 2024
    Michael S wrote:
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??


    It sounds as a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had its
    memory ordering model formalized.


    Multi-copy atomic is ARM's name for a write-update coherence protocol
    as it allows each cache to have its own copy of a single memory
    location. Single-copy atomic is their name for a write-invalidate protocol
    as it ensures that there is one value for each memory location.

    Originally ARM's weak cache coherence protocol spec, like Alpha,
    did not explicitly exclude multi-copy atomic so software designers had
    to consider all the extra race conditions a write-update implementation
    might allow. But this was wasted extra effort because no one implements
    a write-update protocol, just write-invalidate.
    Eventually ARM specified that it was single-copy atomic (write-invalidate).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 27 17:33:16 2024
    EricP wrote:

    Michael S wrote:
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??


    It sounds as a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had memory
    ordering model formalized.


    Multi-copy atomic is ARM's name for a write-update coherence protocol
    as it allows each cache to have its own copy of a single memory
    location.

    Sounds like SNARFing

    Single-copy atomic is their name for a write-invalidate protocol
    as it ensures that there is one value for each memory location.

    Originally ARM's weak cache coherence protocol spec, like Alpha,
    did not explicitly exclude multi-copy atomic so software designers had
    to consider all the extra race conditions a write-update implementation
    might allow. But this was wasted extra effort because no one implements
    a write-update protocol, just write-invalidate.
    Eventually ARM specified that it was single-copy atomic
    (write-invalidate).

    Seems to me that if one is sequentially consistent, then one is also
    multi-copy ATOMIC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jun 27 17:37:12 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott,

    Note that the ECAM must support 8, 16, 32-bit (optionally 64-bit)
    single-copy atomic accesses to all configuration space registers.

    MMI/O is sequentially consistent while Config is Strongly ordered.

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??

    It's a term-of-art in the ARM architecture document (DDI0487).

    A memory access instruction that is single-copy atomic has the
    following properties:

    1. For a pair of overlapping single-copy atomic store instructions,
    all
    of the overlapping writes generated by one of the stores are
    Coherence-after the corresponding overlapping writes generated
    by the other store.

    Writes to a small locale do not pass each other in the interconnect.

    2. For a single-copy atomic load instruction L1 that overlaps a single-copy
    atomic store instruction S2, if one of the overlapping reads
    generated
    by L1 Reads-from one of the overlapping writes generated by S2,
    then none
    of the overlapping writes generated by S2 are Coherence-after the
    corresponding overlapping reads generated
    by L1.

    Because the LD saw the intermediate data state where some of the STs
    were complete while others pend.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Jun 28 12:24:50 2024
    MitchAlsup1 wrote:
    EricP wrote:

    Michael S wrote:
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??


    It sounds as a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had
    memory ordering model formalized.


    Multi-copy atomic is ARM's name for a write-update coherence protocol
    as it allows each cache to have its own copy of a single memory
    location.

    Sorry, I had this backwards.
    Multi-copy atomic was ARM's name for what others call store atomicity,
    which requires that a core's stores appear to be seen by all other cores
    at once. This is the effect write-invalidate protocols produce.

    Non-MCA was what weak ordered write-update protocols can cause where
    different nodes can see the same location as having different values.

    Sounds like SNARFing

    Write-update depends on broadcasting all writes if that's what snarf
    means. Write-update requires write-through caches.
    It requires that the coherence network and controllers all apply
    all writes from all sources in the same order across all caches.

    With a shared snoopy bus and a one level cache all updated synchronously
    this can be efficient as the bus itself acts as an ordering mutex.
    Outside of that it was not considered to scale well as otherwise it
    needs to send a message to each peer node for each write.

    Also it must ensure that writes to the same location are applied in the
    same order across all nodes. When one introduces multiple layers of
    caches connected with multiple comms queues, that synchronization
    becomes complicated.
    If one considers mesh networks, messages from even the same source
    may arrive at different nodes in different order.

    I didn't know that any production systems used write-update.
    The book "A Primer on Memory Consistency and Cache Coherence" 2nd Ed 2020
    has a chapter on write-update protocols and according to it examples of
    systems that used write-update are Sun Starfire E10000 and IBM Power5.
    Starfire used point-to-point messaging to create a "logical" shared bus.
    Power5 used a unidirectional ring network.

    Originally ARM's weak cache coherence protocol spec, like Alpha,
    did not explicitly exclude multi-copy atomic so software designers had
    to consider all the extra race conditions a write-update implementation
    might allow. But this was wasted extra effort because no one implements
    a write-update protocol, just write-invalidate.
    Eventually ARM specified that it was single-copy atomic
    (write-invalidate).

    Correction: They originally did not explicitly require store atomicity
    (MCA), implying that a weakly ordered write-update protocol might allow
    a single location to be seen as having different values on different
    nodes.

    Seems to me that if one is sequentially consistent, then one is also multi-copy ATOMIC.

    Yes, store atomicity to each location would be implied by SC;
    otherwise how could all nodes agree on the order of all updates?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 28 20:30:56 2024
    EricP wrote:

    MitchAlsup1 wrote:

    Seems to me that if one is sequentially consistent, then one is also
    multi-copy ATOMIC.

    Yes, store atomicity to each locations would be implied by SC
    otherwise how could all nodes agree on the order of all updates.

    Most of the time cores only need to agree about cache consistency
    and this can be satisfied by causal consistency.

    ATOMIC stuff is where cores start to require SC,
    and all MMI/O should be SC or SC per virtual channel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 28 20:26:42 2024
    EricP wrote:

    MitchAlsup1 wrote:

    Sounds like SNARFing

    Write-update depends on broadcasting all writes if that's what snarf
    means.

    General cache coherency policies broadcast a core's address to all
    caches in the system, and if a cache contains that same cache
    line, it responds with a SHARED back to the requestor, or it invalidates
    the line. We call this SNOOPing. It works well.

    SNARF is a term whereby the owner of data broadcasts the data and
    its address, and any cache containing that line will write the
    data payload into its cache (rather than invalidating and then
    going back and fetching it anew). For certain kinds of data structure
    SNARF is significantly more efficient than Invalidate-Refetch.
    A single message around the system performs all the needed updates,
    instead of 1 invalidate and K fetches.

    SNARF is almost exclusively used as side-band signals hiding under
    the cache coherent Interconnect command set.

    SNARF is almost never available to software. It is more like
    microArchitecture talking to other microArchitecture.

    Also note: µA-to-µA is rarely of line size and often uses physical
    address bits not available through MMU tables.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Sun Jun 30 00:41:05 2024
    On Fri, 28 Jun 2024 20:26:42 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    EricP wrote:

    MitchAlsup1 wrote:

    Sounds like SNARFing

    Write-update depends on broadcasting all writes if that's what snarf
    means.

    General cache coherency policies broadcast a cores address to all
    caches in the system, and if that cache contains that same cache
    line, it responds with a SHARED back to requestor, or it invalidates
    the line. We call this SNOOPing. It works well.

    SNARF is a term whereby the owner of data broadcasts the data and
    its address, and any cache containing that line will write the
    data payload into its cache (rather than invalidating and then
    going back and fetching it anew). For certain kinds of data structure
    SNARF is significantly more efficient than Invalidate-Refetch.
    A single message around the system performs all the needed updates,
    instead of 1 invalidate and K fetches.

    SNARF is almost exclusively used as side-band signals hiding under
    the cache coherent Interconnect command set.

    SNARF is almost never available to software. It is more like
    microarchitecture talking to other microarchitecture.

    Also note: µA-to-µA traffic is rarely of line size and often uses physical
    address bits not available through MMU tables.


    Stupid question: why is it called "snarf"?

    IIRC, Snoopy (Peanuts) "scarfed" his food. I don't recall ever seeing
    Snarf (Thundercats) actually eat.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to George Neuner on Sun Jun 30 16:16:01 2024
    George Neuner wrote:

    On Fri, 28 Jun 2024 20:26:42 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    EricP wrote:

    MitchAlsup1 wrote:

    Sounds like SNARFing

    Write-update depends on broadcasting all writes if that's what snarf
    means.

    General cache coherency policies broadcast a core's address to all
    caches in the system, and if a cache contains that same cache
    line, it responds with SHARED back to the requestor, or it invalidates
    the line. We call this SNOOPing. It works well.

    SNARF is a term whereby the owner of data broadcasts the data and
    its address, and any cache containing that line will write the
    data payload into its cache (rather than invalidating and then
    going back and fetching it anew). For certain kinds of data structures
    SNARF is significantly more efficient than Invalidate-Refetch:
    a single message around the system performs all the needed updates,
    instead of 1 invalidate and K fetches.

    SNARF is almost exclusively used as side-band signals hiding under
    the cache coherent Interconnect command set.

    SNARF is almost never available to software. It is more like
    microarchitecture talking to other microarchitecture.

    Also note: µA-to-µA traffic is rarely of line size and often uses physical
    address bits not available through MMU tables.


    Stupid question: why is it called "snarf"?

    I don't really know--I first heard the term in 1982, as a SNOOP but in
    the other direction--instead of taking data away, it put data back.

    IIRC, Snoopy (Peanuts) "scarfed" his food. I don't recall ever seeing
    Snarf (Thundercats) actually eat.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to EricP on Mon Jul 1 09:33:37 2024
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Michael S wrote:
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Exactly what are you intending to mean by "single-copy atomic
    accesses" ??

    It sounds like a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had its
    memory ordering model formalized.


    Multi-copy atomic is ARM's name for a write-update coherence protocol
    as it allows each cache to have its own copy of a single memory location.

    The terminology is not Arm's; it comes from

    William W. Collier. 1992. Reasoning about parallel architectures.
    Prentice Hall, Englewood Cliffs.

    Single-copy atomic is their name for a write-invalidate protocol
    as it ensures that there is one value for each memory location.

    Originally ARM's weak cache coherence protocol spec, like Alpha's,
    did not explicitly exclude multi-copy atomic, so software designers had
    to consider all the extra race conditions a write-update implementation
    might allow. But this was wasted extra effort because no one implements
    a write-update protocol, just write-invalidate.
    Eventually ARM specified that it was single-copy atomic (write-invalidate).

    And it's now multi-copy atomic, thank goodness.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Thu Jul 4 17:34:33 2024
    On Page Request Services (PRS)

    The device performs an ATS Translation Request for a page
    which is not currently present in memory. So the I/O MMU sends
    it a PTE which carries the not-present information.

    But the system operates with nested paging: one level manipulated
    by the Guest OS and the other manipulated by the Hypervisor.

    Yet the device merely got "not-present".

    So when the device requests the page be brought in, how does the
    I/O MMU know whether to interrupt the Guest OS or to interrupt the
    Hypervisor to bring in the page and restart the command ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Jul 5 18:57:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Page Request Services (PRS)

    The device performs an ATS Translation Request for a page
    which is not currently present in memory. So the I/O MMU sends
    it a PTE which carries the not-present information.

    But the system operates with nested paging: one level manipulated
    by the Guest OS and the other manipulated by the Hypervisor.

    Yet the device merely got "not-present".

    So when the device requests the page be brought in, how does the
    I/O MMU know whether to interrupt the Guest OS or to interrupt the
    Hypervisor to bring in the page and restart the command ??


    The stream ID which identifies the DMA stream from the device
    also identifies the page requests, so they're queued by the
    IOMMU based on IOMMU configuration tables (i.e. an inbound
    translation or PRI request will first look up the stream ID
    to determine the translation table base register for that
    stream). If PRI is supported, it will also queue the page
    request to a queue corresponding to the hypervisor or
    guest that is configured as the owner of that stream
    and generate an interrupt to the hv/kernel/guest. The
    interrupt can be deferred for a page request group and
    will be delivered only when the 'last' bit is set in the
    request.
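
    A minimal sketch of that dispatch (the table, field, and function
    names here are invented; real IOMMUs differ in detail): the stream
    ID indexes a per-stream context that names both the translation
    table base and the owner whose page-request queue and interrupt
    are used.

        #include <stdbool.h>
        #include <stdint.h>

        enum owner { OWNER_HYPERVISOR, OWNER_GUEST };

        struct stream_ctx {                 /* one entry per stream ID */
            uint64_t   ttbr;                /* translation table base register */
            enum owner owner;               /* who services faults on this stream */
            int        pri_queue;           /* that owner's page-request queue */
        };

        extern struct stream_ctx stream_table[];   /* IOMMU configuration table */
        extern void enqueue_page_request(int queue, uint64_t addr, uint32_t group);
        extern void raise_interrupt(enum owner who);

        void iommu_handle_page_request(uint32_t stream_id, uint64_t faulting_addr,
                                       uint32_t group, bool last)
        {
            struct stream_ctx *ctx = &stream_table[stream_id];

            enqueue_page_request(ctx->pri_queue, faulting_addr, group);

            /* The interrupt is deferred across a page request group and
             * only delivered when the request with the 'last' bit arrives. */
            if (last)
                raise_interrupt(ctx->owner);
        }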

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Wed Jul 10 17:59:46 2024
    On page 43 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    it states: "Must not indicate an invalidation has completed
    until all outstanding Read Requests that reference the
    associated translation have retired"

    "Must insure that the invalidation completion indication to RC
    will arrive at the RC after previously posted writes that use
    the stale address."

    and

    "...If transactions are in a queue waiting to be sent, It is
    not necessary for the device to expunge requests from the
    queue even if those transaction[s] use an address that is
    being invalidated."

    The first 2 seem to be PCIe ordering requirements between
    EP and RC.

    The 3rd seems to say if EP used a translation while it was
    valid, then its invalidation does not prevent requests
    using the now stale translation.

    So, a SATA device could receive a command to read a page
    into memory. SATA EP requests ATS for the translation of
    the given virtual address to the physical page. Then the
    EP creates a queue of write requests filling in the addr
    while waiting on data. Once said queue has been filled,
    and before the data comes off the disk, an invalidation
    arrives and is ACKed. The data is still allowed to write
    into memory.

    {{But any new command to the SATA device would not be
    allowed to use the translation.}}

    Is this a reasonable interpretation of that page?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to mitchalsup@aol.com on Wed Jul 10 18:21:14 2024
    In article <922220c8593353c7ed0fda9e656d359d@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On page 43 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    it states: "Must not indicate an invalidation has completed
    until all outstanding Read Requests that reference the
    associated translation have retired"

    "Must insure that the invalidation completion indication to RC
    will arrive at the RC after previously posted writes that use
    the stale address."

    and

    "...If transactions are in a queue waiting to be sent, It is
    not necessary for the device to expunge requests from the
    queue even if those transaction[s] use an address that is
    being invalidated."

    The first 2 seem to be PCIe ordering requirements between
    EP and RC.

    The 3rd seems to say if EP used a translation while it was
    valid, then its invalidation does not prevent requests
    using the now stale translation.

    So, a SATA device could receive a command to read a page
    into memory. SATA EP requests ATS for the translation of
    the given virtual address to the physical page. Then the
    EP creates a queue of write requests filling in the addr
    while waiting on data. Once said queue has been filled,
    and before the data comes off the disk, an invalidation
    arrives and is ACKed. The data is still allowed to write
    into memory.

    {{But any new command to the SATA device would not be
    allowed to use the translation.}}

    Is this a reasonable interpretation of that page?

    No, it's saying that the EP can keep using a stale translation UNTIL it
    returns the ACK for an invalidation. It does not need to toss those requests--it just needs to delay the ACK. Or it could toss the requests,
    and then send the ACK faster, but it's optional if it wants to toss requests.

    Once the EP sends the ACK, it can no longer send any transactions
    using the old translation.

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Kent Dickey on Wed Jul 10 19:02:12 2024
    kegs@provalid.com (Kent Dickey) writes:
    In article <922220c8593353c7ed0fda9e656d359d@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On page 43 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    it states: "Must not indicate an invalidation has completed
    until all outstanding Read Requests that reference the
    associated translation have retired"

    "Must insure that the invalidation completion indication to RC
    will arrive at the RC after previously posted writes that use
    the stale address."

    and

    "...If transactions are in a queue waiting to be sent, It is
    not necessary for the device to expunge requests from the
    queue even if those transaction[s] use an address that is
    being invalidated."

    The first 2 seem to be PCIe ordering requirements between
    EP and RC.

    The 3rd seems to say if EP used a translation while it was
    valid, then its invalidation does not prevent requests
    using the now stale translation.

    So, a SATA device could receive a command to read a page
    into memory. SATA EP requests ATS for the translation of
    the given virtual address to the physical page. Then the
    EP creates a queue of write requests filling in the addr
    while waiting on data. Once said queue has been filled,
    and before the data comes off the disk, an invalidation
    arrives and is ACKed. The data is still allowed to write
    into memory.

    {{But any new command to the SATA device would not be
    allowed to use the translation.}}

    Is this a reasonable interpretation of that page?

    No, it's saying that the EP can keep using a stale translation UNTIL it
    returns the ACK for an invalidation. It does not need to toss those
    requests--it just needs to delay the ACK. Or it could toss the requests,
    and then send the ACK faster, but it's optional if it wants to toss requests.


    Indeed. And I'd suggest that the official PCI Express
    specification is a better source than a set of slides.

    From the spec:

    a. A Function is required not to indicate the invalidation has completed until
    all outstanding Read Requests or Translation Requests that reference the
    associated translated address have been retired or nullified.
    b. A Function is required to ensure that the Invalidate Completion indication
    to the RC will arrive at the RC after any previously posted writes that use
    the "stale" address.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jul 10 22:01:27 2024
    On Wed, 10 Jul 2024 19:02:12 +0000, Scott Lurndal wrote:

    kegs@provalid.com (Kent Dickey) writes:
    In article <922220c8593353c7ed0fda9e656d359d@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On page 43 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    it states: "Must not indicate an invalidation has completed
    until all outstanding Read Requests that reference the
    associated translation have retired"

    "Must insure that the invalidation completion indication to RC
    will arrive at the RC after previously posted writes that use
    the stale address."

    and

    "...If transactions are in a queue waiting to be sent, It is
    not necessary for the device to expunge requests from the
    queue even if those transaction[s] use an address that is
    being invalidated."

    The first 2 seem to be PCIe ordering requirements between
    EP and RC.

    The 3rd seems to say if EP used a translation while it was
    valid, then its invalidation does not prevent requests
    using the now stale translation.

    So, a SATA device could receive a command to read a page
    into memory. SATA EP requests ATS for the translation of
    the given virtual address to the physical page. Then the
    EP creates a queue of write requests filling in the addr
    while waiting on data. Once said queue has been filled,
    and before the data comes off the disk, an invalidation
    arrives and is ACKed. The data is still allowed to write
    into memory.

    {{But any new command to the SATA device would not be
    allowed to use the translation.}}

    Is this a reasonable interpretation of that page?

    No, it's saying that the EP can keep using a stale translation UNTIL it
    returns the ACK for an invalidation. It does not need to toss those
    requests--it just needs to delay the ACK. Or it could toss the requests,
    and then send the ACK faster, but it's optional if it wants to toss
    requests.


    Indeed. And I'd suggest that the official PCI Express
    specification is a better source than a set of slides.

    From the spec:

    I do not have access through the PCIe paywall.

    a. A Function is required not to indicate the invalidation has completed until
    all outstanding Read Requests or Translation Requests that reference the
    associated translated address have been retired or nullified.
    b. A Function is required to ensure that the Invalidate Completion indication
    to the RC will arrive at the RC after any previously posted writes that use
    the "stale" address.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jul 10 22:51:19 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 10 Jul 2024 19:02:12 +0000, Scott Lurndal wrote:



    Indeed. And I'd suggest that the official PCI Express
    specification is a better source than a set of slides.

    From the spec:

    I do not have access through the PCIe paywall.

    A Google search turned up a couple of older ones. Version 4
    and up describe ATS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Sun Jul 28 22:19:44 2024
    On Wed, 10 Jul 2024 17:59:46 +0000, MitchAlsup1 wrote:

    On page 34 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    They/He uses the notation VP# (virtual Plane number)

    Is that what we have been calling the PCIe "Segment" ?? from ECAM

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Jul 29 03:10:58 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 10 Jul 2024 17:59:46 +0000, MitchAlsup1 wrote:

    On page 34 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    They/He uses the notation VP# (virtual Plane number)

    Is that what we have been calling the PCIe "Segment" ?? from ECAM

    No, 'VP#' is a concept related to multi-root I/O virtualization (MR-IOV),
    i.e. where multiple host root complexes share a single SR-IOV-capable
    endpoint via one or more multi-root-capable PCI Express switches.

    Each root complex which shares resources of an SR-IOV endpoint
    physical function will operate in that virtual plane (as defined
    by the VP field in the root complex MR-IOV capability).

    This is independent of the 'segment' or 'domain' notation used to
    concatenate ECAM regions for multiple RCs on a single host.

    Note that MR IOV is very rare at this point (two decades
    after the above PCI-SIG presentation - I think I was at
    that meeting, actually).

    VP is used to qualify the BDF on the PCI fabric in config TLPs. Segment/domain
    are the mechanisms used to qualify access to the configuration space
    by the host via the host ECAM region(s) - they're not really a
    PCI Express concept; PCIe simply defines one ECAM per root complex
    (Intel calls them segments, Arm calls them domains).

    Concatenating the RC ECAM regions leads to using bits <20+n:20> as
    the 'segment' number for host accesses to the concatenated
    ECAM.
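
    For concreteness, a minimal sketch of the standard ECAM layout being
    described (the base address and buses-per-region below are placeholder
    assumptions): within one region the config address is
    bus<<20 | device<<15 | function<<12 | register, so with regions laid
    out back-to-back the bits just above the bus field select the
    segment/domain.

        #include <stdint.h>

        #define ECAM_BASE  0xE000000000ULL  /* hypothetical concatenated base */
        #define BUS_BITS   8                /* assume 256 buses per RC region */

        /* Compute the MMIO address of a config register: seg picks the RC's
         * ECAM region, then bus/device/function/register index within it.  */
        static inline uint64_t ecam_addr(uint32_t seg, uint8_t bus, uint8_t dev,
                                         uint8_t fn, uint16_t reg)
        {
            return ECAM_BASE
                 + ((uint64_t)seg << (20 + BUS_BITS))   /* segment/domain */
                 + ((uint64_t)bus << 20)
                 + ((uint64_t)(dev & 0x1F) << 15)
                 + ((uint64_t)(fn  & 0x07) << 12)
                 + (reg & 0xFFF);
        }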

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)