• Re: Hints in the instruction set (was: Redundant prefixes break fsrm ..

    From Thomas Koenig@21:1/5 to Anton Ertl on Sun Nov 19 15:02:05 2023
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.
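
    [For illustration, not part of the original post: a minimal sketch of
    an explicit software prefetch hint in C, using the GCC/Clang
    __builtin_prefetch intrinsic. The array and the prefetch distance of
    16 elements are invented, and on most modern CPUs the hardware
    prefetcher handles a linear scan like this on its own.]

        #include <stddef.h>

        /* Sum an array, hinting the cache 16 elements ahead.  The second
           argument (0) marks the access as a read; the third (3) requests
           high temporal locality. */
        long sum(const long *a, size_t n)
        {
            long s = 0;
            for (size_t i = 0; i < n; i++) {
                if (i + 16 < n)
                    __builtin_prefetch(&a[i + 16], 0, 3);
                s += a[i];
            }
            return s;
        }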

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    Suddenly, all the things done for optimizing in hardware in the
    general case (branch prediction, cache eviction, ...) can make
    performance for the unusual, but relevant, case worse.

    I believe there is active research going on on how to overcome this
    "bias for the common case" with today's processors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Lurndal on Sun Nov 19 16:09:00 2023
    In article <4aq6N.1807$Jbsd.1159@fx03.iad>, scott@slp53.sl.home (Scott
    Lurndal) wrote:

    ARMv8 has a large space reserved for hint instructions, which
    include a wide-ranging set of functionality, such as:

    Some pointer authentication instructions.

    Another is "Branch Target Indicator" (BTI) which has no-op semantics in
    itself. But if BTI traps are enabled, any branch that doesn't arrive at a
    BTI traps as an illegal instruction in the Android implementation I've
    been tinkering with.

    BTI traps on Android are enabled by the OS, using the page table AFAICS,
    for the text segments of executables and shared libraries which are
    marked as compatible. They're marked that way by the LLVM linker if all
    the object files that went into them were marked as compatible; the
    compiler marks the object files if it's given the appropriate option.
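
    [As an illustration of what that compiler option does, a sketch not
    taken from the post, with invented function names: building with
    -mbranch-protection=bti on AArch64 GCC/Clang makes the compiler emit a
    BTI landing pad at every point an indirect branch may legitimately
    land, such as the entry of functions whose address is taken.]

        /* Compile with: cc -mbranch-protection=bti ... (AArch64) */
        typedef int (*handler_t)(int);

        int dispatch(handler_t h, int x)
        {
            /* Indirect call: with BTI enforcement enabled for this page,
               h must point at a BTI landing pad or the call traps. */
            return h(x);
        }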

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Sun Nov 19 15:51:28 2023
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    ARMv8 has a large space reserved for hint instructions, which
    include a wide-ranging set of functionality, such as:

    WFI, WFE (wait for interrupt or event). These are specified
    such that they don't need to actually do anything, but may
    if an implementation, e.g., supports entering low power states
    while waiting.
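
    [A sketch of how such a hint is typically used, with an invented flag
    name and GCC/Clang inline-assembly syntax. Because WFE is allowed to
    do nothing, the loop must re-check its condition either way.]

        #include <stdatomic.h>

        extern atomic_int ready;   /* set by another core, then an event */

        void wait_for_ready(void)
        {
            while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
                __asm__ volatile("wfe" ::: "memory");  /* may enter low power */
        }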

    Barrier instructions, which may be no-ops on some implementations.
    (e.g. trace buffer barrier, exception synchronization barrier, etc).

    YIELD instruction, which may be a no-op on some implementations.

    Some pointer authentication instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sun Nov 19 15:43:17 2023
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases. So
    you have to consider the worst case for, e.g., branch prediction,
    unless you can prove that you get a better case. And I have not heard
    about such proofs for dynamic branch predictors, so static branch
    prediction (hints) may indeed be the way to go.
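
    [A concrete example of a static branch hint at the source level, a
    sketch with an invented rare-path handler: GCC and Clang expose
    __builtin_expect, which some ISAs can map to hint-carrying branch
    encodings and which otherwise steers code layout.]

        void handle_overrun(void);   /* hypothetical rare-path handler */

        void check(int bytes_left)
        {
            if (__builtin_expect(bytes_left < 0, 0))  /* hinted unlikely */
                handle_overrun();
        }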

    Suddenly, all the things done for optimizing in hardware in the
    general case (branch prediction, cache eviction, ...) can make
    performance for the unusual, but relevant, case worse.

    Actually, for caches with LRU replacement I have heard that they can
    be analysed; that was research coming out of Saarbruecken ~20 years
    ago, and I think they did a spin-off for commercializing it. One
    problem that they had was that the usual n-way set-associative caches
    with n>2 don't have proper LRU replacement, but pseudo-LRU; with these
    caches the guarantees degrade to those of a 2-way cache (with the same
    way sizes). I don't remember if they used that for data or for
    instructions.

    I have not heard of any advances in WCET since that time, but maybe I
    just went to the wrong meetings.

    I believe there is active research going on on how to overcome this
    "bias for the common case" with today's processors.

    When I heard about the cache work, they also talked about the
    processors and what you know about their execution time. IIRC they
    found that most high-performance architectures of the day were
    problematic.

    One other thing I remember is that on some PowerPC CPU one could lock
    certain cache lines in the cache, so they will not be replaced. So if
    you use 6 of your 8 ways (of a 32KB cache with 4KB ways) for locking
    stuff into the cache, and the other two ways for performing analysable
    accesses, it's a lot better than a CPU without a cache.

    Now ARM offers cores with the R profile (e.g., ARMv8-R), where R
    stands for real-time. I have not looked what the properties of these
    cores are. I found it surprising that the big market for them seems
    to be disk drives and such things.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Nov 19 16:36:10 2023
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases.

    There is one exception, whose value to society is debatable, but it
    exists nonetheless: High-speed financial trading. These guys (or
    rather, their computers) spend a lot of time analyzing. They make very
    few trades per CPU cycle, but if they do, they want to be fast. So,
    latency for the less-commonly travelled path becomes the main objective.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Nov 19 17:21:26 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    One other thing I remember is that on some PowerPC CPU one could lock
    certain cache lines in the cache, so they will not be replaced.

    The Cavium MIPS cores had similar capabilities. They also had
    a mechanism that allowed software to push a complete reserved
    cache line (128 bytes) to an on-chip coprocessor atomically.

    ARMv8 has an optional architectural feature called memory
    partitioning and management (MPAM) that supports cache
    allocation and memory controller bandwidth allocation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Sun Nov 19 17:22:46 2023
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases.

    There is one exception, whose value to society is debatable, but it
    exists nonetheless: High-speed financial trading. These guys (or
    rather, their computers) spend a lot of time analyzing. They make very
    few trades per CPU cycle, but if they do, they want to be fast. So,
    latency for the less-commonly travelled path becomes the main objective.

    Although they're more generally interested in network latency than
    cache latency, to the extent that they colocate their trading systems
    at the exchange data centers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sun Nov 19 18:00:15 2023
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases.

    There is one exception, whose value to society is debatable, but it
    exists nonetheless: High-speed financial trading. These guys (or
    rather, their computers) spend a lot of time analyzing. They make very
    few trades per CPU cycle, but if they do, they want to be fast. So,
    latency for the less-commonly travelled path becomes the main objective.

    I don't think that this use fits in the hard RTS frame at all. They
    have no deadline, but want to be as fast as possible, in as many cases
    as possible, i.e., the typical setup for mainstream processors. They
    don't fail if they miss a deadline, they fail if the competitors make
    the trade faster than they do. But failures are not catastrophic.
    It's good enough if they are faster than the majority of the
    competitors the majority of time.

    Concerning the uncommonness of actually making a trade, one way of
    addressing this that comes to my mind is to have a core that just
    performs trades, and waits for it with a spinlock (so this core does
    not have a low clockspeed when the trade comes along). Because it
    only does trades and the spinlock, everything is warmed up for
    trading, there is only one branch miss for coming out of the spinlock.
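
    [A minimal sketch of that setup under invented names: a dedicated,
    pinned core spinning on a flag so its clock, caches, and predictors
    stay warm for the rare trade.]

        #include <stdatomic.h>

        extern atomic_int trade_pending;   /* hypothetical: set by the feed handler */
        extern void execute_trade(void);   /* hypothetical hot path */

        void trading_core_loop(void)
        {
            for (;;) {
                /* Spin at full speed; no wait hint, so no low-power exit cost. */
                while (atomic_load_explicit(&trade_pending,
                                            memory_order_acquire) == 0)
                    ;
                atomic_store_explicit(&trade_pending, 0, memory_order_relaxed);
                execute_trade();   /* the only branch miss is leaving the spin */
            }
        }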

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Sun Nov 19 18:09:31 2023
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases.

    There is one exception, whose value to society is debatable, but it
    exists nonetheless: High-speed financial trading. These guys (or
    rather, their computers) spend a lot of time analyzing. They make very
    few trades per CPU cycle, but if they do, they want to be fast. So,
    latency for the less-commonly travelled path becomes the main objective.

    Although they're more generally interested in network latency than
    cache latency, to the extent that they colocate their trading systems
    at the exchange data centers.

    True.

    However, if competitors' machines are in the same building, execution
    speed can still play a decisive role...

    (If it was up to me to regulate, I would probably add a mandated
    random delay to each transaction, with audits to prove later that
    this has been applied fairly).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Sun Nov 19 19:25:50 2023
    Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases. So
    you have to consider the worst case for, e.g., branch prediction,
    unless you can prove that you get a better case. And I have not heard
    about such proofs for dynamic branch predictors, so static branch
    prediction (hints) may indeed be the way to go.

    Suddenly, all the things done for optimizing in hardware in the
    general case (branch prediction, cache eviction, ...) can make
    performance for the unusual, but relevant, case worse.

    Actually, for caches with LRU replacement I have heard that they can
    be analysed; that was research coming out of Saarbruecken ~20 years
    ago, and I think they did a spin-off for commercializing it. One
    problem that they had was that the usual n-way set-associative caches
    with n>2 don't have proper LRU replacement, but pseudo-LRU; with these
    caches the guarantees degrade to those of a 2-way cache (with the same
    way sizes). I don't remember if they used that for data or for
    instructions.

    I have not seen real LRU for 2 decades. We mostly use what is called::
    "not recently used" which is a set of n-bits (n==sets):: When then n-th
    bit gets set, all n-bits are cleared.
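
    [A sketch of that scheme in C, with invented names, reading n as the
    number of ways in a set: marking the last unmarked way clears all the
    bits, and the victim is any way whose bit is still clear.]

        #include <stdint.h>

        #define WAYS 8

        struct nru_set { uint8_t used; };  /* bit i set => way i recently used */

        static void nru_touch(struct nru_set *s, int way)
        {
            s->used |= (uint8_t)(1u << way);
            if (s->used == (1u << WAYS) - 1)   /* n-th bit set: clear all */
                s->used = 0;
        }

        static int nru_victim(const struct nru_set *s)
        {
            for (int w = 0; w < WAYS; w++)
                if (!(s->used & (1u << w)))
                    return w;          /* first not-recently-used way */
            return 0;                  /* unreachable: touch resets at all-ones */
        }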

    I have not heard any advances in WCET since that time, but maybe I
    just went to the wrong meetings.

    I believe there is active research going on on how to overcome this
    "bias for the common case" with today's processors.

    When I heard about the cache work, they also talked about the
    processors and what you know about their execution time. IIRC they
    found that most high-performance architectures of the day were
    problematic.

    Hard Real Time does not like caches if you are within 50% of consuming
    all CPU cycles; and does not like branch predictors or prefetchers.

    One other thing I remember is that on some PowerPC CPU one could lock
    certain cache lines in the cache, so they will not be replaced. So if
    you use 6 of your 8 ways (of a 32KB cache with 4KB ways) for locking
    stuff into the cache, and the other two ways for performing analysable
    accesses, it's a lot better than a CPU without a cache.

    We used to use Cache line locking to take a set of cache lines (say 4)
    and lock one whose data or tag store was "bad" and that line would go
    from n-way set associative to (n-1)-way set associative.

    Now ARM offers cores with the R profile (e.g., ARMv8-R), where R
    stands for real-time. I have not looked what the properties of these
    cores are. I found it surprising that the big market for them seems
    to be disk drives and such things.

    Why would disk drives NEED a Real Time controller ??

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Sun Nov 19 12:46:10 2023
    On 11/19/2023 11:25 AM, MitchAlsup wrote:
    Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant.  For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints?  We have branch predictors.

    * Branch target hints?  We have BTBs and indirect branch predictors.

    * Prefetch instructions?  Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases.  So
    you have to consider the worst case for, e.g., branch prediction,
    unless you can prove that you get a better case.  And I have not heard
    about such proofs for dynamic branch predictors, so static branch
    prediction (hints) may indeed be the way to go.

    Suddenly, all the things done for optimizing in hardware in the
    general case (branch prediction, cache eviction, ...) can make
    performance for the unusual, but relevant, case worse.

    Actually, for caches with LRU replacement I have heard that they can
    be analysed; that was research coming out of Saarbruecken ~20 years
    ago, and I think they did a spin-off for commercializing it.  One
    problem that they had was that the usual n-way set-associative caches
    with n>2 don't have proper LRU replacement, but pseudo-LRU; with these
    caches the guarantees degrade to those of a 2-way cache (with the same
    way sizes).  I don't remember if they used that for data or for
    instructions.

    I have not seen real LRU for 2 decades. We mostly use what is called::
    "not recently used" which is a set of n-bits (n==sets):: When then n-th
    bit gets set, all n-bits are cleared.

    I have not heard any advances in WCET since that time, but maybe I
    just went to the wrong meetings.

    I believe there is active research going on on how to overcome this
    "bias for the common case" with today's processors.

    When I heard about the cache work, they also talked about the
    processors and what you know about their execution time.  IIRC they
    found that most high-performance architectures of the day were
    problematic.

    Hard Real Time does not like caches if you are within 50% of consuming
    all CPU cycles; and does not like branch predictors or prefetchers.

    One other thing I remember is that on some PowerPC CPU one could lock
    certain cache lines in the cache, so they will not be replaced.  So if
    you use 6 of your 8 ways (of a 32KB cache with 4KB ways) for locking
    stuff into the cache, and the other two ways for performing analysable
    accesses, it's a lot better than a CPU without a cache.

    We used to use Cache line locking to take a set of cache lines (say 4)
    and lock one whose data or tag store was "bad" and that line would go
    from n-way set associative to (n-1)-way set associative.

    Now ARM offers cores with the R profile (e.g., ARMv8-R), where R
    stands for real-time.  I have not looked what the properties of these
    cores are.  I found it surprising that the big market for them seems
    to be disk drives and such things.

    Why would disk drives NEED a Real Time controller ??

    Caveat. This was all true about 25 years ago when I retired, but may
    have changed.

    Several areas. As the disk spins, you have a limited amount of time
    from when the header comes under the heads to read the header, verify
    the ECC, check if the record number in the header matches the desired
    one and does not represent a defect area to be skipped, and if
    everything is correct, start the transfer into the buffer, or from the
    buffer to the write circuitry. Of course, in this case, time is space,
    so you want to minimize this time so you can minimize the gap space to
    allow maximum use of the track for data.

    Another area is disk arm tracking. Due to runout, the tracks are
    not perfectly circular, and so the head position must be adjusted in
    real time. There are servo bursts spaced periodically around the track
    and the hardware decodes these to provide information to the head
    positioning algorithm to slightly move the head boom in or out a little
    to place it optimally above the data.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Anton Ertl on Mon Nov 20 01:42:10 2023
    On 2023-11-19 17:43, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases. So
    you have to consider the worst case for, e.g., branch prediction,
    unless you can prove that you get a better case.


    Yes, for the classical "static" WCET analysis approach. There are other
    "hybrid" or "probabilistic" methods to compute "WCET estimates" that
    should have a very small probability of being exceeded. The idea is that
    once that probability is smaller than the probability of system failure
    from other causes (say, uncorrectable HW failure) the "WCET" estimate is
    good enough.


    And I have not heard
    about such proofs for dynamic branch predictors, so static branch
    prediction (hints) may indeed be the way to go.


    There are published "static" WCET analyses of various kinds of dynamic
    branch predictors, in general analogous to static WCET analyses of
    caches. But I don't know how well they perform.


    Suddenly, all the things done for optimizing in hardware in the
    general case (branch prediction, cache eviction, ...) can make
    performance for the unusual, but relevant, case worse.


    Indeed, and often they also create side channels for security breaches.


    Actually, for caches with LRU replacement I have heard that they can
    be analysed; that was research coming out of Saarbruecken ~20 years
    ago, and I think they did a spin-off for commercializing it.


    Yes, AbsInt GmbH. See www.absint.com for their commercial WCET analysis
    tools. There are also several open-source, non-commercial tools.


    One problem that they had was that the usual n-way set-associative
    caches with n>2 don't have proper LRU replacement, but pseudo-LRU;
    with these caches the guarantees degrade to those of a 2-way cache
    (with the same way sizes). I don't remember if they used that for
    data or for instructions.

    Yep, and some caches even have randomized replacement (which can in fact
    be good for the probabilistic WCET-analysis methods). Another problem is
    the use of unified I+D caches, where the difficulty of statically
    predicting D addresses harms the analysis of I accesses with statically
    known addresses.


    I have not heard any advances in WCET since that time, but maybe I
    just went to the wrong meetings.


    For many years there has been an annual WCET Analysis Workshop in
    connection with the ECRTS conferences (Euromicro Conference on Real-Time
    Systems, https://www.ecrts.org/about-ecrts/).


    I believe there is active research going on on how to overcome this
    "bias for the common case" with today's processors.

    When I heard about the cache work, they also talked about the
    processors and what you know about their execution time. IIRC they
    found that most high-performance architectures of the day were
    problematic.


    Indeed. The problems are partly due to increasing processor complexity,
    partly to the poor or lacking (unavailable) documentation of processor
    microarchitectures, making their cycle-accurate modelling/analysis
    impossible.

    In response, AbsInt have broadened their tool-set to include hybrid
    measurement-and-analysis WCET tools and approximate static analysers
    for "exploring" the likely execution times of an application, without
    producing a guaranteed WCET bound.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Sun Nov 19 23:28:19 2023
    On Sun, 19 Nov 2023 19:25:50 +0000, mitchalsup@aol.com (MitchAlsup)
    wrote:

    Hard Real Time does not like caches if you are within 50% of consuming
    all CPU cycles; and does not like branch predictors or prefetchers.

    Please forgive me if I am badly mistaken, but it sounds to me like you
    may be conflating "real time" with "real fast".

    Although they often do go hand in hand, "real time" is only about
    meeting deadlines: speed, load, cache behavior, etc., are relevant
    only to the extent that they cause you to miss deadlines.


    I used to do HRT machine vision industrial QA/QC systems. Unless the
    machinery is on fire[*], the conveyor keeps going - so these systems
    were "hard" in the sense that there were absolute deadlines to provide
    results.

    [*] sometimes the conveyor keeps going even if the machinery is on
    fire. 8-)

    Some systems simultaneously checked multiple parts at different stages
    of production and with differing deadlines. Sometimes the objective
    was to waylay a bad part at an upcoming sort station, other times a
    bad part would just continue on down the line and the objective was to
    notify the machine(s) to avoid doing any more work on it.

    At the same time, the systems had to drive graphic displays showing
    the operator what was going on in near real time. This usually took
    the form of one or more (reduced in size) images of actual inspected
    parts overlaid with identified "defects" [colored to distinguish
    warnings from failures], along with runtime counts of passed, failed,
    warned, and total parts so the operator could judge progress of the
    job and performance of their machinery.

    Most systems had to be made to work with already existing machinery,
    so I usually had no control over deadlines and compute intervals - I
    simply had to deal with them. Deadlines ranged from ~20ms at the very
    low end to ~900ms at the very high end. Depending on cameras just
    capturing images could take 16..66ms before computation could even
    start. Often there were multiple threads[*] simultaneously performing
    different inspections on different cameras with different deadlines.

    [*] using "thread" loosely here: some systems really were co-routines
    or multiple processes due to OS/RTS not supporting real threads.


    Naturally, the idea was to do the job using the lowest cost hardware
    possible, and often that meant a SIMD capable Pentium SBC. I had
    systems running on everything from P5-MMX to Pentium 4. No reason to
    serve up a 1GHz Pentium 4 [with SSE(2,3,4.x)] if a 233MHz Pentium II
    [with MMX] would do the job. Often that meant coding multiple
    versions for MMX and SSE(2,3,4.x), or for MMX+FPU and MMX+SSE so
    [where possible] we had a choice to run on different CPUs at different
    price points.
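
    [A sketch of the usual pattern for that kind of per-CPU dispatch; the
    filter functions are invented, and __builtin_cpu_supports is a
    GCC/Clang x86 builtin that postdates the era described.]

        void filter_mmx(const unsigned char *src, unsigned char *dst, int n);
        void filter_sse2(const unsigned char *src, unsigned char *dst, int n);

        void filter(const unsigned char *src, unsigned char *dst, int n)
        {
            if (__builtin_cpu_supports("sse2"))
                filter_sse2(src, dst, n);   /* Pentium 4 and later */
            else
                filter_mmx(src, dst, n);    /* MMX-only parts, e.g. P5-MMX */
        }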

    YMMV,
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to George Neuner on Mon Nov 20 06:18:11 2023
    George Neuner <gneuner2@comcast.net> schrieb:

    Although they often do go hand in hand, "real time" is only about
    meeting deadlines: speed, load, cache behavior, etc., are relevant
    only to the extent that they cause you to miss deadlines.

    The most succinct definition I have heard of a real-time system
    is that "a late answer is a wrong answer".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Niklas Holsti on Mon Nov 20 08:18:14 2023
    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2023-11-19 17:43, Anton Ertl wrote:
    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases. So
    you have to consider the worst case for, e.g., branch prediction,
    unless you can prove that you get a better case.


    Yes, for the classical "static" WCET analysis approach. There are other
    "hybrid" or "probabilistic" methods to compute "WCET estimates" that
    should have a very small probability of being exceeded. The idea is that
    once that probability is smaller than the probability of system failure
    from other causes (say, uncorrectable HW failure) the "WCET" estimate is
    good enough.

    The question is how much we can trust such estimates. If you asked
    experts in nuclear safety in 2010 to estimate the probability of
    having n meltdowns in light-water reactors in one year, they would
    have provided a vanishingly small number for n=3, probably lower than
    the "probability of system failure" you mention. Yet n=3 happened in
    2011, and, I think, if you asked such experts after 2011, they would
    not give such low estimates.

    Actually, for caches with LRU replacement I have heard that they can
    be analysed; that was research coming out of Saarbruecken ~20 years
    ago, and I think they did a spin-off for commercializing it.


    Yes, AbsInt GmbH. See www.absint.com for their commercial WCET analysis
    tools. There are also several open-source, non-commercial tools.

    Yes, that's the company, thanks for reminding me.

    For many years there has been an annual WCET Analysis Workshop in
    connection with the ECRTS conferences (Euromicro Conference on Real-Time
    Systems, https://www.ecrts.org/about-ecrts/).

    This is not my research area, so I only ever heard about this stuff in
    other compiler researcher meetings. Anyway, if they still have
    meetings, I assume there is still progress in that area, although I
    would have to look at it in more detail to see if there is still work
    on static WCET, or if that work has stopped and they are making do
    with probabilistic methods, because the users want more performance
    than can be guaranteed with static WCET methods.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Anton Ertl on Mon Nov 20 11:51:28 2023
    On 2023-11-20 10:18, Anton Ertl wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2023-11-19 17:43, Anton Ertl wrote:
    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases. So
    you have to consider the worst case for, e.g., branch prediction,
    unless you can prove that youget a better case.


    Yes, for the classical "static" WCET analysis approach. There are other
    "hybrid" or "probabilistic" methods to compute "WCET estimates" that
    should have a very small probability of being exceeded. The idea is that
    once that probability is smaller than the probability of system failure
    from other causes (say, uncorrectable HW failure) the "WCET" estimate is
    good enough.

    The question is how much we can trust such estimates.


    Indeed, and I don't trust them much or at all.


    If you asked experts in nuclear safety in 2010 to estimate the
    probability of having n meltdowns in light-water reactors in one
    year, they would have provided a vanishingly small number for n=3,
    probably lower than the "probability of system failure" you mention.
    Yet n=3 happened in 2011, and, I think, if you asked such experts
    after 2011, they would not give such low estimates.

    I think the probabilistic WCET estimates are, or try to be, on a bit
    more solid ground. They start by measuring the execution times of basic
    code blocks in a suite of tests, followed by constructing the worst-case
    execution path based on the measured block times, with an estimate of
    the probability of exceeding that worst case based on "extreme value
    statistics" (https://en.wikipedia.org/wiki/Extreme_value_theory). But
    this depends on assumptions about the statistics and statistical
    independence of the variations of the execution time of different code
    blocks, which IMO is suspect for conventional processors, but is perhaps
    true for randomized HW such as caches with randomized replacement policies.
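
    [To make the path-construction step concrete, a toy sketch; the
    four-block diamond CFG and its per-block times are invented. Given a
    worst-case or high-percentile time per basic block, the program's
    bound is the longest path through the control-flow DAG.]

        #include <stdio.h>

        #define NBLOCKS 4

        int main(void)
        {
            /* Per-block times in cycles: entry, then-arm, else-arm, exit. */
            const int t[NBLOCKS] = { 10, 50, 30, 5 };
            /* succ[i][j] != 0 means an edge from block i to block j. */
            const int succ[NBLOCKS][NBLOCKS] = {
                { 0, 1, 1, 0 },   /* entry -> then, entry -> else */
                { 0, 0, 0, 1 },   /* then  -> exit */
                { 0, 0, 0, 1 },   /* else  -> exit */
                { 0, 0, 0, 0 },
            };
            int wcet[NBLOCKS] = { t[0] };  /* longest path ending at each block */

            for (int i = 0; i < NBLOCKS; i++)      /* blocks in topological order */
                for (int j = i + 1; j < NBLOCKS; j++)
                    if (succ[i][j] && wcet[i] + t[j] > wcet[j])
                        wcet[j] = wcet[i] + t[j];

            printf("WCET bound: %d cycles\n", wcet[NBLOCKS - 1]);  /* 65 */
            return 0;
        }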

    There are of course several publications of benchmark and case studies
    showing that such "pWCET" estimates were not exceeded in their tests.
    But this does not convince me that the statistical analysis is correct,
    because for non-trivial programs the construction of the worst-case
    execution path usually introduces a lot of pessimism that increases the
    pWCET estimate and may hide the details of the extreme-value theory.


    For many years there has been an annual WCET Analysis Workshop in
    connection with the ECRTS conferences (Euromicro Conference on Real-Time
    Systems. https://www.ecrts.org/about-ecrts/).

    This is not my research area, so I only ever heard about this stuff in
    other compiler researcher meetings. Anyway, if they still have
    meetings, I assume there is still progress in that area, although I
    would have to look at it in more detail to see if there is still work
    on static WCET, or if that work has stopped and they are making do
    with probabilistic methods, because the users want more performance
    than can be guaranteed with static WCET methods.


    One problem is that it is becoming harder to find any processors with
    simple, fixed execution times. Even small microcontrollers often have
    "flash accelerators" that work a bit like instruction caches.

    Current research in static WCET analysis seems focused mostly on
    multi-core processors, with analyses trying to bound inter-core
    interference / blocking caused by shared resources such as buses and
    higher-level caches. The most common approach is some variation of
    Time-Division Multiple Access (TDMA) methods, which imply restrictions
    on task scheduling that are not pleasant for SW developers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Mon Nov 20 10:45:33 2023
    On Mon, 20 Nov 2023 06:18:11 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    George Neuner <gneuner2@comcast.net> schrieb:

    Although they often do go hand in hand, "real time" is only about
    meeting deadlines: speed, load, cache behavior, etc., are relevant
    only to the extent that they cause you to miss deadlines.

    The most succinct definition I have heard of a real-time system
    is that "a late answer is a wrong answer".

    In many systems early answers also are wrong.

    The truth is that a hard real time (HRT) computation has a defined
    time window during which a delivered result is meaningful. Outside of
    that defined window, any result is meaningless.
    [Of course the delivery "window" may start concurrently with beginning
    the computation, but that isn't always the case.]

    Soft real time (SRT) is distinguished from HRT in that a result /may/
    still be meaningful even if delivered late.
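
    [A sketch of that delivery-window notion in C, using the POSIX
    monotonic clock; the window bounds and delivery function are
    placeholders. A result counts only if it lands inside
    [t_open, t_deadline]; anything earlier or later is rejected.]

        #include <stdbool.h>
        #include <time.h>

        static double now_s(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec + ts.tv_nsec * 1e-9;
        }

        /* Deliver a result only inside its validity window. */
        bool deliver_in_window(double t_open, double t_deadline,
                               void (*deliver)(void))
        {
            double t = now_s();
            if (t < t_open || t > t_deadline)
                return false;   /* early or late: the result is meaningless */
            deliver();
            return true;
        }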

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Nov 20 11:03:31 2023
    Yes, for the classical "static" WCET analysis approach. There are other
    "hybrid" or "probabilistic" methods to compute "WCET estimates" that
    should have a very small probability of being exceeded. The idea is that
    once that probability is smaller than the probability of system failure
    from other causes (say, uncorrectable HW failure) the "WCET" estimate is
    good enough.

    I expect there are often other factors in determining the
    desired probability. IOW it's probably(!) more like "the probability is
    low enough that we can tolerate this rate of failure".

    IOW this turns the distinction between "soft" and "hard" real time
    from a yes/no question to a continuum.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Mon Nov 20 17:18:14 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    mitchalsup@aol.com (MitchAlsup) writes:

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    All can be interesting for real-time systems, which react to
    rare occurrences, and where performance for these matters (and
    does not for the normal case).

    The things I have heard about hard real-time systems (RTS) and
    worst-case execution time (WCET) analysis for hard RTS is that all
    cases have to be within the deadline, including uncommon cases. So
    you have to consider the worst case for, e.g., branch prediction,
    unless you can prove that you get a better case. And I have not heard
    about such proofs for dynamic branch predictors, so static branch
    prediction (hints) may indeed be the way to go.

    Suddenly, all the things done for optimizing in hardware in the
    general case (branch prediction, cache eviction, ...) can make
    performance for the unusual, but relevant, case worse.

    Actually, for caches with LRU replacement I have heard that they can
    be analysed; that was research coming out of Saarbruecken ~20 years
    ago, and I think they did a spin-off for commercializing it. One
    problem that they had was that the usual n-way set-associative caches
    with n>2 don't have proper LRU replacement, but pseudo-LRU; with these
    caches the guarantees degrade to those of a 2-way cache (with the same
    way sizes). I don't remember if they used that for data or for
    instructions.

    I have not seen real LRU for 2 decades. We mostly use what is called::
    "not recently used" which is a set of n-bits (n==sets):: When then n-th
    bit gets set, all n-bits are cleared.

    I have not heard any advances in WCET since that time, but maybe I
    just went to the wrong meetings.

    I believe there is active research going on on how to overcome this
    "bias for the common case" with today's processors.

    When I heard about the cache work, they also talked about the
    processors and what you know about their execution time. IIRC they
    found that most high-performance architectures of the day were
    problematic.

    Hard Real Time does not like caches if you are within 50% of consuming
    all CPU cycles; and does not like branch predictors or prefetchers.

    One other thing I remember is that on some PowerPC CPU one could lock
    certain cache lines in the cache, so they will not be replaced. So if
    you use 6 of your 8 ways (of a 32KB cache with 4KB ways) for locking
    stuff into the cache, and the other two ways for performing analysable
    accesses, it's a lot better than a CPU without a cache.

    We used to use Cache line locking to take a set of cache lines (say 4)
    and lock one whose data or tag store was "bad" and that line would go
    from n-way set associative to (n-1)-way set associative.

    Now ARM offers cores with the R profile (e.g., ARMv8-R), where R
    stands for real-time. I have not looked what the properties of these
    cores are. I found it surprising that the big market for them seems
    to be disk drives and such things.

    Why would disk drives NEED a Real Time controller ??

    Managing the heads. Error correction. et alia.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to George Neuner on Mon Nov 20 22:33:38 2023
    On 2023-11-20 6:28, George Neuner wrote:
    On Sun, 19 Nov 2023 19:25:50 +0000, mitchalsup@aol.com (MitchAlsup)
    wrote:

    Hard Real Time does not like caches if you are within 50% of consuming
    all CPU cycles; and does not like branch predictors or prefetchers.

    Please forgive me if I am badly mistaken, but it sounds to me like you
    may be conflating "real time" with "real fast".


    The hard real-time (HRT) domain can be further divided into critical
    and non-critical. Typically, SW for a critical HRT system must be
    validated, perhaps even certified, which requires proof or strong
    arguments that
    deadlines will be met /always/.

    A non-critical HRT system is one where the consequences of deadline
    misses can be tolerated, if such misses are not too frequent, and so one
    can get by with less strict validation. For example, some years ago I
    saw a presentation of an ASML photolithography machine where the SW
    certainly had HRT requirements but where a deadline miss typically led
    to only one of the chips on the wafer being damaged (and later
    discarded), a relatively small loss.

    The SW in that ASML machine did "dry runs" of the processes to "warm up"
    the caches before the actual exposures, and of course monitored deadline
    misses so that it knew which chip(s) might have to be discarded.


    Although they often do go hand in hand, "real time" is only about
    meeting deadlines: speed, load, cache behavior, etc., are relevant
    only to the extent that they cause you to miss deadlines.


    In critical HRT systems, dynamic "accelerators" such as caches and
    predictors are relevant also because they make it much harder to
    verify/prove that deadlines will always be met, for example by making
    static WCET analysis harder or impractical and by making execution-time
    measurements more variable and less dependable.


    I used to do HRT machine vision industrial QA/QC systems. Unless the
    machinery is on fire[*], the conveyor keeps going - so these systems
    were "hard" in the sense that there were absolute deadlines to provide
    results.


    But (I assume) people did not die if a deadline was missed, so I would
    call this a non-critical HRT system.


    Some systems simultaneously checked multiple parts at different stages
    of production and with differing deadlines [...]

    At the same time, the systems had to drive graphic displays showing
    the operator what was going on in near real time. [...]

    Most systems had to be made to work with already existing machinery,
    so I usually had no control over deadlines and compute intervals - I
    simply had to deal with them. Deadlines ranged from ~20ms at the very
    low end to ~900ms at the very high end. Depending on cameras just
    capturing images could take 16..66ms before computation could even
    start. Often there were multiple threads[*] simultaneously performing
    different inspections on different cameras with different deadlines.


    Those deadlines are fairly relaxed. If you have lots of processor
    margin, you can tolerate the unpredictability of caches etc. in a
    non-critical HRT system.

    I'm not saying that you had an easy job -- from your description it was
    certainly complex and demanding, especially as you had to find
    lowest-cost suitable HW -- but it seems you did not have to prove that
    deadlines would always be met, just demonstrate, by testing, that they
    were very rarely missed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Nov 20 18:07:02 2023
    The hard real-time (HRT) domain can be further divided into critical
    and non-critical.

    I like to use music players as an example of real-time, since if you're
    late (even by just a few ms), the result is a failure.
    But the expected harm is just that you may lose users/customers if it
    happens too often.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stefan Monnier on Tue Nov 21 00:46:59 2023
    Stefan Monnier wrote:

    The hard real-time (HRT) domain can be further divided into critical and
    non-critical.

    I like to use music players as an example of real-time, since if you're
    late (even by just a few ms), the result is a failure.

    And yet, modern music is re-timed from the original human-produced sounds
    such that each note is perfectly aligned with its supposed time. This gives
    the sound an artificial and mechanical tonality even if it is "more
    perfect". Almost all of this re-timing is at the millisecond level.

    But the expected harm is just that you may lose users/customers if it
    happens too often.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)