• WCET analysis (was: Privilege Levels Below User)

    From Stefan Monnier@21:1/5 to All on Tue Jun 11 17:14:15 2024
    > Not always. If the mistakenly speculated cache-fetch /evicted/ some
    > other data from the (finite-sized) cache, and the evicted data are
    > needed later on the /true/ execution path, the mistakenly speculated
    > fetch has a /negative/ effect on performance. (This kind of "timing
    > anomaly" is very bothersome in static WCET analysis.)

    BTW, I heard that nowadays "static WCET analysis" is typically based
    on statistical models of the performance of various parts of the code.
    So they don't give an ironclad worst case any more (contrary to what
    you'd get back in the day when you could count cycles and there were
    no caches), but rather one that is "very unlikely" to be wrong.

    I remember papers doing "real WCET" by analyzing the code to detect
    those memory accesses which were guaranteed to be cache hits. Do you
    know if this is still done nowadays?


    Stefan

  • From Stefan Monnier@21:1/5 to All on Tue Jun 11 17:51:14 2024
    >> [...] is performed which *completely bypasses* the cache; [...]
    > Yes, critical word first.

    How important is this nowadays?
    From what I can tell, the bandwidth-time of transferring a whole cache
    line is extremely short compared to the latency to the first word, so
    I assumed that whether the critical word is sent first or not wouldn't
    make much of a difference.

    Is my intuition wrong?
    Or is it simply that the difference is small but the cost is
    even smaller?


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Jun 11 22:50:36 2024
    Stefan Monnier wrote:

    >>> [...] is performed which *completely bypasses* the cache; [...]
    >> Yes, critical word first.

    > How important is this nowadays?

    With interconnect widths of ½ to 1 cache line per cycle, its utility
    has been significantly reduced.

    > From what I can tell, the bandwidth-time of transferring a whole cache
    > line is extremely short compared to the latency to the first word, so
    > I assumed that whether the critical word is sent first or not wouldn't
    > make much of a difference.

    When one has 64 cores on a die, the interconnect BW basically has to
    be about 1 cache line per cycle.

    If each of those 64 cores makes roughly one access per cycle with a 1%
    miss rate in the L2 cache, then every 100 cycles we have to service 64
    or more misses. So, you need a BW that big.
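
    A back-of-the-envelope check of that arithmetic (the per-core access
    rate and the line size here are illustrative assumptions, not figures
    from the post):

        # Rough interconnect-bandwidth estimate; parameters are assumptions.
        cores = 64
        accesses_per_core_per_cycle = 1.0    # assumed issue rate
        l2_miss_rate = 0.01                  # 1%, as in the post
        line_bytes = 64                      # assumed cache-line size

        misses_per_cycle = cores * accesses_per_core_per_cycle * l2_miss_rate
        print(f"misses per 100 cycles: {misses_per_cycle * 100:.0f}")   # ~64
        # Servicing ~0.64 lines/cycle already calls for an interconnect
        # close to one full cache line per cycle:
        print(f"needed bandwidth: {misses_per_cycle * line_bytes:.0f} B/cycle")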

    Then once you have to mux out any given word, you might just as well
    hold the line and service outstanding requests from the buffer.

    The more CPUs per die and the greater the instruction width per cycle,
    the greater the BW the cores<->memory interconnect must provide.

    When the width is lower, CWF is more important.
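
    A minimal sketch of why width matters for critical-word-first; the bus
    widths and the simple beat-counting model are assumptions made for
    illustration:

        # Beats until the requested word arrives, with and without
        # critical-word-first (CWF). All numbers are illustrative.
        LINE_BYTES = 64

        def beats_to_needed_word(bus_bytes_per_beat, word_offset_bytes,
                                 cwf=False):
            """Transfer beats that pass before the needed word shows up."""
            if cwf:
                return 1                    # needed word rides the first beat
            return word_offset_bytes // bus_bytes_per_beat + 1

        for width in (8, 32, 64):           # narrow, half-line, full-line bus
            worst = beats_to_needed_word(width, LINE_BYTES - width)
            print(f"width {width:2d} B/beat: worst wait {worst} beats "
                  f"without CWF vs 1 with CWF")
        # At 64 B/beat the whole line lands in one beat, so CWF saves
        # nothing; at 8 B/beat it can save up to 7 beats of the fill.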

    > Is my intuition wrong?
    > Or is it simply that the difference is small but the cost is
    > even smaller?

    When you can put 100 GBOoO cores on a chip, you should be able to
    feed them at appropriate rates.


    > Stefan

  • From Niklas Holsti@21:1/5 to Stefan Monnier on Wed Jun 12 10:08:14 2024
    On 2024-06-12 0:14, Stefan Monnier wrote:
    >> Not always. If the mistakenly speculated cache-fetch /evicted/ some
    >> other data from the (finite-sized) cache, and the evicted data are
    >> needed later on the /true/ execution path, the mistakenly speculated
    >> fetch has a /negative/ effect on performance. (This kind of "timing
    >> anomaly" is very bothersome in static WCET analysis.)

    > BTW, I heard that nowadays "static WCET analysis" is typically based
    > on statistical models of the performance of various parts of the code.
    > So they don't give an ironclad worst case any more (contrary to what
    > you'd get back in the day when you could count cycles and there were
    > no caches), but rather one that is "very unlikely" to be wrong.

    > I remember papers doing "real WCET" by analyzing the code to detect
    > those memory accesses which were guaranteed to be cache hits. Do you
    > know if this is still done nowadays?


    I haven't followed the WCET-analysis field closely for a few years, so I
    may not be up to date. But here is my understanding, with apologies for
    the long text:

    Fully static, non-probabilistic WCET analysis is no longer possible,
    practical or profitable for most "big" processors because of their
    complex microarchitectures, with out-of-order execution and speculation
    being the most bothersome parts. The difficulty of getting accurate
    information about the microarchitectures is a large factor, too, as is
    the cost of creating a model of such a microarchitecture, compared to
    the small market for such WCET tools for a particular microarchitecture.

    For in-order processors, static analysis of cache hits and misses is
    very possible, and has a great deal of good theory behind it. It works
    quite well for I-caches, but for D-caches it is much harder and more
    pessimistic because of the difficulty of predicting the addresses of
    data accesses, which depends a lot on the nature of the application
    being analyzed. Cache analysis also becomes harder in the presence of
    preemption, multiple cores and multiple cache levels. However, any
    static WCET analysis tool with ambitions to give safe WCET bounds will
    certainly include static cache analysis, and can then give a safe
    upper bound on the WCET, but in some cases it may be quite pessimistic
    (over-estimated).
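
    For flavor, here is a toy sketch in the style of the classic
    abstract-interpretation "must" cache analysis: the abstract state
    tracks an upper bound on each block's LRU age, so an access is
    classified "always hit" only when the block is provably cached on
    every path. The code and parameters are illustrative, not taken from
    any real tool:

        # Toy "must" analysis for one fully associative LRU cache set.
        # State maps block -> upper bound on its LRU age (0 = youngest);
        # a block present in the state is guaranteed to be cached.
        ASSOC = 4                             # associativity; illustrative

        def must_access(state, block):
            """Abstract LRU update; returns (new_state, classification)."""
            cls = "always hit" if block in state else "may miss"
            old = state.get(block, ASSOC)     # ASSOC = "not provably cached"
            new = {b: (a + 1 if a < old else a)   # only younger blocks age
                   for b, a in state.items() if b != block}
            new = {b: a for b, a in new.items() if a < ASSOC}  # evict too-old
            new[block] = 0                    # accessed block is youngest
            return new, cls

        def must_join(s1, s2):
            """Control-flow merge: keep blocks cached on BOTH paths, with
            the worse (larger) age bound."""
            return {b: max(s1[b], s2[b]) for b in s1.keys() & s2.keys()}

        s, _ = must_access({}, "A")
        s, _ = must_access(s, "B")
        _, cls = must_access(s, "A")
        print(cls)                            # -> always hit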

    Some time ago, even before the complex microarchitectures became an
    acute problem, a class of "WCET" tools was developed that rely on a
    combination of execution-time measurements of the basic code blocks,
    static analysis of the control flow graphs and possible execution
    paths, and a statistical method called extreme-value statistics that
    aims to estimate the probability of outlier values (extreme and
    unlikely values) for the sum of the observed distributions of the
    execution times of the basic blocks. These methods do not need a
    microarchitecture model for the processor, but do need a way to
    measure basic-block execution times (which is doubtful in an OoO and
    speculative processor) and a "good enough" test suite. AIUI, these
    tools are successful and are much used. However, it is not clear that
    the probability they compute for exceeding their WCET "bound" is
    accurate, because the statistical theory depends on the distributions
    of the execution times of the individual basic code blocks being
    independent and uncorrelated, which is false for the ordinary sort of
    microarchitectures.
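
    A minimal sketch of the extreme-value step, on synthetic data and
    using SciPy's GEV fit; real measurement-based tools are far more
    careful about independence tests and fit diagnostics:

        # Fit a Generalized Extreme Value distribution to block maxima of
        # measured execution times, then read off a time with a chosen
        # exceedance probability. The data here are synthetic.
        import numpy as np
        from scipy.stats import genextreme

        rng = np.random.default_rng(0)
        times = 1000 + rng.gamma(shape=2.0, scale=15.0, size=100_000)

        block = 100                       # block-maxima method
        maxima = times[: len(times) // block * block].reshape(-1, block).max(axis=1)

        shape, loc, scale = genextreme.fit(maxima)
        # "pWCET": a value exceeded with probability 1e-6 per block of runs
        pwcet = genextreme.isf(1e-6, shape, loc=loc, scale=scale)
        print(f"pWCET estimate (exceedance 1e-6): {pwcet:.0f} cycles")
        # The validity of such a bound hinges on the measurements being
        # independent -- exactly the assumption questioned above.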

    To overcome that problem, there has been work to create "randomized"
    HW components, for example randomized caches, where it is possible to
    make statistical models of each component and the statistics of
    different components are known, by design and construction, to be
    independent and uncorrelated. However, I don't know how far this work
    has progressed, nor how completely a processor can be randomized in
    this sense. A couple of years ago I saw that one practical application
    of this work simply used ordinary hardware but proposed to pick a new
    random memory layout (link map) of the application every time the
    application was loaded and started, which seems a cop-out.
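
    For illustration, a toy random-replacement cache of the kind such
    randomized designs build on: hit/miss behavior becomes a random
    variable by construction rather than a deterministic function of the
    access history. Purely a sketch, not any proposed design:

        import random

        class RandomReplacementCache:
            """Toy fully associative cache: on a miss to a full cache the
            victim is chosen uniformly at random, so timing behavior is
            randomized by construction."""
            def __init__(self, ways, seed=None):
                self.ways = ways
                self.lines = []
                self.rng = random.Random(seed)

            def access(self, block):
                if block in self.lines:
                    return "hit"
                if len(self.lines) >= self.ways:
                    self.lines[self.rng.randrange(self.ways)] = block
                else:
                    self.lines.append(block)
                return "miss"

        # The same trace now yields a *distribution* of miss counts,
        # which is what the probabilistic timing analysis models.
        trace = [b % 6 for b in range(1000)]  # 6 blocks in 4 ways: contention
        cache = RandomReplacementCache(ways=4, seed=1)
        print(sum(cache.access(b) == "miss" for b in trace))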

    Recently a new type of "WCET" tool has appeared that does not claim to
    do more than "estimate" or "explore" the execution time of an
    application on a processor that is described only by its visible
    architecture (instruction set) and some capacity and performance
    numbers. I don't know exactly how these tools work, but it seems likely
    that they use a simplified, in-order, non-speculative model of the
    processor and can then apply the classical "static WCET" analysis
    methods. They don't claim to provide an accurate or even probabilistic
    WCET bound, but rather help the user understand which parts of the
    application are likely to use most of the time, and give some confidence
    that the application will be fast enough on the real processor. This may
    be what most developers want and are content with.
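
    A sketch of the classical path-analysis core that such a simplified
    in-order model could feed: a longest-path computation over a CFG with
    per-block cycle estimates, each loop pre-collapsed to a single node
    using a user-supplied bound. The graph and the numbers are made up:

        # Longest path ("path analysis") over an acyclic CFG in which each
        # loop is already collapsed to one node costing bound * body cost.
        from functools import lru_cache

        cost = {"entry": 5, "init": 10, "loop": 100 * 12,   # 100 iterations
                "then": 8, "else": 20, "exit": 3}           # cycles
        succ = {"entry": ["init"], "init": ["loop"],
                "loop": ["then", "else"],
                "then": ["exit"], "else": ["exit"], "exit": []}

        @lru_cache(maxsize=None)
        def wcet(node):
            """Worst-case cycles from `node` to program exit."""
            return cost[node] + max((wcet(s) for s in succ[node]), default=0)

        print(wcet("entry"))   # 5 + 10 + 1200 + 20 + 3 = 1238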
