• WCET analysis (was: Privilege Levels Below User)

    From Stefan Monnier@21:1/5 to All on Tue Jun 11 17:14:15 2024
    > Not always. If the mistakenly speculated cache-fetch /evicted/ some
    > other data from the (finite-sized) cache, and the evicted data are
    > needed later on the /true/ execution path, the mistakenly speculated
    > fetch has a /negative/ effect on performance. (This kind of "timing
    > anomaly" is very bothersome in static WCET analysis.)

    BTW, I heard that nowadays "static WCET analysis" is typically based
    on statistical models of the performance of various parts of the code.
    So they don't give an ironclad worst case any more (contrary to what
    you'd get back in the day when you could count cycles and there were
    no caches), but rather one that is "very unlikely" to be wrong.

    I remember papers doing "real WCET" by analyzing the code to detect
    those memory accesses which were guaranteed to be cache hits. Do you
    know if this is still done nowadays?


    Stefan

  • From Stefan Monnier@21:1/5 to All on Tue Jun 11 17:51:14 2024
    >> [...] is performed which *completely bypasses* the cache; [...]
    > Yes, critical word first.

    How important is this nowadays?
    From what I can tell, the bandwidth-time of transferring a whole cache
    line is extremely short compared to the latency to the first word, so
    I assumed that whether the critical word is sent first or not wouldn't
    make much of a difference.

    Is my intuition wrong?
    Or is it simply that the difference is small but the cost is
    even smaller?


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Jun 11 22:50:36 2024
    Stefan Monnier wrote:

    >>> [...] is performed which *completely bypasses* the cache; [...]
    >> Yes, critical word first.

    > How important is this nowadays?

    With interconnect widths of ½ to 1 cache line per cycle, its utility
    has been significantly reduced.

    > From what I can tell, the bandwidth-time of transferring a whole cache
    > line is extremely short compared to the latency to the first word, so
    > I assumed that whether the critical word is sent first or not wouldn't
    > make much of a difference.

    When one has 64 cores on a die, the interconnect BW basically has to
    be about 1 cache line per cycle.

    If each of those 64 cores makes roughly one access per cycle with a 1%
    miss rate in the L2 cache, then every 100 cycles we have to service 64
    or more misses. So, you need a BW that big.
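
    A back-of-the-envelope check of that arithmetic (the per-core access
    rate and the line size here are illustrative assumptions, not figures
    from the post):

        # Rough interconnect-bandwidth estimate; parameters are assumptions.
        cores = 64
        accesses_per_core_per_cycle = 1.0    # assumed issue rate
        l2_miss_rate = 0.01                  # 1%, as in the post
        line_bytes = 64                      # assumed cache-line size

        misses_per_cycle = cores * accesses_per_core_per_cycle * l2_miss_rate
        print(f"misses per 100 cycles: {misses_per_cycle * 100:.0f}")   # ~64
        # Servicing ~0.64 lines/cycle already calls for an interconnect
        # close to one full cache line per cycle:
        print(f"needed bandwidth: {misses_per_cycle * line_bytes:.0f} B/cycle")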

    Then once you have to mux out any given word, you might just as well
    hold the line and service outstanding requests from the buffer.

    The more CPUs per die and the greater the instruction width per cycle,
    the greater the BW the cores<->memory interconnect must provide.

    When the width is lower, CWF is more important.
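
    A minimal sketch of why width matters for critical-word-first; the bus
    widths and the simple beat-counting model are assumptions made for
    illustration:

        # Beats until the requested word arrives, with and without
        # critical-word-first (CWF). All numbers are illustrative.
        LINE_BYTES = 64

        def beats_to_needed_word(bus_bytes_per_beat, word_offset_bytes,
                                 cwf=False):
            """Transfer beats that pass before the needed word shows up."""
            if cwf:
                return 1                    # needed word rides the first beat
            return word_offset_bytes // bus_bytes_per_beat + 1

        for width in (8, 32, 64):           # narrow, half-line, full-line bus
            worst = beats_to_needed_word(width, LINE_BYTES - width)
            print(f"width {width:2d} B/beat: worst wait {worst} beats "
                  f"without CWF vs 1 with CWF")
        # At 64 B/beat the whole line lands in one beat, so CWF saves
        # nothing; at 8 B/beat it can save up to 7 beats of the fill.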

    > Is my intuition wrong?
    > Or is it simply that the difference is small but the cost is
    > even smaller?

    When you can put 100 GBOoO cores on a chip, you should be able to
    feed them at appropriate rates.


    > Stefan

  • From Niklas Holsti@21:1/5 to Stefan Monnier on Wed Jun 12 10:08:14 2024
    On 2024-06-12 0:14, Stefan Monnier wrote:
    >> Not always. If the mistakenly speculated cache-fetch /evicted/ some
    >> other data from the (finite-sized) cache, and the evicted data are
    >> needed later on the /true/ execution path, the mistakenly speculated
    >> fetch has a /negative/ effect on performance. (This kind of "timing
    >> anomaly" is very bothersome in static WCET analysis.)

    > BTW, I heard that nowadays "static WCET analysis" is typically based
    > on statistical models of the performance of various parts of the code.
    > So they don't give an ironclad worst case any more (contrary to what
    > you'd get back in the day when you could count cycles and there were
    > no caches), but rather one that is "very unlikely" to be wrong.

    > I remember papers doing "real WCET" by analyzing the code to detect
    > those memory accesses which were guaranteed to be cache hits. Do you
    > know if this is still done nowadays?


    I haven't followed the WCET-analysis field closely for a few years, so I
    may not be up to date. But here is my understanding, with apologies for
    the long text:

    Fully static, non-probabilistic WCET analysis is no longer possible,
    practical or profitable for most "big" processors because of their
    complex microarchitectures, with out-of-order execution and speculation
    being the most bothersome parts. The difficulty of getting accurate
    information about the microarchitectures is a large factor, too, as is
    the cost of creating a model of such a microarchitecture, compared to
    the small market for such WCET tools for a particular microarchitecture.

    For in-order processors, static analysis of cache hits and misses is
    very possible, and has a great deal of good theory behind it. It works
    quite well for I-caches, but for D-caches it is much harder and more
    pessimistic because of the difficulty of predicting the addresses of
    data accesses, which depends a lot on the nature of the application
    being analyzed. Cache analysis also becomes harder in the presence of
    preemption, multiple cores and multiple cache levels. However, any
    static WCET analysis tool with ambitions to give safe WCET bounds will
    certainly include static cache analysis, and can then give a safe
    upper bound on the WCET, but in some cases it may be quite pessimistic
    (over-estimated).
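
    For flavor, here is a toy sketch in the style of the classic
    abstract-interpretation "must" cache analysis: the abstract state
    tracks an upper bound on each block's LRU age, so an access is
    classified "always hit" only when the block is provably cached on
    every path. The code and parameters are illustrative, not taken from
    any real tool:

        # Toy "must" analysis for one fully associative LRU cache set.
        # State maps block -> upper bound on its LRU age (0 = youngest);
        # a block present in the state is guaranteed to be cached.
        ASSOC = 4                             # associativity; illustrative

        def must_access(state, block):
            """Abstract LRU update; returns (new_state, classification)."""
            cls = "always hit" if block in state else "may miss"
            old = state.get(block, ASSOC)     # ASSOC = "not provably cached"
            new = {b: (a + 1 if a < old else a)   # only younger blocks age
                   for b, a in state.items() if b != block}
            new = {b: a for b, a in new.items() if a < ASSOC}  # evict too-old
            new[block] = 0                    # accessed block is youngest
            return new, cls

        def must_join(s1, s2):
            """Control-flow merge: keep blocks cached on BOTH paths, with
            the worse (larger) age bound."""
            return {b: max(s1[b], s2[b]) for b in s1.keys() & s2.keys()}

        s, _ = must_access({}, "A")
        s, _ = must_access(s, "B")
        _, cls = must_access(s, "A")
        print(cls)                            # -> always hit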

    Some time ago, even before the complex microarchitectures became an
    acute problem, a class of "WCET" tools was developed that rely on a
    combination of execution-time measurements of the basic code blocks,
    static analysis of the control flow graphs and possible execution
    paths, and a statistical method called extreme-value statistics that
    aims to estimate the probability of outlier values (extreme and
    unlikely values) for the sum of the observed distributions of the
    execution times of the basic blocks. These methods do not need a
    microarchitecture model for the processor, but do need a way to
    measure basic-block execution times (which is doubtful in an OoO and
    speculative processor) and a "good enough" test suite. AIUI, these
    tools are successful and are much used. However, it is not clear that
    the probability they compute for exceeding their WCET "bound" is
    accurate, because the statistical theory depends on the distributions
    of the execution times of the individual basic code blocks being
    independent and uncorrelated, which is false for the ordinary sort of
    microarchitectures.
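
    A minimal sketch of the extreme-value step, on synthetic data and
    using SciPy's GEV fit; real measurement-based tools are far more
    careful about independence tests and fit diagnostics:

        # Fit a Generalized Extreme Value distribution to block maxima of
        # measured execution times, then read off a time with a chosen
        # exceedance probability. The data here are synthetic.
        import numpy as np
        from scipy.stats import genextreme

        rng = np.random.default_rng(0)
        times = 1000 + rng.gamma(shape=2.0, scale=15.0, size=100_000)

        block = 100                       # block-maxima method
        maxima = times[: len(times) // block * block].reshape(-1, block).max(axis=1)

        shape, loc, scale = genextreme.fit(maxima)
        # "pWCET": a value exceeded with probability 1e-6 per block of runs
        pwcet = genextreme.isf(1e-6, shape, loc=loc, scale=scale)
        print(f"pWCET estimate (exceedance 1e-6): {pwcet:.0f} cycles")
        # The validity of such a bound hinges on the measurements being
        # independent -- exactly the assumption questioned above.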

    To overcome that problem, there has been work to create "randomized"
    HW components, for example randomized caches, where it is possible to
    make statistical models of each component and the statistics of
    different components are known, by design and construction, to be
    independent and uncorrelated. However, I don't know how far this work
    has progressed, nor how completely a processor can be randomized in
    this sense. A couple of years ago I saw that one practical application
    of this work simply used ordinary hardware but proposed to pick a new
    random memory layout (link map) of the application every time the
    application was loaded and started, which seems a cop-out.
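
    For illustration, a toy random-replacement cache of the kind such
    randomized designs build on: hit/miss behavior becomes a random
    variable by construction rather than a deterministic function of the
    access history. Purely a sketch, not any proposed design:

        import random

        class RandomReplacementCache:
            """Toy fully associative cache: on a miss to a full cache the
            victim is chosen uniformly at random, so timing behavior is
            randomized by construction."""
            def __init__(self, ways, seed=None):
                self.ways = ways
                self.lines = []
                self.rng = random.Random(seed)

            def access(self, block):
                if block in self.lines:
                    return "hit"
                if len(self.lines) >= self.ways:
                    self.lines[self.rng.randrange(self.ways)] = block
                else:
                    self.lines.append(block)
                return "miss"

        # The same trace now yields a *distribution* of miss counts,
        # which is what the probabilistic timing analysis models.
        trace = [b % 6 for b in range(1000)]  # 6 blocks in 4 ways: contention
        cache = RandomReplacementCache(ways=4, seed=1)
        print(sum(cache.access(b) == "miss" for b in trace))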

    Recently a new type of "WCET" tool has appeared that does not claim to
    do more than "estimate" or "explore" the execution time of an
    application on a processor that is described only by its visible
    architecture (instruction set) and some capacity and performance
    numbers. I don't know exactly how these tools work, but it seems likely
    that they use a simplified, in-order, non-speculative model of the
    processor and can then apply the classical "static WCET" analysis
    methods. They don't claim to provide an accurate or even probabilistic
    WCET bound, but rather help the user understand which parts of the
    application are likely to use most of the time, and give some confidence
    that the application will be fast enough on the real processor. This may
    be what most developers want and are content with.
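
    A sketch of the classical path-analysis core that such a simplified
    in-order model could feed: a longest-path computation over a CFG with
    per-block cycle estimates, each loop pre-collapsed to a single node
    using a user-supplied bound. The graph and the numbers are made up:

        # Longest path ("path analysis") over an acyclic CFG in which each
        # loop is already collapsed to one node costing bound * body cost.
        from functools import lru_cache

        cost = {"entry": 5, "init": 10, "loop": 100 * 12,   # 100 iterations
                "then": 8, "else": 20, "exit": 3}           # cycles
        succ = {"entry": ["init"], "init": ["loop"],
                "loop": ["then", "else"],
                "then": ["exit"], "else": ["exit"], "exit": []}

        @lru_cache(maxsize=None)
        def wcet(node):
            """Worst-case cycles from `node` to program exit."""
            return cost[node] + max((wcet(s) for s in succ[node]), default=0)

        print(wcet("entry"))   # 5 + 10 + 1200 + 20 + 3 = 1238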
