• Performance monitoring (was: Efficiency of in-order vs. OoO)

    From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 26 16:47:02 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
    that's why we not only have PMCs for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler utilization, etc.
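
    [As an illustration of the kind of PMCs Anton describes: a minimal
    sketch, assuming a Linux host where perf_event_open(2) is available.
    The kernel's generic hardware events stand in for the vendor-specific
    counters, and the workload function is a placeholder.]

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <linux/perf_event.h>

        /* Open one hardware counter for the calling thread; group_fd ties
           counters into a group so they start and stop together. */
        static int perf_open(unsigned long long config, int group_fd)
        {
            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_HARDWARE;
            attr.size = sizeof(attr);
            attr.config = config;
            attr.disabled = 1;
            attr.exclude_kernel = 1;
            attr.exclude_hv = 1;
            /* pid = 0, cpu = -1: measure this thread on any CPU */
            return (int)syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
        }

        /* Placeholder for the code under test. */
        static void workload(void)
        {
            volatile long s = 0;
            for (long i = 0; i < 10000000; i++)
                if (i % 3) s += i;
        }

        int main(void)
        {
            int branches = perf_open(PERF_COUNT_HW_BRANCH_INSTRUCTIONS, -1);
            int misses   = perf_open(PERF_COUNT_HW_BRANCH_MISSES, branches);

            ioctl(branches, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
            ioctl(branches, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
            workload();
            ioctl(branches, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

            long long nb = 0, nm = 0;   /* default read_format: one u64 per fd */
            read(branches, &nb, sizeof(nb));
            read(misses, &nm, sizeof(nm));
            printf("branches: %lld  misses: %lld  prediction accuracy: %.2f%%\n",
                   nb, nm, nb ? 100.0 * (nb - nm) / nb : 0.0);
            return 0;
        }

    [Opening the two counters as one group keeps them enabled over exactly
    the same interval, so the computed accuracy ratio is meaningful.]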

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Dallman@21:1/5 to Anton Ertl on Tue Mar 26 17:29:00 2024
    In article <2024Mar26.174702@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    There can be considerable confusion on this point. In the early days of
    Intel VTune, it would only work on small and simple programs, but Intel
    sent one of the lead developers to visit the UK with it, expecting that
    it would instantly find huge speed-ups in my employers' code.

    What happened was that VTune crashed almost instantly when faced with
    something that large, and Intel learned about the difference between microarchitecture analysis and application analysis.

    John

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Mar 26 18:47:38 2024
    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
    that's why we not only have PMCs for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler utilization, etc.

    Quit being so CPU-centric.

    You also need measurement of how many of which transactions flow across
    the bus, DRAM use analysis, and PCIe usage to fully tune the system.

    - anton

  • From MitchAlsup1@21:1/5 to All on Tue Oct 1 18:50:51 2024
    On Tue, 26 Mar 2024 18:47:38 +0000, MitchAlsup1 wrote:

    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy.

    That is the job of the architects; the designers are more concerned
    with their (myriad of) sequencers and how they interact with each other.

    And
    that's why we not only have PMCs for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler
    utilization, etc.

    Quit being so CPU-centric.

    You also need measurement of how many of which transactions flow across
    the bus, DRAM use analysis, and PCIe usage to fully tune the system.

    Every block big enough to have a unique name (e.g., the DRAM controller)
    should have its own set of PMCs. In general, sequencers are too small
    for their own PMCs. {{The PMCs would be larger than the sequencers
    they measure.}}
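
    [A companion sketch for the system-level counters Mitch describes,
    assuming Linux exposes such a block's PMU through perf. The PMU name
    "uncore_imc_0" and the event encoding below are platform-specific
    placeholders, not fixed values; real names and encodings come from
    /sys/bus/event_source/devices/ or vendor documentation.]

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <linux/perf_event.h>

        int main(void)
        {
            /* The dynamic PMU type id is published under
               /sys/bus/event_source/devices/<pmu>/type;
               "uncore_imc_0" is an example name that varies by platform. */
            FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
            if (!f) { perror("no such PMU on this system"); return 1; }
            int pmu_type;
            fscanf(f, "%d", &pmu_type);
            fclose(f);

            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.type = pmu_type;     /* the memory-controller PMU            */
            attr.size = sizeof(attr);
            attr.config = 0x0304;     /* placeholder event encoding; take the
                                         real value from the PMU's events/
                                         directory or vendor docs            */
            attr.disabled = 1;

            /* Uncore events are per-socket, not per-task: pid = -1, cpu = 0. */
            int fd = syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
            if (fd < 0) { perror("perf_event_open"); return 1; }

            ioctl(fd, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
            sleep(1);                 /* sample one second of traffic */
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

            long long count = 0;
            read(fd, &count, sizeof(count));
            printf("events in 1s: %lld\n", count);
            return 0;
        }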

    - anton
