• Performance monitoring (was: Efficiency of in-order vs. OoO)

    From Anton Ertl@21:1/5 to Scott Lurndal on Tue Mar 26 16:47:02 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
    that's why we not only have PMCs for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler utilization, etc.
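
    [As an illustration of the kind of PMCs Anton describes: a minimal
    sketch, assuming a Linux host where perf_event_open(2) is available.
    The kernel's generic hardware events stand in for the vendor-specific
    counters, and the workload function is a placeholder.]

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <linux/perf_event.h>

        /* Open one hardware counter for the calling thread; group_fd ties
           counters into a group so they start and stop together. */
        static int perf_open(unsigned long long config, int group_fd)
        {
            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_HARDWARE;
            attr.size = sizeof(attr);
            attr.config = config;
            attr.disabled = 1;
            attr.exclude_kernel = 1;
            attr.exclude_hv = 1;
            /* pid = 0, cpu = -1: measure this thread on any CPU */
            return (int)syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
        }

        /* Placeholder for the code under test. */
        static void workload(void)
        {
            volatile long s = 0;
            for (long i = 0; i < 10000000; i++)
                if (i % 3) s += i;
        }

        int main(void)
        {
            int branches = perf_open(PERF_COUNT_HW_BRANCH_INSTRUCTIONS, -1);
            int misses   = perf_open(PERF_COUNT_HW_BRANCH_MISSES, branches);

            ioctl(branches, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
            ioctl(branches, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
            workload();
            ioctl(branches, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

            long long nb = 0, nm = 0;   /* default read_format: one u64 per fd */
            read(branches, &nb, sizeof(nb));
            read(misses, &nm, sizeof(nm));
            printf("branches: %lld  misses: %lld  prediction accuracy: %.2f%%\n",
                   nb, nm, nb ? 100.0 * (nb - nm) / nb : 0.0);
            return 0;
        }

    [Opening the two counters as one group keeps them enabled over exactly
    the same interval, so the computed accuracy ratio is meaningful.]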

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Dallman@21:1/5 to Anton Ertl on Tue Mar 26 17:29:00 2024
    In article <2024Mar26.174702@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    There can be considerable confusion on this point. In the early days of
    Intel VTune, it would only work on small and simple programs, but Intel
    sent one of the lead developers to visit the UK with it, expecting that
    it would instantly find huge speed-ups in my employers' code.

    What happened was that VTune crashed almost instantly when faced with
    something that large, and Intel learned about the difference between microarchitecture analysis and application analysis.

    John

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Mar 26 18:47:38 2024
    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
    that's why we not only have PMCs for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler utilization, etc.

    Quit being so CPU-centric.

    You also need measurement of how many of which transactions flow across
    the bus, DRAM use analysis, and PCIe usage to fully tune the system.

    - anton

  • From MitchAlsup1@21:1/5 to All on Tue Oct 1 18:50:51 2024
    On Tue, 26 Mar 2024 18:47:38 +0000, MitchAlsup1 wrote:

    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy.

    That is the job of the architects; the designers are more concerned
    with their (myriad of) sequencers and how they interact with each other.

    And
    that's why we not only have PMCs for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler
    utilization, etc.

    Quit being so CPU-centric.

    You also need measurement of how many of which transactions flow across
    the bus, DRAM use analysis, and PCIe usage to fully tune the system.

    Every block big enough to have a unique name (e.g., the DRAM controller)
    should have its own set of PMCs. In general, sequencers are too small
    for their own PMCs. {{The PMCs would be larger than the sequencers
    they measure.}}
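
    [A companion sketch for the system-level counters Mitch describes,
    assuming Linux exposes such a block's PMU through perf. The PMU name
    "uncore_imc_0" and the event encoding below are platform-specific
    placeholders, not fixed values; real names and encodings come from
    /sys/bus/event_source/devices/ or vendor documentation.]

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <linux/perf_event.h>

        int main(void)
        {
            /* The dynamic PMU type id is published under
               /sys/bus/event_source/devices/<pmu>/type;
               "uncore_imc_0" is an example name that varies by platform. */
            FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
            if (!f) { perror("no such PMU on this system"); return 1; }
            int pmu_type;
            fscanf(f, "%d", &pmu_type);
            fclose(f);

            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.type = pmu_type;     /* the memory-controller PMU            */
            attr.size = sizeof(attr);
            attr.config = 0x0304;     /* placeholder event encoding; take the
                                         real value from the PMU's events/
                                         directory or vendor docs            */
            attr.disabled = 1;

            /* Uncore events are per-socket, not per-task: pid = -1, cpu = 0. */
            int fd = syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
            if (fd < 0) { perror("perf_event_open"); return 1; }

            ioctl(fd, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
            sleep(1);                 /* sample one second of traffic */
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

            long long count = 0;
            read(fd, &count, sizeof(count));
            printf("events in 1s: %lld\n", count);
            return 0;
        }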

    - anton
