• More of my philosophy about fused multiply–add (FM

    From Amine Moulay Ramdane@21:1/5 to All on Tue Mar 14 14:11:48 2023
    Hello,




    More of my philosophy about fused multiply–add (FMA or fmadd), AVX-512, Zen 4, technology, and more of my thoughts..

    I am a white Arab from Morocco, and I think I am smart, since I have also invented many scalable algorithms and other algorithms.



    I think I have to define fused multiply–add (FMA or fmadd) so that you can understand my thoughts below, so here it is:


    A fused multiply–add (FMA or fmadd) is a floating-point multiply–add operation that computes a x b + c in one step, with a single rounding. There is of course an AVX-512 fused multiply–add (FMA): by 2x8FMA I mean two 512-bit FMA units per core, each working on 8 double-precision values per cycle, every one of them with a single rounding. That is why I am giving you my calculations below. You should also read what I wrote below about how the AVX-512 implementation in the new AMD Zen 4 from the USA is unexpectedly good:
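    To make the "single rounding" point concrete, here is a minimal sketch in Python. It does not call a hardware FMA instruction; the helper fma_emulated is a hypothetical name introduced only for this illustration, and it emulates the fused behaviour by computing a x b + c exactly with rational numbers and rounding once at the end. The input values are chosen so that the separate multiply and add lose the answer entirely.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add (hypothetical helper, not a hardware FMA).

    The product and sum are formed exactly as rationals, then rounded once.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # exact, no rounding yet
    return float(exact)  # the single rounding to the nearest double


def mul_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded, then the sum is rounded again."""
    product = a * b      # first rounding
    return product + c   # second rounding


# Illustrative inputs: the exact product is 1 - 2**-60, which is not a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = mul_then_add(a, b, c)  # product rounds to 1.0, so the result is 0.0
fused = fma_emulated(a, b, c)    # keeps the small term: -2**-60

print("unfused:", unfused)
print("fused:  ", fused)
```

    With these inputs the unfused version returns exactly zero, while the single-rounding version keeps the small term -2**-60, which is the kind of difference a real FMA instruction makes.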


    More of my philosophy about AVX-512, Zen 4, technology, and more of my thoughts..


    "AVX-512 implementation in Zen 4 is unexpectedly good, despite relying on the double pumping of 256-bit units. Most of the operations are fast (they don’t have bad latencies), and the bandwidth in terms of 512-bit instructions processed per cycle is also good in the context of using 256-bit units. AMD has particularly fast Conflict Detection operations and Mask Registry handling, which has significantly higher performance than Intel’s implementation. Integer 64×64 multiplication (vpmullq) is extremely fast too."


    Read more here:

    https://www.hwcooling.net/en/how-good-is-amds-avx-512-does-it-improve-zen-4-performance/


    According to the following benchmark results, OpenBLAS is about 2.85 times faster than the default BLAS, and MKL is about 3.25 times faster:


    Read more here:

    https://csantill.github.io/RPerformanceWBLAS/



    So I think OpenBLAS is good, and you can download it from the following web link:

    https://www.openblas.net/



    I have also just looked at the following article about the benchmark of Intel Xeon Scalable Processor vs. Nvidia V100 GPU:

    https://www.xcelerit.com/computing-benchmarks/insights/benchmarks-intel-xeon-scalable-processor-vs-nvidia-v100-gpu/

    So I think that the main problem of the Intel Xeons in the above benchmark is memory bandwidth. I think that the number of GFLOPs of the Intel Xeons in that benchmark is the result of multiplying the frequency of the CPU by the number of cores and by 2x8FMA, I mean the fused multiply–add (FMA) instructions for floating-point scalar and SIMD operations, with each FMA counted as two floating-point operations, and that gives a result of 2,240 GFLOPs.

    So if you want a powerful computer that also has good memory bandwidth, I advise you to use a new two-socket motherboard for new Intel Xeon processors that supports a memory bandwidth of about 5.2 GT/s for DDR5 x 8 bytes per channel x 12 channels for one socket. That equals 499.2 GB per second for one socket, or 998.4 GB per second for two sockets, which is comparable to the memory bandwidth of the Nvidia V100 PCIe (Volta) in the above benchmark, and this will solve the memory bandwidth problem. And of course the two-socket motherboard with new 3.4 GHz Intel Xeon processors, 64 cores in total, will give you around 6963 GFLOPs (64 x 3.4 x 32), about the same as the Nvidia V100 PCIe (Volta).
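
    The arithmetic above can be checked with a short Python sketch. The function names and default parameters are assumptions made for this illustration: two FMA units per core, 8 double-precision lanes per 512-bit unit, and each FMA counted as 2 floating-point operations, which gives 32 operations per core per cycle. These are theoretical peak figures, not measured results.

```python
def peak_gflops(cores: int, ghz: float, fma_units: int = 2,
                lanes: int = 8, flops_per_fma: int = 2) -> float:
    """Theoretical peak double-precision GFLOPs (assumed model: 2x8 FMA per cycle)."""
    return cores * ghz * fma_units * lanes * flops_per_fma


def memory_bandwidth_gb_s(gt_per_s: float, bytes_per_transfer: int = 8,
                          channels: int = 12, sockets: int = 1) -> float:
    """Theoretical peak memory bandwidth in GB/s for the given DDR configuration."""
    return gt_per_s * bytes_per_transfer * channels * sockets


# Figures taken from the post: 64 cores at 3.4 GHz, DDR5 at 5.2 GT/s, 12 channels.
print(round(peak_gflops(64, 3.4), 1))                       # 6963.2
print(round(memory_bandwidth_gb_s(5.2), 1))                 # 499.2
print(round(memory_bandwidth_gb_s(5.2, sockets=2), 1))      # 998.4
```

    Under the same assumptions, a product of core count and frequency of 70 (cores x GHz) reproduces the 2,240 GFLOPs figure, since 70 x 32 = 2240.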



    Thank you,
    Amine Moulay Ramdane.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)