• More of my philosophy about x86 CPUs and about cache prefetching and more of my thoughts..

    From Amine Moulay Ramdane@21:1/5 to All on Wed Jun 15 15:58:28 2022
    Hello,


    More of my philosophy about x86 CPUs and about cache prefetching and more of my thoughts..

    I am a white Arab, and I think I am smart since I have also
    invented many scalable algorithms and other algorithms..


    I think I am highly smart, and today I will talk about
    how to prefetch data into the caches on x86 microprocessors:

    So here are my Delphi and FreePascal x86 inline assembler procedures that prefetch data into the caches:

    So for 32-bit Delphi and FreePascal compilers, here is how to prefetch data into the caches with the prefetcht1 hint, and notice that, in Delphi and FreePascal compilers, when we pass the first parameter of the procedure with the register calling convention, it is passed in the CPU register EAX of the x86 microprocessor:

    procedure Prefetch(p : pointer); register;
    asm
      { p arrives in EAX with the 32-bit register calling convention }
      prefetcht1 byte ptr [eax]
    end;


    For 64-bit Delphi and FreePascal compilers, here is how to do the same, and notice that, in Delphi and FreePascal compilers on Win64, when we pass the first parameter of the procedure with the register calling convention, it is passed in the CPU register RCX of the x86 microprocessor (on 64-bit Linux the System V ABI passes it in RDI instead):

    procedure Prefetch(p : pointer); register;
    asm
      { p arrives in RCX with the Win64 calling convention }
      prefetcht1 byte ptr [rcx]
    end;
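
    And as a usage illustration, here is a minimal sketch of mine (a hypothetical example, not part of the procedures above) that calls such a Prefetch procedure while summing a large array, requesting the element a few steps ahead of the one being processed; the look-ahead distance of 8 elements is only an assumption to tune:

    // Hypothetical usage sketch: prefetch ahead while walking a large array.
    // The distance of 8 doubles (64 bytes, about one cache line) is only an
    // assumption to tune for your data and your CPU.
    function SumWithPrefetch(const a: array of double): double;
    var
      i: integer;
    begin
      result := 0;
      for i := 0 to High(a) do
      begin
        if i + 8 <= High(a) then
          Prefetch(@a[i + 8]); // request the data we will need soon
        result := result + a[i];
      end;
    end;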

    And you can also use the other prefetch hints of the x86 instruction set, prefetcht0 and prefetcht2: just replace, in the above inline assembler procedures, prefetcht1 with prefetcht0 or prefetcht2. On Intel CPUs the hints map roughly as follows: prefetcht0 fetches the data into all levels of the cache hierarchy, prefetcht1 into the level 2 cache and higher, and prefetcht2 into the outer cache levels; the exact behaviour is implementation dependent. And I think I am highly smart and I say that those prefetch x86 assembler instructions exist because the microprocessor can be much faster than memory, but you have to understand that today the story is much nicer, since the powerful x86 processor cores can each sustain many memory requests at the same time, and we call this "memory-level parallelism": today's AMD or Intel x86 processor cores can support more than 10 independent memory requests at a time, and for example the Graviton 3 ARM CPU appears to sustain about 19 simultaneous memory loads per core against about 25 for the Intel processor. So I think I can also say that this memory-level parallelism is a form of latency hiding that speeds things up so that the CPU does not wait too much for memory.
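
    To make this memory-level parallelism idea more concrete, here is a small hypothetical sketch of mine in FreePascal/Delphi: in the first function the addresses of the loads are independent, so an out-of-order core can keep several cache misses in flight at the same time, while in the second function each load depends on the previous one (pointer chasing), so the misses serialize and the latency cannot be hidden:

    type
      PNode = ^TNode;
      TNode = record
        value: integer;
        next: PNode;
      end;

    // Independent accesses: the addresses are known in advance, so the core
    // can overlap many outstanding cache misses (memory-level parallelism).
    function SumArray(const a: array of integer): int64;
    var
      i: integer;
    begin
      result := 0;
      for i := 0 to High(a) do
        result := result + a[i];
    end;

    // Dependent accesses: each load needs the result of the previous one,
    // so only one miss is outstanding at a time and the latency is fully paid.
    function SumList(head: PNode): int64;
    begin
      result := 0;
      while head <> nil do
      begin
        result := result + head^.value;
        head := head^.next;
      end;
    end;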

    And now I invite you to read more of my thoughts about stack memory allocations and about preemptive and non-preemptive timesharing in the following web link:

    https://groups.google.com/g/alt.culture.morocco/c/JuC4jar661w


    And more of my philosophy about Stacktrace and more of my thoughts..

    I think I am highly smart, and I say that there are advantages and disadvantages to portability in software programming. For example, you can make your application run just on the Windows operating system, and that can be much more business friendly than making it run on multiple operating systems, since in business you have, for example, to develop and sell your application faster or much faster than the competition. So we cannot say that the tendency of C++ to require portability is always a good thing.

    Other than that, I have just looked at Delphi and FreePascal, and I have noticed that the stack trace support in FreePascal is much more enhanced than in Delphi. Look for example at the following FreePascal application that has made the stack trace portable to different operating systems and CPU architectures; it is a much more enhanced stack trace than the Delphi one that runs just on Windows:

    https://github.com/r3code/lazarus-exception-logger

    But notice carefully that the Delphi one runs just on Windows:

    https://docwiki.embarcadero.com/Libraries/Sydney/en/System.SysUtils.Exception.StackTrace


    So since a much more enhanced stack trace is important, I think that Delphi needs to provide us with one that is portable to different operating systems and CPU architectures.
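
    And as an illustration of how FreePascal already lets you print such a stack trace in a portable way, here is a minimal sketch of mine that uses the backtrace helpers of the FPC System unit (ExceptAddr, ExceptFrames, ExceptFrameCount and BackTraceStrFunc); I assume the program is compiled with line information (for example with -gl) so that the addresses resolve to units and line numbers:

    program StackTraceSketch;

    {$mode objfpc}{$H+}

    uses
      SysUtils;

    // Minimal sketch: print the exception address and the captured frames.
    // ExceptAddr, ExceptFrames, ExceptFrameCount and BackTraceStrFunc come
    // from the FPC System unit.
    procedure ReportException(E: Exception);
    var
      i: integer;
      frames: PPointer;
    begin
      WriteLn('Exception ', E.ClassName, ': ', E.Message);
      WriteLn(BackTraceStrFunc(ExceptAddr));
      frames := ExceptFrames;
      for i := 0 to ExceptFrameCount - 1 do
        WriteLn(BackTraceStrFunc(frames[i]));
    end;

    begin
      try
        raise Exception.Create('demo error');
      except
        on E: Exception do
          ReportException(E);
      end;
    end.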

    Also, the Free Pascal developer team has announced the addition of a long awaited feature, though to be precise it is two different, but very much related, features: function references and anonymous functions. These two features can be used independently of each other, but they unfold their greatest power when used together.

    Read about it here:

    https://forum.lazarus.freepascal.org/index.php/topic,59468.msg443370.html#msg443370
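
    And here is a small hypothetical sketch of mine of the syntax that this announcement describes; it needs a recent FPC development version (trunk / 3.3.1 or later) that provides the functionreferences and anonymousfunctions modeswitches:

    program FuncRefSketch;

    {$mode objfpc}{$H+}
    {$modeswitch functionreferences}
    {$modeswitch anonymousfunctions}

    type
      // A function reference can hold a plain function, a method, a nested
      // function or an anonymous function, and it can capture state.
      TIntToInt = reference to function(x: integer): integer;

    var
      offset: integer;
      f: TIntToInt;

    begin
      offset := 10;
      // An anonymous function assigned to the function reference;
      // it refers to the variable "offset" from the surrounding scope.
      f := function(x: integer): integer
           begin
             result := x + offset;
           end;
      WriteLn(f(5)); // writes 15
    end.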

    More of my philosophy about my Winmenus using Wingraph and using CRT and more of my thoughts..


    WinMenus using wingraph version 1.23

    Author: Amine Moulay Ramdane


    You can download my WinMenus using wingraph from my website here:

    https://sites.google.com/site/scalable68/winmenus-using-wingraph

    And you can download my Winmenus using CRT from here:

    https://sites.google.com/site/scalable68/winmenus


    I have implemented WinMenus using Wingraph; this one is graphical. I have also included an OpenGL demo and other demos; just execute the real3d1.exe executable inside the zip file to see how powerful it is.

    It is now compatible with both Delphi and FreePascal, and it works with Delphi Tokyo and above. There is only one difference between Delphi and FreePascal: the double click with the left mouse button in FreePascal is replaced in Delphi by a single click with the middle mouse button, to avoid a problem.

    Description:

    Drop-Down Menu widget using the Wingraph unit. Please look at the real3d1.pas demo inside the zip file to know how to use it.

    Use the 'Delete' key on the keyboard to delete items

    Use the 'Insert' key on the keyboard to insert items

    and use the 'Up', 'Down', 'PageUp' and 'PageDown' keys on the keyboard to scroll..

    and use the 'Tab' key on the keyboard to switch between the drop-down menus

    and 'Enter' on the keyboard, or a mouse double click (for FreePascal) or a middle mouse click (for Delphi), to select an item..

    and the 'Esc' key on the keyboard or a right mouse click to exit..

    and the 'F1' key on the keyboard to delete all the items from the list

    and the right arrow and left arrow keys to scroll to the left or to the right.

    You can search with the SearchName() and NextSearch() methods, and the search with wildcards inside the widget now works perfectly.

    WinMenus is event driven; I have to explain it more for you to understand it better...

    First you have to create your widget menu by executing something like this:

    Menu1:=TMenu.create(5,5);

    This will create a widget menu at the character coordinate (x,y) = (5,5)

    After that you have to set your callbacks, because my WinMenus is event driven, so you have to do it like this:

    Menu1.SetCallbacks(insert,updown);

    The SetCallbacks() method will set your callbacks. The first callback is the one that will be executed when the Insert key is pressed, and here above it is the "insert()" function; the second callback is the one that will be called when the Up and Down keys are pressed, and here above it is the "updown" function. The remaining callbacks that you can assign are for the following keys: Delete and F1 to F12.

    After that you have to set your callback function, because my WinMenus is event driven, so you add an item with AddItem() and set its callback function at the same time, like this:

    AddItem('First 3D opengl demo',test1);

    test1 will be the callback function.

    When you call execute(false), with the parameter equal to false, my WinMenus widget will draw your menu without waiting for your input and events; when you set the parameter of the execute() method to true, it will wait for your input and events. If the parameter of the execute() method is true and the returned value is ctTab, that means you have pressed the Tab key; if the returned value is ctExit, that means you have pressed the Escape key to exit.
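
    And to tie those pieces together, here is a minimal hypothetical sketch of mine assembled only from the calls described above (TMenu.create, SetCallbacks, AddItem and execute); the callback signatures and the unit name are my assumptions, so please check the real3d1.pas demo inside the zip file for the real declarations:

    program WinMenusSketch;

    uses
      Winmenus; // assumed unit name; see the demos inside the zip file

    // Hypothetical callbacks; the actual signatures are in real3d1.pas.
    procedure insert;
    begin
      // called when the Insert key is pressed
    end;

    procedure updown;
    begin
      // called when the Up or Down key is pressed
    end;

    procedure test1;
    begin
      // called when the 'First 3D opengl demo' item is selected
    end;

    var
      Menu1: TMenu;
    begin
      Menu1 := TMenu.create(5, 5);          // widget at character coordinate (5,5)
      Menu1.SetCallbacks(insert, updown);   // Insert and Up/Down key callbacks
      Menu1.AddItem('First 3D opengl demo', test1);
      Menu1.execute(false);                 // draw the menu without waiting
      // wait for input and events until the Escape key is pressed
      while Menu1.execute(true) <> ctExit do
        ;
    end.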

    I have also included my Graph3D unit for 3D graphics, which I have enhanced and which looks like the Graph unit of Turbo Pascal, and I have included the GUI.pas unit that comes with more GUI components. Please look at the demo.pas demo inside the zip file to learn how to use my WinMenus unit and GUI unit to build a GUI.

    More explanation about my Graph3D unit that I have included inside the zip file:


    About the Graph3D unit: it looks like the Graph unit of Turbo Pascal, but it is for 3D graphics. To understand the variables Rho, Theta, Phi and DE of the InitProj() method of the Graph3D unit, please read what is below.
    When you run the demo program called cube3d.pas, here are the keyboard keys that let you control it:

    Right arrow: increases the angle Theta (the variable Theta) to move in the XY plane anti-clockwise.

    Left arrow: decreases the angle Theta (the variable Theta) to move in the XY plane clockwise.

    Up arrow: increases the angle Phi (the variable Phi) to move up and look at the cube from above.

    Down arrow: decreases the angle Phi (the variable Phi) to move down and look at the cube from below.

    Key A: decreases R (the variable Rho) to get closer to the cube; you can even pass through it and go behind it, and in that case the image obtained is reversed.

    Key E: increases R (the variable Rho) to move away from the cube.

    Key +: increases the distance D (the variable DE) between the screen and the eye, which causes an enlargement of the image.

    Key -: decreases the distance D (the variable DE) between the screen and the eye, which causes the image to shrink, and possibly an inverted magnification if D becomes negative, i.e. if the screen passes behind the observer.

    Key C: toggles between perspective projection and parallel projection. During this toggle the parameters that were current are stored in auxiliary variables (RhoResp and DEResp for the perspective projection, RhoPara and DEPara for the parallel projection) in order to be able to return to them correctly afterwards.

    Key F: To end the running program.
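
    And for readers who want the mathematics behind those parameters, here is a sketch of mine of the classical spherical-coordinate viewing transform that this kind of Graph-style 3D unit typically implements (the actual Graph3D code may differ): the eye sits at distance Rho from the origin in the direction given by the angles Theta and Phi, and DE is the eye-to-screen distance used for the perspective division:

    // Sketch of the classical perspective projection behind the parameters
    // Rho, Theta, Phi and DE (assumed semantics; Graph3D itself may differ).
    procedure Project(x, y, z, Rho, Theta, Phi, DE: double; out xs, ys: double);
    var
      st, ct, sp, cp: double;
      xe, ye, ze: double;
    begin
      st := sin(Theta); ct := cos(Theta);
      sp := sin(Phi);   cp := cos(Phi);
      // world coordinates -> eye coordinates (the eye is on a sphere of radius Rho)
      xe := -x * st + y * ct;
      ye := -x * ct * cp - y * st * cp + z * sp;
      ze := -x * ct * sp - y * st * sp - z * cp + Rho;
      // perspective division: a larger DE enlarges the image, a negative DE inverts it
      xs := DE * xe / ze;
      ys := DE * ye / ze;
    end;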



    Thank you,
    Amine Moulay Ramdane.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From "I don't know@21:1/5 to All on Sun Oct 16 02:41:33 2022
    This is tech.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Amine Moulay Ramdane@21:1/5 to All on Wed Oct 26 12:06:07 2022
    Hello,



    More of my philosophy about x86 CPUs and about cache prefetching and more of my thoughts..

    I am a white Arab, and I think I am smart since I have also
    invented many scalable algorithms and other algorithms..


    I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ, and today I will talk about
    how to prefetch data into the caches on x86 microprocessors:

    So here are my Delphi and FreePascal x86 inline assembler procedures that prefetch data into the caches:

    So for 32-bit Delphi and FreePascal compilers, here is how to prefetch data into the caches with the prefetcht1 hint, and notice that, in Delphi and FreePascal compilers, when we pass the first parameter of the procedure with the register calling convention, it is passed in the CPU register EAX of the x86 microprocessor:

    procedure Prefetch(p : pointer); register;
    asm
      { p arrives in EAX with the 32-bit register calling convention }
      prefetcht1 byte ptr [eax]
    end;


    For 64-bit Delphi and FreePascal compilers, here is how to do the same, and notice that, in Delphi and FreePascal compilers on Win64, when we pass the first parameter of the procedure with the register calling convention, it is passed in the CPU register RCX of the x86 microprocessor (on 64-bit Linux the System V ABI passes it in RDI instead):

    procedure Prefetch(p : pointer); register;
    asm
      { p arrives in RCX with the Win64 calling convention }
      prefetcht1 byte ptr [rcx]
    end;


    And you can request that the data 256 bytes ahead of the pointer be loaded into the caches, which can be efficient, by doing this:

    So for 32-bit Delphi and FreePascal compilers you do this:


    procedure Prefetch(p : pointer); register;
    asm
      { prefetch the cache line 256 bytes ahead of p (p arrives in EAX) }
      prefetcht1 byte ptr [eax+256]
    end;


    So for 64-bit Delphi and FreePascal compilers you do this:


    procedure Prefetch(p : pointer); register;
    asm
      { prefetch the cache line 256 bytes ahead of p (p arrives in RCX) }
      prefetcht1 byte ptr [rcx+256]
    end;
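
    And as a small hypothetical variant of mine for the Win64 case, you can pass the look-ahead distance as a second parameter instead of hard-coding the 256 bytes; with the register convention on Win64 the pointer arrives in RCX and the second (integer) parameter in RDX, so the prefetch address can be formed with a base plus index addressing mode:

    procedure PrefetchAhead(p : pointer; distance : NativeInt); register;
    asm
      { p arrives in RCX and distance in RDX on Win64 }
      prefetcht1 byte ptr [rcx + rdx]
    end;

    For example, PrefetchAhead(p, 256) requests the data 256 bytes ahead of p.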


    So you can also use the other prefetch hints of the x86 instruction set, prefetcht0 and prefetcht2: just replace, in the above inline assembler procedures, prefetcht1 with prefetcht0 or prefetcht2. On Intel CPUs the hints map roughly as follows: prefetcht0 fetches the data into all levels of the cache hierarchy, prefetcht1 into the level 2 cache and higher, and prefetcht2 into the outer cache levels; the exact behaviour is implementation dependent. And I think I am highly smart and I say that those prefetch x86 assembler instructions exist because the microprocessor can be much faster than memory, but you have to understand that today the story is much nicer, since the powerful x86 processor cores can each sustain many memory requests at the same time, and we call this "memory-level parallelism": today's AMD or Intel x86 processor cores can support more than 10 independent memory requests at a time, and for example the Graviton 3 ARM CPU appears to sustain about 19 simultaneous memory loads per core against about 25 for the Intel processor. So I think I can also say that this memory-level parallelism is a form of latency hiding that speeds things up so that the CPU does not wait too much for memory.

    And now I invite you to read more of my thoughts about stack memory allocations and about preemptive and non-preemptive timesharing in the following web link:

    https://groups.google.com/g/alt.culture.morocco/c/JuC4jar661w


    And more of my philosophy about Stacktrace and more of my thoughts..

    I think I am highly smart, and I say that there are advantages and disadvantages to portability in software programming. For example, you can make your application run just on the Windows operating system, and that can be much more business friendly than making it run on multiple operating systems, since in business you have, for example, to develop and sell your application faster or much faster than the competition. So we cannot say that the tendency of C++ to require portability is always a good thing.

    Other than that, I have just looked at Delphi and FreePascal, and I have noticed that the stack trace support in FreePascal is much more enhanced than in Delphi. Look for example at the following FreePascal application that has made the stack trace portable to different operating systems and CPU architectures; it is a much more enhanced stack trace than the Delphi one that runs just on Windows:

    https://github.com/r3code/lazarus-exception-logger

    But notice carefully that the Delphi one runs just on Windows:

    https://docwiki.embarcadero.com/Libraries/Sydney/en/System.SysUtils.Exception.StackTrace


    So since a much more enhanced stack trace is important, I think that Delphi needs to provide us with one that is portable to different operating systems and CPU architectures.

    Also, the Free Pascal developer team has announced the addition of a long awaited feature, though to be precise it is two different, but very much related, features: function references and anonymous functions. These two features can be used independently of each other, but they unfold their greatest power when used together.

    Read about it here:

    https://forum.lazarus.freepascal.org/index.php/topic,59468.msg443370.html#msg443370

    More of my philosophy about the AMD Epyc CPU and more of my thoughts..




    I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ. If you want to be serious about buying
    a CPU and motherboard, I advise you to buy the following AMD EPYC 7313P Milan 16-core CPU that costs much less (around 1000 US dollars) and that is reliable and fast, since it is a 16-core CPU, it supports standard ECC memory and it supports 8 memory channels; here it is:

    https://en.wikichip.org/wiki/amd/epyc/7313p

    And a good Supermicro motherboard for it, which supports the EPYC Milan 7003 series, is the following:

    https://www.newegg.com/supermicro-mbd-h12ssl-nt-o-supports-single-amd-epyc-7003-7002-series-processor/p/1B4-005W-00911?Description=amd%20epyc%20motherboard&cm_re=amd_epyc%20motherboard-_-1B4-005W-00911-_-Product


    And the above AMD EPYC 7313P Milan 16-core CPU can be configured
    as NUMA using the Supermicro motherboard above as follows:

    This setting enables a trade-off between minimizing local memory latency for NUMA-aware or highly parallelizable workloads vs. maximizing per-core memory bandwidth for non-NUMA-friendly workloads. The default configuration (one NUMA domain per socket) is recommended for most workloads. NPS4 is recommended for HPC and other highly parallel workloads. Here is the detailed introduction to these options:

    • NPS0: Interleave memory accesses across all channels in both sockets (not recommended)

    • NPS1: Interleave memory accesses across all eight channels in each socket, report one NUMA node per socket (unless L3 Cache as NUMA is enabled)

    • NPS2: Interleave memory accesses across groups of four channels (ABCD and EFGH) in each socket, report two NUMA nodes per socket (unless L3 Cache as NUMA is enabled)

    • NPS4: Interleave memory accesses across pairs of two channels (AB, CD, EF and GH) in each socket, report four NUMA nodes per socket (unless L3 Cache as NUMA is enabled)
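
    And as a quick way to see how many NUMA nodes the chosen NPS setting actually reports to the operating system, here is a small hypothetical sketch of mine for Windows that calls the kernel32 function GetNumaHighestNodeNumber (I declare the import by hand; the Windows unit of recent Delphi or FreePascal versions may already declare it):

    program ShowNumaNodes;

    {$APPTYPE CONSOLE}

    uses
      Windows;

    // Import declared by hand; recent Windows units may already provide it.
    function GetNumaHighestNodeNumber(var HighestNodeNumber: ULONG): BOOL;
      stdcall; external 'kernel32.dll';

    var
      highest: ULONG;
    begin
      if GetNumaHighestNodeNumber(highest) then
        WriteLn('NUMA nodes reported by Windows: ', highest + 1)
      else
        WriteLn('GetNumaHighestNodeNumber failed');
    end.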


    And of course you have to read my following writing about DDR5 memory, which is not fully ECC memory:

    "On-die ECC: The presence of on-die ECC on DDR5 memory has been the subject of many discussions and a lot of confusion among consumers and the press alike. Unlike standard ECC, on-die ECC primarily aims to improve yields at advanced process nodes,
    thereby allowing for cheaper DRAM chips. On-die ECC only detects errors if they take place within a cell or row during refreshes. When the data is moved from the cell to the cache or the CPU, if there’s a bit-flip or data corruption, it won’t be
    corrected by on-die ECC. Standard ECC corrects data corruption within the cell and as it is moved to another device or an ECC-supported SoC."

    Read more here to notice it:

    https://www.hardwaretimes.com/ddr5-vs-ddr4-ram-quad-channel-and-on-die-ecc-explained/


    So if you want to get serious and professional, you can buy the above
    AMD EPYC 7313P Milan 16-core CPU with the Supermicro motherboard that I am advising, which supports it, supports full ECC memory and supports 8 memory channels.

    And of course you can read my thoughts about technology in the following web link:

    https://groups.google.com/g/soc.culture.usa/c/N_UxX3OECX4


    And of course you have to read my following thoughts that also show how powerful it is to use 8 memory channels:



    I have just said the following:

    --

    More of my philosophy about the new Zen 4 AMD Ryzen™ 9 7950X and more of my thoughts..


    So I have just looked at the new Zen 4 AMD Ryzen™ 9 7950X CPU, and I invite you to look at it here:

    https://www.amd.com/en/products/cpu/amd-ryzen-9-7950x

    But notice carefully that the problem is the number of supported memory channels: it supports just two memory channels, so it is not good, since for example my following open source software project of a Parallel C++ Conjugate Gradient Linear System Solver Library, which scales very well, scales around 8x on my 16-core Intel Xeon with 2 NUMA nodes and 8 memory channels, but it will not scale correctly on the new Zen 4 AMD Ryzen™ 9 7950X CPU with just 2 memory channels, since the solver is also memory-bound. Here is my powerful open source software project of a Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well, and I invite you to take a careful look at it:

    https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

    So I advise you to buy an AMD EPYC CPU or an Intel Xeon CPU that supports 8 memory channels.

    ---
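
    To put rough numbers on that (assuming, just for illustration, DDR4-3200 on the 8-channel Xeon and DDR5-5200 on the 2-channel Ryzen): 8 channels x 25.6 GB/s per DDR4-3200 channel give about 205 GB/s of theoretical peak memory bandwidth, against 2 channels x 41.6 GB/s per DDR5-5200 channel, which is about 83 GB/s, so a memory-bound solver like my Parallel C++ Conjugate Gradient Linear System Solver Library has roughly 2.5 times less memory bandwidth to feed the cores on the 2-channel desktop part.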


    And of course the next Zen 4 AMD EPYC CPUs with twelve DDR5 memory channels will allow my above algorithm to scale even more, and you can read about it here:

    https://www.tomshardware.com/news/amd-confirms-12-ddr5-memory-channels-on-genoa


    And here is the simulation program that uses the probabilistic mechanism that I have talked about and that proves to you that the algorithm of my Parallel C++ Conjugate Gradient Linear System Solver Library is scalable:

    If you look at my scalable parallel algorithm, it divides each array of the matrix into parts of 250 elements, and if you look carefully, I am using two functions that consume the greater part of all the CPU time, atsub() and asub(), and inside those functions I am using a probabilistic mechanism so as to render my algorithm scalable on NUMA architectures, and it also makes it scale on the memory channels. What I am doing is scrambling the array parts using a probabilistic function, and what I have noticed is that this probabilistic mechanism i
  • From U a@21:1/5 to All on Wed Oct 26 14:07:40 2022
    Tere...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)