• GNU/Linux Greatness: AVX 512 Assembly

    From Farley Flud@21:1/5 to All on Sat Jan 25 14:37:47 2025
    Assembly language programming is both extremely simple and
    extremely fun.

    Yes, simple. A CPU is a stupid beast and can only perform
    very simple tasks.

    Yes, fun. There is much enjoyment to be had in using these
    simple CPU tasks, like Lego, to construct complex functionality.

    AVX-512 is currently the way to go with assembly programming.
    AVX-512 operates on 512 bits at a time: 8 doubles, 16 floats,
    8 long ints, 16 ints, or 64 chars (uint8_t) simultaneously.
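
    The lane counts above follow from dividing the 64-byte register
    width by the element size. A minimal C sketch of that arithmetic
    (the names ZMM_BYTES and zmm_lanes are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

enum { ZMM_BYTES = 64 };   /* one 512-bit register = 64 bytes */

/* Hypothetical helper: how many elements of elem_size bytes
   fit in one 512-bit register. */
static size_t zmm_lanes(size_t elem_size)
{
    assert(elem_size > 0 && ZMM_BYTES % elem_size == 0);
    return ZMM_BYTES / elem_size;
}
```

    For example, zmm_lanes(4) gives 16 (floats or 32-bit ints) and
    zmm_lanes(8) gives 8 (doubles or 64-bit ints).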

    With GNU/Linux, AVX-512 is totally at your command.

    What follows is a very basic program that essentially does
    nothing. It merely uses AVX-512 assembly to read a data
    block of arbitrary length and then write that block back
    into different memory.

    Its purpose is to illustrate how to step through memory
    at a given stride to read all the data. Since not all data
    is a multiple of 512 bits, the code shows how to deal with any
    trailing bits.

    For the sake of illustration the following assembly code
    reads/writes 37 unsigned integers. These will fill 2 AVX-512
    registers with 5 uints left over. Those final 5 are handled
    with masking.
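
    The 37 = 2 x 16 + 5 split, and the five-bit mask for the tail, can
    be checked with a scalar C model. This is only a sketch: the helper
    names are hypothetical, and shld64 is a plain-C imitation of what
    the SHLD instruction does with 64-bit operands, which is how the
    assembly further down builds its mask.

```c
#include <assert.h>
#include <stdint.h>

/* Quotient from the DIV: number of full stride-sized blocks. */
static uint64_t full_blocks(uint64_t n, uint64_t stride)
{
    assert(stride > 0);
    return n / stride;
}

/* Plain-C model of a 64-bit SHLD: shift dst left by count bits,
   filling the vacated low bits from the high bits of src. */
static uint64_t shld64(uint64_t dst, uint64_t src, unsigned count)
{
    count &= 63;                 /* the CPU masks the count to 6 bits */
    if (count == 0)
        return dst;              /* also avoids an undefined 64-bit shift */
    return (dst << count) | (src >> (64 - count));
}

/* Mask with one low bit set per leftover element (the DIV remainder). */
static uint64_t tail_mask(uint64_t n, uint64_t stride)
{
    assert(stride > 0 && stride <= 64);
    return shld64(0, UINT64_MAX, (unsigned)(n % stride));
}
```

    For the values in this post, full_blocks(37, 16) is 2 and
    tail_mask(37, 16) is 0x1F, i.e. the five low bits set.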

    But any data block of at least one full stride, up to 2^64 bytes
    (whew!), can be handled with this simple code.

    This program is written in NASM assembly. NASM is the fucking
    best assembler on planet Earth, hands down.

    As I indicated, this program does essentially nothing. There is
    no output. To view the "results" use the GDB debugger or, better,
    the front end DDD. With DDD one can step through the code to watch
    the action unfold.

    Feast thine bloodshot, jaundiced eyeballs on absolutely perfect
    AVX-512 assembly code:

    ==================================
    Begin AVX-512 NASM Assembly
    ==================================

    BITS 64

    segment .text
    global _start

    _start:
        mov r8, data_in             ; r8 = source pointer
        mov r9, data_out            ; r9 = destination pointer
        mov rbx, qword [stride]     ; elements per 512-bit register
        xor rdx, rdx                ; clear high half of dividend for div
        mov rax, qword [N]
        div rbx                     ; rax = quotient, rdx = remainder
                                    ; note: the loop below assumes rax >= 1
    load:
        vmovdqa32 zmm1, zword [r8]  ; copy one full 64-byte block
        vmovdqa32 zword [r9], zmm1
        add r8, 64                  ; increment data pointers
        add r9, 64
        dec rax
        jnz load
        xor r11, r11                ; load mask, i.e. only rdx left over to load
        mov r10, -1
        mov rcx, rdx
        shld r11, r10, cl           ; r11 = mask with the low rdx bits set
        kmovq k1, r11
        vmovdqa32 zmm1{k1}{z}, zword [r8]   ; masked load, other lanes zeroed
        vmovdqa32 zword [r9], zmm1  ; note: unmasked store, writes all 64 bytes
    exit:
        xor edi, edi                ; exit status 0
        mov eax, 60                 ; sys_exit
        syscall

    segment .data
    align 64
    N:        dq 37                 ; set length of block and stride
    stride:   dq 16
    align 64
    data_in:  dd 16 dup (0xefbeadde) ; dummy data
              dd 16 dup (0xfecaafde)
              dd 5 dup (0xefbeadde)

    segment .bss
    alignb 64
    data_out: resd 37

    ==================================
    End AVX-512 NASM Assembly
    ==================================
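
    As a cross-check on the control flow, here is a scalar C model of
    the same copy. It is a sketch, not a translation: the function
    names are hypothetical, and it differs from the listing above in
    two ways. It skips the block loop when there is no full block, and
    it applies the mask to the final store as well as the load, so
    nothing past the last element is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 };   /* 32-bit elements per 512-bit register */

/* Scalar model of one masked 16-lane move: lane i is copied only if
   bit i of mask is set; other destination lanes are left untouched. */
static void masked_copy16(uint32_t *dst, const uint32_t *src, uint16_t mask)
{
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i))
            dst[i] = src[i];
}

/* Copy n elements: full blocks first, then a masked tail. */
static void copy_blocks(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t full = n / LANES;   /* quotient: number of full blocks */
    size_t rem  = n % LANES;   /* remainder: leftover elements    */

    for (size_t b = 0; b < full; b++)   /* never runs when full == 0 */
        masked_copy16(dst + b * LANES, src + b * LANES, 0xFFFF);

    if (rem != 0) {
        uint16_t mask = (uint16_t)((1u << rem) - 1);
        masked_copy16(dst + full * LANES, src + full * LANES, mask);
    }
}
```

    With n = 37 this copies two full blocks and then exactly five more
    elements, leaving everything after element 36 of the destination
    untouched.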



    --
    Gentoo: The Fastest GNU/Linux Hands Down

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Farley Flud@21:1/5 to Farley Flud on Sat Jan 25 19:19:06 2025
    On Sat, 25 Jan 2025 14:37:47 +0000, Farley Flud wrote:


    > Feast thine bloodshot, jaundiced eyeballs on absolutely perfect
    > AVX-512 assembly code:


    I cannot resist giving the NASM listing of the assembled code.
    (Bracketed operands are relocations. The addresses are absolute,
    so this is not position-independent code.)

    Feast thine jaundiced eyeballs below.

    Note the data in hexadecimal which reads "DEADBEEF..."

    That is a common device but I also added "DEAFCAFE..."

    These allow me to easily discern where things are.
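
    The constants are written as 0xefbeadde and 0xfecaafde in the
    source because x86 stores a 32-bit value least significant byte
    first, so the byte dump reads DE AD BE EF and DE AF CA FE. A small
    C sketch of that byte order (store_le32 is a hypothetical helper
    that writes the bytes explicitly, so it does not depend on the
    host's endianness):

```c
#include <assert.h>
#include <stdint.h>

/* Write a 32-bit value the way x86 stores it in memory:
   least significant byte first. */
static void store_le32(uint8_t out[4], uint32_t v)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}
```

    Calling store_le32 with 0xefbeadde fills the buffer with the bytes
    DE, AD, BE, EF in that order.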

    Also note the "90" at line 36. NASM pads alignment with byte
    "90" which is the NOP instruction. I should change that padding
    to all zeros but here it does no harm.

    Assembly language is the ultimate (and only) language. Anyone who
    does not embrace assembly is a phony and a fraud and deserves to
    be ostracized, if not worse.

    =================================================

    1 BITS 64
    2
    3 segment .text
    4 global _start
    5
    6 _start:
    7 00000000 49B8- mov r8, data_in
    7 00000002 [4000000000000000]
    8 0000000A 49B9- mov r9, data_out
    8 0000000C [0000000000000000]
    9 00000014 488B1C25[08000000] mov rbx, qword [stride]
    10 0000001C 4831D2 xor rdx, rdx
    11 0000001F 488B0425[00000000] mov rax, qword [N]
    12 00000027 48F7F3 div rbx ; rax = quotient, rdx = remainder
    13 load:
    14 0000002A 62D17D486F08 vmovdqa32 zmm1, zword [r8]
    15 00000030 62D17D487F09 vmovdqa32 zword [r9], zmm1
    16 00000036 4983C040 add r8, 64 ; increment data pointers
    17 0000003A 4983C140 add r9, 64
    18 0000003E 48FFC8 dec rax
    19 00000041 75E7 jnz load
    20 00000043 4D31DB xor r11, r11 ; load mask, i.e. only rdx to load and process
    21 00000046 49C7C2FFFFFFFF mov r10, -1
    22 0000004D 4889D1 mov rcx, rdx
    23 00000050 4D0FA5D3 shld r11, r10, cl
    24 00000054 C4C1FB92CB kmovq k1, r11;
    25 00000059 62D17DC96F08 vmovdqa32 zmm1{k1}{z}, zword [r8]
    26 0000005F 62D17D487F09 vmovdqa32 zword [r9], zmm1
    27 exit:
    28 00000065 31FF xor edi,edi
    29 00000067 B83C000000 mov eax,60
    30 0000006C 0F05 syscall
    31
    32 segment .data
    33 align 64
    34 00000000 2500000000000000 N: dq 37
    35 00000008 1000000000000000 stride: dq 16
    36 00000010 90<rep 30h> align 64
    37 00000040 DEADBEEFDEADBEEFDE- data_in: dd 16 dup (0xefbeadde)
    37 00000049 ADBEEFDEADBEEFDEAD-
    37 00000052 BEEFDEADBEEFDEADBE-
    37 0000005B EFDEADBEEFDEADBEEF-
    37 00000064 DEADBEEFDEADBEEFDE-
    37 0000006D ADBEEFDEADBEEFDEAD-
    37 00000076 BEEFDEADBEEFDEADBE-
    37 0000007F EF
    38 00000080 DEAFCAFEDEAFCAFEDE- dd 16 dup (0xfecaafde)
    38 00000089 AFCAFEDEAFCAFEDEAF-
    38 00000092 CAFEDEAFCAFEDEAFCA-
    38 0000009B FEDEAFCAFEDEAFCAFE-
    38 000000A4 DEAFCAFEDEAFCAFEDE-
    38 000000AD AFCAFEDEAFCAFEDEAF-
    38 000000B6 CAFEDEAFCAFEDEAFCA-
    38 000000BF FE
    39 000000C0 DEADBEEFDEADBEEFDE- dd 5 dup (0xefbeadde)
    39 000000C9 ADBEEFDEADBEEFDEAD-
    39 000000D2 BEEF
    40
    41 segment .bss
    42 alignb 64
    43 00000000 <res 94h> data_out: resd 37

    =====================================================================





    --
    Gentoo: The Fastest GNU/Linux Hands Down

  • From DFS@21:1/5 to Lameass Larry on Sun Jan 26 11:03:15 2025
    On 1/25/2025 9:37 AM, Lameass Larry wrote:


    > essentially does nothing

    > does essentially nothing


    Feeb In A Nutshell

  • From Tyrone@21:1/5 to Farley Flud on Sun Jan 26 17:18:16 2025
    On Jan 25, 2025 at 9:37:47 AM EST, "Farley Flud" <fflud@gnu.rocks> wrote:

    > Yes, simple. A CPU is a stupid beast and can only perform
    > very simple tasks.

    Perfectly describes you.

    Listen up, Junior. I was doing assembly language programming in 1980. Yes, 45 years ago. There is so much more to assembly language than just doing simple block moves of memory. That's like saying you can build a house because you can move a stack of bricks from point A to point B.

    You toddling in here and proclaiming "assembly is so great!" is only further proof of what a clueless kiddie you are. What's next? Will you offer us your opinion on color monitors versus green screen monitors? Burroughs EBCDIC versus IBM EBCDIC? COBOL versus Fortran?

    What a twat.

    BTW your "program" looks like more copy & paste from The Physics Forum.

  • From Farley Flud@21:1/5 to Tyrone on Sun Jan 26 17:43:48 2025
    On Sun, 26 Jan 2025 17:18:16 +0000, Tyrone wrote:


    > I was doing assembly language programming in 1980.


    Big fucking deal.

    You couldn't even begin to fathom current AVX-512 assembly.

    And you know it.

    Ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha!

    Lack of brains, head of bone. Must be Tyrone.

    Ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha!


    --
    Systemd: solving all the problems that you never knew you had.

  • From Stéphane CARPENTIE@21:1/5 to All on Sat Feb 1 10:33:54 2025
    On 26-01-2025, Tyrone <none@none.none> wrote:
    > On Jan 25, 2025 at 9:37:47 AM EST, "Farley Flud" <fflud@gnu.rocks> wrote:
    >
    >> Yes, simple. A CPU is a stupid beast and can only perform
    >> very simple tasks.
    >
    > Perfectly describes you.

    Nope. He can't even do a simple task. What is a simple task for
    anyone else starts as a huge challenge when he tries to perform it.

    --
    If you have time to waste:
    https://scarpet42.gitlab.io
