• Bit reversal in AVX2

    From James Van Buskirk@21:1/5 to All on Sun May 22 03:05:58 2022
    Just for fun I thought I would try various strategies for permuting
    an array of single-precision floating point numbers using AVX2.
    bitrev1.asm just does vpunpckl/hdq/qdq or its lane-crossing
    synthesis with vperm2i128 and has a hard limit of 24 clocks to
    bit-reverse a 64-element array because all of these operations
    use pipeline 5.

    D:\gfortran\james\bitrev>type bitrev1.asm
    format MS64 COFF

    section '.text' code readable writeable executable
    public bitrev1
    bitrev1:
    sub rsp, 56
    vmovdqu [rsp+32], xmm6
    vmovdqu [rsp+16], xmm7
    vmovdqu [rsp], xmm8

    vmovdqu ymm0, [rcx]
    vmovdqu ymm1, [rcx+32]
    vmovdqu ymm2, [rcx+64]
    vmovdqu ymm3, [rcx+96]
    vmovdqu ymm4, [rcx+128]
    vmovdqu ymm5, [rcx+160]
    vmovdqu ymm6, [rcx+192]
    vmovdqu ymm8, [rcx+224]

    vperm2i128 ymm7, ymm0, ymm1, 32
    vperm2i128 ymm0, ymm0, ymm1, 49
    vperm2i128 ymm1, ymm2, ymm3, 32
    vperm2i128 ymm2, ymm2, ymm3, 49
    vperm2i128 ymm3, ymm4, ymm5, 32
    vperm2i128 ymm4, ymm4, ymm5, 49
    vperm2i128 ymm5, ymm6, ymm8, 32
    vperm2i128 ymm6, ymm6, ymm8, 49

    vpunpckldq ymm8, ymm7, ymm3
    vpunpckhdq ymm7, ymm7, ymm3
    vpunpckldq ymm3, ymm0, ymm4
    vpunpckhdq ymm0, ymm0, ymm4
    vpunpckldq ymm4, ymm1, ymm5
    vpunpckhdq ymm1, ymm1, ymm5
    vpunpckldq ymm5, ymm2, ymm6
    vpunpckhdq ymm6, ymm2, ymm6

    vpunpcklqdq ymm2, ymm8, ymm4
    vpunpckhqdq ymm8, ymm8, ymm4
    vpunpcklqdq ymm4, ymm7, ymm1
    vpunpckhqdq ymm7, ymm7, ymm1
    vpunpcklqdq ymm1, ymm3, ymm5
    vpunpckhqdq ymm3, ymm3, ymm5
    vpunpcklqdq ymm5, ymm0, ymm6
    vpunpckhqdq ymm0, ymm0, ymm6

    vmovdqu [rcx], ymm2
    vmovdqu [rcx+32], ymm1
    vmovdqu [rcx+64], ymm4
    vmovdqu [rcx+96], ymm5
    vmovdqu [rcx+128], ymm8
    vmovdqu [rcx+160], ymm3
    vmovdqu [rcx+192], ymm7
    vmovdqu [rcx+224], ymm0

    epilog:
    vmovdqu xmm6, [rsp+32]
    vmovdqu xmm7, [rsp+16]
    vmovdqu xmm8, [rsp]
    add rsp, 56
    ret

    D:\gfortran\james\bitrev>fasm bitrev1.asm
    flat assembler version 1.71.49 (1048576 kilobytes memory)
    1 passes, 357 bytes.

    bitrev2.asm is the ugliest version, using permutations up front
    and in back to shift the data into position and then vpblendd
    to shift data between registers. It has the potential to be fastest
    because only 14 operations require pipeline 5.

    D:\gfortran\james\bitrev>type bitrev2.asm
    format MS64 COFF

    section '.text' code readable writeable executable
    public bitrev2
    bitrev2:
    sub rsp, 56
    vmovdqu [rsp+32], xmm6
    vmovdqu [rsp+16], xmm7
    vmovdqu [rsp], xmm8

    vmovdqu ymm0, [rcx]
    vmovdqu ymm4, [rcx+32]
    vmovdqu ymm2, [rcx+64]
    vmovdqu ymm6, [rcx+96]
    vmovdqu ymm1, [rcx+128]
    vmovdqu ymm5, [rcx+160]
    vmovdqu ymm3, [rcx+192]
    vmovdqu ymm8, [rcx+224]

    vmovdqu ymm7, yword [perm0]
    vpermd ymm1, ymm7, ymm1
    vpermq ymm2, ymm2, 177
    vmovdqu ymm7, yword [perm1]
    vpermd ymm3, ymm7, ymm3
    vpermq ymm4, ymm4, 78
    vmovdqu ymm7, yword [perm0+16]
    vpermd ymm5, ymm7, ymm5
    vpermq ymm6, ymm6, 27
    vmovdqu ymm7, yword [perm1+16]
    vpermd ymm8, ymm7, ymm8

    vpblendd ymm7, ymm0, ymm4, 240
    vpblendd ymm0, ymm0, ymm4, 15
    vpblendd ymm4, ymm2, ymm6, 240
    vpblendd ymm2, ymm2, ymm6, 15
    vpblendd ymm6, ymm1, ymm5, 240
    vpblendd ymm1, ymm1, ymm5, 15
    vpblendd ymm5, ymm3, ymm8, 240
    vpblendd ymm3, ymm3, ymm8, 15

    vpblendd ymm8, ymm7, ymm4, 204
    vpblendd ymm7, ymm7, ymm4, 51
    vpblendd ymm4, ymm0, ymm2, 204
    vpblendd ymm0, ymm0, ymm2, 51
    vpblendd ymm2, ymm6, ymm5, 204
    vpblendd ymm6, ymm6, ymm5, 51
    vpblendd ymm5, ymm1, ymm3, 204
    vpblendd ymm1, ymm1, ymm3, 51

    vpblendd ymm3, ymm8, ymm2, 170
    vpblendd ymm8, ymm8, ymm2, 85
    vpblendd ymm2, ymm7, ymm1, 170
    vpblendd ymm7, ymm7, ymm1, 85
    vpblendd ymm1, ymm4, ymm5, 170
    vpblendd ymm4, ymm4, ymm5, 85
    vpblendd ymm5, ymm0, ymm6, 170
    vpblendd ymm0, ymm0, ymm6, 85

    vmovdqu ymm6, yword [perm1+8]
    vpermd ymm8, ymm6, ymm8
    vmovdqu ymm6, yword [perm2]
    vpermd ymm2, ymm6, ymm2
    vmovdqu ymm6, yword [perm3]
    vpermd ymm7, ymm6, ymm7
    vpermq ymm1, ymm1, 78
    vmovdqu ymm6, yword [perm1+24]
    vpermd ymm4, ymm6, ymm4
    vmovdqu ymm6, yword [perm2+16]
    vpermd ymm5, ymm6, ymm5
    vmovdqu ymm6, yword [perm3+16]
    vpermd ymm0, ymm6, ymm0

    vmovdqu [rcx], ymm3
    vmovdqu [rcx+32], ymm1
    vmovdqu [rcx+64], ymm2
    vmovdqu [rcx+96], ymm5
    vmovdqu [rcx+128], ymm8
    vmovdqu [rcx+160], ymm4
    vmovdqu [rcx+192], ymm7
    vmovdqu [rcx+224], ymm0

    .epilog:
    vmovdqu xmm6, [rsp+32]
    vmovdqu xmm7, [rsp+16]
    vmovdqu xmm8, [rsp]
    add rsp, 56
    ret

    section '.data' data readable writeable align 32
    align 32
    perm0 dd 1,0,7,6,5,4,3,2,1,0,7,6
    perm1 dd 7,6,1,0,3,2,5,4,7,6,1,0,3,2
    perm2 dd 2,7,0,5,6,3,4,1,2,7,0,5
    perm3 dd 3,6,1,4,7,2,5,0,3,6,1,4

    D:\gfortran\james\bitrev>fasm bitrev2.asm
    flat assembler version 1.71.49 (1048576 kilobytes memory)
    3 passes, 901 bytes.

    bitrev3.asm uses vpgatherdd, which is a really slow instruction.
    It was the easiest to write, though.

    D:\gfortran\james\bitrev>type bitrev3.asm
    format MS64 COFF

    section '.text' code readable writeable executable
    public bitrev3
    bitrev3:
    sub rsp, 72
    vmovdqu [rsp+48], xmm6
    vmovdqu [rsp+32], xmm7
    vmovdqu [rsp+16], xmm8
    vmovdqu [rsp], xmm9

    vmovdqu ymm6, yword [table]

    vpcmpeqd ymm7, ymm7, ymm7
    vpgatherdd ymm0, [rcx+ymm6], ymm7
    vpcmpeqd ymm7, ymm7, ymm7
    vpgatherdd ymm1, [rcx+ymm6+16], ymm7
    vpcmpeqd ymm7, ymm7, ymm7
    vpgatherdd ymm2, [rcx+ymm6+8], ymm7
    vpcmpeqd ymm7, ymm7, ymm7
    vpgatherdd ymm3, [rcx+ymm6+24], ymm7
    vpcmpeqd ymm7, ymm7, ymm7
    vpgatherdd ymm4, [rcx+ymm6+4], ymm7
    vpcmpeqd ymm7, ymm7, ymm7
    vpgatherdd ymm5, [rcx+ymm6+20], ymm7
    vpcmpeqd ymm7, ymm7, ymm7
    vpgatherdd ymm8, [rcx+ymm6+12], ymm7
    vpcmpeqd ymm7, ymm7, ymm7
    vpgatherdd ymm9, [rcx+ymm6+28], ymm7

    vmovdqu [rcx], ymm0
    vmovdqu [rcx+32], ymm1
    vmovdqu [rcx+64], ymm2
    vmovdqu [rcx+96], ymm3
    vmovdqu [rcx+128], ymm4
    vmovdqu [rcx+160], ymm5
    vmovdqu [rcx+192], ymm8
    vmovdqu [rcx+224], ymm9

    epilog:
    vmovdqu xmm6, [rsp+48]
    vmovdqu xmm7, [rsp+32]
    vmovdqu xmm8, [rsp+16]
    vmovdqu xmm9, [rsp]
    add rsp, 72
    ret

    section '.data' data readable writeable align 32
    align 32
    table dd 0,128,64,192,32,160,96,224

    D:\gfortran\james\bitrev>fasm bitrev3.asm
    flat assembler version 1.71.49 (1048576 kilobytes memory)
    3 passes, 401 bytes.

    bitrev4.asm just reads addresses to swap from a lookup table. It
    seems to have a hard limit of 56 clocks because my Haswell appears
    to have only only store pipeline.

    D:\gfortran\james\bitrev>type bitrev4.asm
    format MS64 COFF

    section '.text' code readable writeable executable
    public bitrev4
    bitrev4:
    push rbx
    lea r8, [table+56]
    mov r9, -56
    .outer:
    mov rax, qword [r8+r9]
    .inner:
    movzx edx, ah
    movzx r11d, al
    mov r10d, [rcx+4*rdx]
    mov ebx, [rcx+4*r11]
    mov [rcx+4*r11], r10d
    mov [rcx+4*rdx], ebx
    shr rax, 16
    jnz .inner
    add r9, 8
    jnz .outer
    pop rbx
    ret

    section '.data' data readable writeable align 32
    align 32
    table db 1,32,2,16,3,48,4,8,5,40,6,24,7,56,9,36,10,20,11,52,13
    db 44,14,28,15,60,17,34,19,50,21,42,22,26,23,58,25,38,27
    db 54,29,46,31,62,35,49,37,41,39,57,43,53,47,61,55,59 D:\gfortran\james\bitrev>fasm bitrev4.asm
    flat assembler version 1.71.49 (1048576 kilobytes memory)
    3 passes, 279 bytes.

    A timing function is required.

    D:\gfortran\james\bitrev>type rdtscp.asm
    format MS64 COFF

    section '.text' code readable writeable executable
    public _rdtscp
    _rdtscp:
    rdtscp
    shl rdx, 32
    or rax, rdx
    ret

    D:\gfortran\james\bitrev>fasm rdtscp.asm
    flat assembler version 1.71.49 (1048576 kilobytes memory)
    1 passes, 111 bytes.

    I used the free gfortran compiler to test the procedures. It first tests
    to make sure that the subroutine does its bit-reversal correctly: for
    each subroutine the sum of the absolute values of the differences
    between actual and theoretical outputs is shown to be zero. Then
    it goes into a loop where it times the subroutine for 1000 repititions
    and stores the time in an array. It prints out the results for 23 such
    timings and moves on to testing the next subroutine.

    To make a long story short these tests seemed to show that
    bitrev1 took about 27 clocks, bitrev2 about 23, bitrev3 about
    93, and bitrev4 about 72 clocks.

    Oh well, maybe I should have gone out and played in the snow
    instead today.

    D:\gfortran\james\bitrev>type timer.f90
    module funcs
    use ISO_FORTRAN_ENV
    implicit none
    interface
    subroutine bitrev1(x) bind(C,name='bitrev1')
    import
    implicit none
    real(REAL32), intent(inout) :: x(0:63)
    end subroutine bitrev1
    subroutine bitrev2(x) bind(C,name='bitrev2')
    import
    implicit none
    real(REAL32), intent(inout) :: x(0:63)
    end subroutine bitrev2
    subroutine bitrev3(x) bind(C,name='bitrev3')
    import
    implicit none
    real(REAL32), intent(inout) :: x(0:63)
    end subroutine bitrev3
    subroutine bitrev4(x) bind(C,name='bitrev4')
    import
    implicit none
    real(REAL32), intent(inout) :: x(0:63)
    end subroutine bitrev4
    function rdtscp() bind(C,name='_rdtscp')
    import
    implicit none
    integer(INT64) rdtscp
    end function rdtscp
    end interface
    contains
    function br64(x)
    integer br64
    integer, intent(in) :: x
    integer t
    t = x
    br64 = ((((ibits(t,0,1)*2+ibits(t,1,1))*2+ibits(t,2,1))*2+ &
    ibits(t,3,1))*2+ibits(t,4,1))*2+ibits(t,5,1)
    end function br64
    end module funcs

    program start
    use funcs
    implicit none
    real(REAL32) x(0:63)
    integer i
    real(REAL32) check
    integer(INT64) timestamps(24)
    type has_fun
    procedure(bitrev1), pointer, nopass :: fun
    end type has_fun
    type(has_fun) lots(4)
    integer j
    integer k

    lots(1)%fun => bitrev1
    lots(2)%fun => bitrev2
    lots(3)%fun => bitrev3
    lots(4)%fun => bitrev4
    do j = 1, 4
    x = [(i,i=0,63)]
    call lots(j)%fun(x)
    check = 0
    do i = 0, 63
    check = check+abs(br64(i)-x(i))
    end do
    write(*,'(*(g0))') 'bitrev',j,': check = ',check
    do i = 1, 1000
    call lots(j)%fun(x)
    check = rdtscp()
    end do
    timestamps(1) = rdtscp()
    do i = 2, size(timestamps)
    do k = 1, 1000
    call lots(j)%fun(x)
    end do
    timestamps(i) = rdtscp()
    end do
    write(*,'(g0)') timestamps(2:)-timestamps(:size(timestamps)-1)
    end do
    end program start

    D:\gfortran\james\bitrev>gfortran -O3 timer.f90 bitrev1.obj bitrev2.obj bitr obj bitrev4.obj rdtscp.obj -otimer

    D:\gfortran\james\bitrev>timer
    bitrev1: check = 0.00000000
    27324
    27007
    27185
    27027
    27010
    27025
    27067
    27006
    27067
    27025
    27067
    27025
    27067
    27006
    27067
    27025
    27276
    27027
    27013
    27025
    27067
    27006
    27067
    bitrev2: check = 0.00000000
    23300
    23126
    23215
    23232
    23209
    23232
    23206
    23232
    23194
    23221
    23194
    23232
    23221
    23235
    23194
    23232
    23197
    23232
    23194
    23221
    23194
    23235
    23194
    bitrev3: check = 0.00000000
    93190
    93139
    93116
    93124
    93119
    93124
    93116
    93113
    93127
    93113
    93127
    93113
    93116
    93124
    93116
    93124
    93116
    93116
    93124
    93113
    93113
    93124
    93113
    bitrev4: check = 0.00000000
    72801
    72772
    72062
    72101
    72086
    72074
    72086
    72074
    72086
    72074
    72086
    72074
    72086
    72074
    72086
    72074
    72086
    72074
    72086
    72074
    108450
    74655
    74803

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to James Van Buskirk on Mon May 23 08:36:54 2022
    James Van Buskirk wrote:
    Just for fun I thought I would try various strategies for permuting
    an array of single-precision floating point numbers using AVX2.
    bitrev1.asm just does vpunpckl/hdq/qdq or its lane-crossing
    synthesis with vperm2i128 and has a hard limit of 24 clocks to
    bit-reverse a 64-element array because all of these operations
    use pipeline 5.
    [snip]
    To make a long story short these tests seemed to show that
    bitrev1 took about 27 clocks, bitrev2 about 23, bitrev3 about
    93, and bitrev4 about 72 clocks.

    Interesting stuff, thanks for posting!


    Oh well, maybe I should have gone out and played in the snow
    instead today.

    Snow today, in late May? Are you in some New Zealand southern island mountains/ditto South America/Antarctica?

    Or just high up in the Rockies (US/Canada) and the snow is old stuff?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Van Buskirk@21:1/5 to Terje Mathisen on Mon May 23 16:43:40 2022
    "Terje Mathisen" wrote in message news:t6fa22$1lpt$1@gioia.aioe.org...

    James Van Buskirk wrote:
    Just for fun I thought I would try various strategies for permuting
    an array of single-precision floating point numbers using AVX2.
    bitrev1.asm just does vpunpckl/hdq/qdq or its lane-crossing
    synthesis with vperm2i128 and has a hard limit of 24 clocks to
    bit-reverse a 64-element array because all of these operations
    use pipeline 5.
    [snip]
    To make a long story short these tests seemed to show that
    bitrev1 took about 27 clocks, bitrev2 about 23, bitrev3 about
    93, and bitrev4 about 72 clocks.

    Interesting stuff, thanks for posting!

    I have an update coming for bitrev2.

    Oh well, maybe I should have gone out and played in the snow
    instead today.

    Snow today, in late May? Are you in some New Zealand southern island mountains/ditto South America/Antarctica?

    Or just high up in the Rockies (US/Canada) and the snow is old stuff?

    https://kdvr.com/weather/weather-forecast/near-record-90-degree-heat-thursday-snowstorm-friday/

    I did manage to play in the snow today. It was a little slippery in spots
    but nice and cool, perfect for hiking.

    https://bouldercolorado.gov/media/2530/download?inline=

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Van Buskirk@21:1/5 to Terje Mathisen on Tue May 24 00:59:00 2022
    "Terje Mathisen" wrote in message news:t6htq1$ujc$1@gioia.aioe.org...

    James Van Buskirk wrote:

    I have an update coming for bitrev2.

    Nice. :-)

    Here it is. I have actually added some comments which show the
    contents of the register written, either a working register (when
    the data are the indices in the original array of the components)
    or a permutation register required for vpermd (a pity that AVX2
    doesn't have a lane-crossing version of palignr or a rotate
    instruction.)

    Didn't show any improvement in performance, but it certainly is
    cleaner code.

    D:\gfortran\james\bitrev>type bitrev2a.asm
    format MS64 COFF

    section '.text' code readable writeable executable
    public bitrev2
    bitrev2:
    sub rsp, 56
    vmovdqu [rsp+32], xmm6
    vmovdqu [rsp+16], xmm7
    vmovdqu [rsp], xmm8

    vmovdqu ymm0, [rcx] ; 0 1 2 3 4 5 6 7
    vmovdqu ymm4, [rcx+32] ; 8 9 10 11 12 13 14 15
    vmovdqu ymm2, [rcx+64] ; 16 17 18 19 20 21 22 23
    vmovdqu ymm6, [rcx+96] ; 24 25 26 27 28 29 30 31
    vmovdqu ymm1, [rcx+128] ; 32 33 34 35 36 37 38 39
    vmovdqu ymm5, [rcx+160] ; 40 41 42 43 44 45 46 47
    vmovdqu ymm3, [rcx+192] ; 48 49 50 51 52 53 54 55
    vmovdqu ymm8, [rcx+224] ; 56 57 58 59 60 61 62 63

    vmovdqu ymm7, yword [perm0+24] ; 7 0 1 2 3 4 5 6
    vpermd ymm1, ymm7, ymm1 ; 39 32 33 34 35 36 37 38
    vpermq ymm2, ymm2, 147 ; 22 23 16 17 18 19 20 21
    vmovdqu ymm7, yword [perm0+16] ; 5 6 7 0 1 2 3 4
    vpermd ymm3, ymm7, ymm3 ; 53 54 55 48 49 50 51 52
    vpermq ymm4, ymm4, 78 ; 12 13 14 15 8 9 10 11
    vmovdqu ymm7, yword [perm0+8] ; 3 4 5 6 7 0 1 2
    vpermd ymm5, ymm7, ymm5 ; 43 44 45 46 47 40 41 42
    vpermq ymm6, ymm6, 57 ; 26 27 28 29 30 31 24 25
    vmovdqu ymm7, yword [perm0] ; 1 2 3 4 5 6 7 0
    vpermd ymm8, ymm7, ymm8 ; 57 58 59 60 61 62 63 56

    vpblendd ymm7, ymm0, ymm1, 170 ; 0 32 2 34 4 36 6 38
    vpblendd ymm0, ymm0, ymm1, 85 ; 39 1 33 3 35 5 37 7
    vpblendd ymm1, ymm2, ymm3, 170 ; 22 54 16 48 18 50 20 52
    vpblendd ymm2, ymm2, ymm3, 85 ; 53 23 55 17 49 19 51 21
    vpblendd ymm3, ymm4, ymm5, 170 ; 12 44 14 46 8 40 10 42
    vpblendd ymm4, ymm4, ymm5, 85 ; 43 13 45 15 47 9 41 11
    vpblendd ymm5, ymm6, ymm8, 170 ; 26 58 28 60 30 62 24 56
    vpblendd ymm6, ymm6, ymm8, 85 ; 57 27 59 29 61 31 63 25

    vpblendd ymm8, ymm7, ymm1, 204 ; 0 32 16 48 4 36 20 52
    vpblendd ymm7, ymm7, ymm1, 51 ; 22 54 2 34 18 50 6 38
    vpblendd ymm1, ymm0, ymm2, 102 ; 39 23 55 3 35 19 51 7
    vpblendd ymm0, ymm0, ymm2, 153 ; 53 1 33 17 49 5 37 21
    vpblendd ymm2, ymm3, ymm5, 204 ; 12 44 28 60 8 40 24 56
    vpblendd ymm3, ymm3, ymm5, 51 ; 26 58 14 46 30 62 10 42
    vpblendd ymm5, ymm4, ymm6, 102 ; 43 27 59 15 47 31 63 11
    vpblendd ymm4, ymm4, ymm6, 153 ; 57 13 45 29 61 9 41 25

    vpblendd ymm6, ymm8, ymm2, 240 ; 0 32 16 48 8 40 24 56
    vpblendd ymm8, ymm8, ymm2, 15 ; 12 44 28 60 4 36 20 52
    vpblendd ymm2, ymm7, ymm3, 60 ; 22 54 14 46 30 62 6 38
    vpblendd ymm7, ymm7, ymm3, 195 ; 26 58 2 34 18 50 10 42
    vpblendd ymm3, ymm1, ymm5, 120 ; 39 23 55 15 47 31 63 7
    vpblendd ymm1, ymm1, ymm5, 135 ; 43 27 59 3 35 19 51 11
    vpblendd ymm5, ymm0, ymm4, 30 ; 53 13 45 29 61 5 37 21
    vpblendd ymm0, ymm0, ymm4, 225 ; 57 1 33 17 49 9 41 25

    vpermq ymm8, ymm8, 78 ; 4 36 20 52 12 44 28 60
    vpermq ymm2, ymm2, 147 ; 6 38 22 54 14 46 30 62
    vpermq ymm7, ymm7, 57 ; 2 34 18 50 10 42 26 58
    vmovdqu ymm4, yword [perm0+24] ; 7 0 1 2 3 4 5 6
    vpermd ymm3, ymm4, ymm3 ; 7 39 23 55 15 47 31 63
    vmovdqu ymm4, yword [perm0+8] ; 3 4 5 6 7 0 1 2
    vpermd ymm1, ymm4, ymm1 ; 3 35 19 51 11 43 27 59
    vmovdqu ymm4, yword [perm0+16] ; 5 6 7 0 1 2 3 4
    vpermd ymm5, ymm4, ymm5 ; 5 37 21 53 13 45 29 61
    vmovdqu ymm4, yword [perm0] ; 1 2 3 4 5 6 7 0
    vpermd ymm0, ymm4, ymm0 ; 1 33 17 49 9 41 25 57

    vmovdqu [rcx], ymm6
    vmovdqu [rcx+32], ymm8
    vmovdqu [rcx+64], ymm7
    vmovdqu [rcx+96], ymm2
    vmovdqu [rcx+128], ymm0
    vmovdqu [rcx+160], ymm5
    vmovdqu [rcx+192], ymm1
    vmovdqu [rcx+224], ymm3

    .epilog:
    vmovdqu xmm6, [rsp+32]
    vmovdqu xmm7, [rsp+16]
    vmovdqu xmm8, [rsp]
    add rsp, 56
    ret

    section '.data' data readable writeable align 32
    align 32
    perm0 dd 1,2,3,4,5,6,7,0,1,2,3,4,5,6

    D:\gfortran\james\bitrev>fasm bitrev2a.asm
    flat assembler version 1.71.49 (1048576 kilobytes memory)
    3 passes, 723 bytes.

    D:\gfortran\james\bitrev>gfortran -O3 timer.f90 bitrev1.obj bitrev2a.obj bitrev3
    .obj bitrev4.obj rdtscp.obj -otimer

    D:\gfortran\james\bitrev>timer
    [...]
    bitrev2: check = 0.00000000
    22872
    22703
    22721
    22724
    22752
    22672
    22709
    22740
    22736
    22721
    22706
    22727
    22709
    22721
    22706
    22718
    22712
    22718
    22706
    22727
    22712
    22718
    22706
    [...]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to James Van Buskirk on Tue May 24 08:26:13 2022
    James Van Buskirk wrote:
    "Terje Mathisen"  wrote in message news:t6fa22$1lpt$1@gioia.aioe.org...

    James Van Buskirk wrote:
    Just for fun I thought I would try various strategies for permuting
    an array of single-precision floating point numbers using AVX2.
    bitrev1.asm just does vpunpckl/hdq/qdq or its lane-crossing
    synthesis with vperm2i128 and has a hard limit of 24 clocks to
    bit-reverse a 64-element array because all of these operations
    use pipeline 5.
    [snip]
    To make a long story short these tests seemed to show that
    bitrev1 took about 27 clocks, bitrev2 about 23, bitrev3 about
    93, and bitrev4 about 72 clocks.

    Interesting stuff, thanks for posting!

    I have an update coming for bitrev2.

    Nice. :-)

    Oh well, maybe I should have gone out and played in the snow
    instead today.

    Snow today, in late May? Are you in some New Zealand southern island
    mountains/ditto South America/Antarctica?

    Or just high up in the Rockies (US/Canada) and the snow is old stuff?

    https://kdvr.com/weather/weather-forecast/near-record-90-degree-heat-thursday-snowstorm-friday/

    That link is location-limited, but it did confirm that we are talking
    about the Colorado side of the Rockies.


    I did manage to play in the snow today. It was a little slippery in spots
    but nice and cool, perfect for hiking.

    https://bouldercolorado.gov/media/2530/download?inline=

    Seems similar to my last trip to Boulder in Nov 2019, before Covid:
    Hiked/ran to the top of the Flatirons in 2-5 cm of fresh snow: A bit
    slippery in a few places with just rocks/slabs for footing.

    https://photos.app.goo.gl/v2bRgcqtL1hy5Wtm8

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)