• Re: OOS approach revisited

    From minforth@21:1/5 to LIT on Fri Jun 27 02:16:20 2025
    On Thu, 26 Jun 2025 17:27:48 +0000, LIT wrote:

    The saving come from rolling @ @ + ! into a single very specialized
    function. But what about the loading of X Y and retrieving of Z which
    are unavoidable in practice? Should that not be included in the test?

    Let's find out then:

    1 VARIABLE X
    2 VARIABLE Y
    3 VARIABLE Z

    : TEST1 1000 0 DO 10000 0 DO
    I DUP X ! Y ! X @ Y @ + Z ! Z @ DROP
    LOOP LOOP ;
    : TEST2 1000 0 DO 10000 0 DO
    I DUP X ! Y ! X Y Z +> Z @ DROP
    LOOP LOOP ;
    TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 252 ok
    TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 202 ok

    : TEST1 1000 0 DO 10000 0 DO
    I DUP X ! Y ! 1 X +! 1 Y +! X @ Y @ + Z ! Z @ DROP
    LOOP LOOP ;
    : TEST2 1000 0 DO 10000 0 DO
    I DUP X ! Y ! X ++ Y ++ X Y Z +> Z @ DROP
    LOOP LOOP ;
    TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 346 ok
    TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 258 ok

    The difference is smaller - still it's significant.


    Another test - using the "drawing a box" example
    from "Thinking Forth" (and "simulated" LINE word):

    0 VARIABLE TOP
    0 VARIABLE LEFT
    0 VARIABLE BOTTOM
    0 VARIABLE RIGHT
    : LINE 2DROP 2DROP ;

    : BOX1 ( x1 y1 x2 y2) BOTTOM ! RIGHT ! TOP ! LEFT !
    LEFT @ TOP @ RIGHT @ TOP @ LINE
    RIGHT @ TOP @ RIGHT @ BOTTOM @ LINE
    RIGHT @ BOTTOM @ LEFT @ BOTTOM @ LINE
    LEFT @ BOTTOM @ LEFT @ TOP @ LINE ;

    : BOX2 ( x1 y1 x2 y2) BOTTOM ! RIGHT ! TOP ! LEFT !
    LEFT TOP RIGHT TOP LINE
    RIGHT TOP RIGHT BOTTOM LINE
    RIGHT BOTTOM LEFT BOTTOM LINE
    LEFT BOTTOM LEFT TOP LINE ;

    : TEST1 1000 0 DO 10000 0 DO I DUP 2DUP BOX1 LOOP LOOP ;
    : TEST2 1000 0 DO 10000 0 DO I DUP 2DUP BOX2 LOOP LOOP ;

    TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 890 ok
    TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 653 ok


    The difference is even more significant in case
    of multiplication:

    1 VARIABLE X
    2 VARIABLE Y
    3 VARIABLE Z

    : TEST1 1000 0 DO 10000 0 DO
    I DUP X ! Y ! X @ Y @ * Z ! Z @ DROP
    LOOP LOOP ;
    : TEST2 1000 0 DO 10000 0 DO
    I DUP X ! Y ! X Y Z *> Z @ DROP
    LOOP LOOP ;
    TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 658 ok
    TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 200 ok

    But this time better implementation has also
    its impact; fig-Forth's '*' is inefficient,
    and I coded '*>' of course directly in ML,
    simply using IMUL.


    With so many DO..LOOPs involved, be careful not to
    measure more looping time than the multiplications.

    IIRC DO..LOOPs had been a hack for computers in the 60s.
    A rather ugly hack, born out of necessity, slow and
    often cumbersome to use. That it still persists in Forth
    half a century later speaks for Forth's progressiveness.

    --

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Fri Jun 27 11:49:11 2025
    Am 27.06.2025 um 09:29 schrieb dxf:
    On 27/06/2025 12:16 pm, minforth wrote:
    ...
    IIRC DO..LOOPs had been a hack for computers in the 60s.
    A rather ugly hack, born out of necessity, slow and
    often cumbersome to use. That it still persists in Forth
    half a century later speaks for Forth's progressiveness.

    Testing FOR NEXT on my DTC system showed 15% speed increase over
    DO LOOP. Putting 5 NOOPs (executes forth's address interpreter)
    in the innermost loop brought it down to 6%. Not worth it IMO.


    It really depends on how counted loops are implemented.
    Most CPUs have operators for register-based count-down loops
    that are blazingly fast.

    If they can be used within Forth-based loop constructs
    I would expect a greater speed increase than what you measured.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to LIT on Fri Jun 27 20:15:16 2025
    In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    It really depends on how counted loops are implemented.
    Most CPUs have operators for register-based count-down loops
    that are blazingly fast.

    If they can be used within Forth-based loop constructs
    I would expect a greater speed increase than what you measured.

    In that old fig-Forth it's rather short and simple:

    sqHeader '(LOOP)'
    XLOOP dw $ + 2
    mov BX,1
    XLOO1: add [BP],BX
    mov AX,[BP]
    sub AX,[BP+2]
    xor AX,BX
    js BRAN1
    add BP,4
    inc SI
    inc SI
    jmp NEXT

    It doesn't look that bad. Can it be
    done even shorter?

    My optimiser looks into the combination of DO and LOOP,
    transfers the returns stack into registers after inlining
    everything. It is near vfx performance.
    All experimental, but yes there is much to be gained.

    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Fri Jun 27 22:35:32 2025
    Am 27.06.2025 um 20:15 schrieb albert@spenarnc.xs4all.nl:
    In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    It really depends on how counted loops are implemented.
    Most CPUs have operators for register-based count-down loops
    that are blazingly fast.

    If they can be used within Forth-based loop constructs
    I would expect a greater speed increase than what you measured.

    In that old fig-Forth it's rather short and simple:

    sqHeader '(LOOP)'
    XLOOP dw $ + 2
    mov BX,1
    XLOO1: add [BP],BX
    mov AX,[BP]
    sub AX,[BP+2]
    xor AX,BX
    js BRAN1
    add BP,4
    inc SI
    inc SI
    jmp NEXT

    It doesn't look that bad. Can it be
    done even shorter?

    My optimiser looks into the combination of DO and LOOP,
    transfers the returns stack into registers after inlining
    everything. It is near vfx performance.
    All experimental, but yes there is much to be gained.

    Must be tricky to do UNLOOP in a register-based loop. ;-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Pelc@21:1/5 to minforth on Sat Jun 28 09:37:33 2025
    On 27 Jun 2025 at 22:35:32 CEST, "minforth" <minforth@gmx.net> wrote:

    Am 27.06.2025 um 20:15 schrieb albert@spenarnc.xs4all.nl:
    In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    It really depends on how counted loops are implemented.
    Most CPUs have operators for register-based count-down loops
    that are blazingly fast.

    If they can be used within Forth-based loop constructs
    I would expect a greater speed increase than what you measured.

    In that old fig-Forth it's rather short and simple:

    sqHeader '(LOOP)'
    XLOOP dw $ + 2
    mov BX,1
    XLOO1: add [BP],BX
    mov AX,[BP]
    sub AX,[BP+2]
    xor AX,BX
    js BRAN1
    add BP,4
    inc SI
    inc SI
    jmp NEXT

    It doesn't look that bad. Can it be
    done even shorter?

    My optimiser looks into the combination of DO and LOOP,
    transfers the returns stack into registers after inlining
    everything. It is near vfx performance.
    All experimental, but yes there is much to be gained.

    Must be tricky to do UNLOOP in a register-based loop. ;-)

    Here are the code generators for VFX x64 LOOP and UNLOOP.
    All the complexity is in the DO and ?DO code.

    : c_loop \ mrk> drbid -- ; compile code for LOOP ; SFP094
    c_shuffle reset-opt \ SFP097
    a[ INC r14 ]a use-a \ update index
    a[ INC r15 ]a use-a \ update limit-index-$8000.0000
    a[ JNO ]a <ares use-a \ resolve backward branch
    c_unloop \ remove DO ... LOOP state
    >RES \ resolve forward branch
    ;

    : c_unloop \ -- ; compile code for UNLOOP
    c_shuffle reset-opt
    a[ pop r14 \ restore old index
    pop r15 \ restore old index-limit-xorbit63
    pop rax \ discarded
    ]a use-a
    ;

    Stephen
    --
    Stephen Pelc, stephen@vfxforth.com
    Wodni & Pelc GmbH
    Vienna, Austria
    Tel: +44 (0)7803 903612, +34 649 662 974 http://www.vfxforth.com/downloads/VfxCommunity/
    free VFX Forth downloads

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to minforth@gmx.net on Sat Jun 28 11:34:10 2025
    In article <mc8dkkFeh4uU1@mid.individual.net>,
    minforth <minforth@gmx.net> wrote:
    Am 27.06.2025 um 20:15 schrieb albert@spenarnc.xs4all.nl:
    In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    It really depends on how counted loops are implemented.
    Most CPUs have operators for register-based count-down loops
    that are blazingly fast.

    If they can be used within Forth-based loop constructs
    I would expect a greater speed increase than what you measured.

    In that old fig-Forth it's rather short and simple:

    sqHeader '(LOOP)'
    XLOOP dw $ + 2
    mov BX,1
    XLOO1: add [BP],BX
    mov AX,[BP]
    sub AX,[BP+2]
    xor AX,BX
    js BRAN1
    add BP,4
    inc SI
    inc SI
    jmp NEXT

    It doesn't look that bad. Can it be
    done even shorter?

    My optimiser looks into the combination of DO and LOOP,
    transfers the returns stack into registers after inlining
    everything. It is near vfx performance.
    All experimental, but yes there is much to be gained.

    Must be tricky to do UNLOOP in a register-based loop. ;-)

    Indeed. One of my testcase is:
    ---------------------------
    \ Interfering LEAVEs and EXITs.
    : (TESTL) DO
    DUP IF LEAVE ELSE UNLOOP EXIT THEN SWAP
    2DUP IF LEAVE ELSE UNLOOP EXIT THEN 2SWAP
    LOOP ROT ;
    : testL (TESTL) 2OVER ;
    SEE (TESTL)
    'testL SHOW-IT
    -------------------------------------------------
    The result after inlining is:

    ---------------------------------------------


    : testL
    0 >R SWAP >R >R DUP
    0BRANCH [ 20 , ] ( between ? UNLOOP )
    BRANCH [ B0 , ] ( between ? UNLOOP )
    BRANCH [ 18 , ] ( between ? SWAP ) UNLOOP
    BRANCH [ 98 , ] ( between ROT 2OVER ) SWAP 2DUP
    0BRANCH [ 20 , ] ( between ? UNLOOP )
    BRANCH [ 58 , ] ( between ? UNLOOP )
    BRANCH [ 18 , ] ( between ? 2SWAP ) UNLOOP
    BRANCH [ 40 , ] ( between ROT 2OVER ) 2SWAP 1 (+LOOP)
    0BRANCH [ -D8 , ] ( between >R DUP ) UNLOOP ROT 2OVER
    ;

    --------------------------------------------
    then after some peepholing the assembly looks like
    Report about return stack usage
    new report
    2 8 1 0
    1 9 1 1
    0 10 2 2

    LEA, BP'| XO| [BP] 4294967272 L,
    Q: MOVI, X| R| BX| 0 IL,
    Q: MOV, X| F| BX'| XO| [BP] 16 L,
    POP|X, DX|
    POP|X, AX|
    PUSH|X, DX|
    Q: MOV, X| F| AX'| XO| [BP] 8 L,
    POP|X, BX|
    Q: MOV, X| F| BX'| XO| [BP] 0 L,
    POP|X, AX|
    PUSH|X, AX|
    Q: TEST, X| AX'| R| AX|
    J|X, Z| Y| 17 (RL,)
    POP|X, DX|
    POP|X, BX|
    POP|X, AX|
    PUSH|X, BX|
    PUSH|X, DX|
    PUSH|X, AX|
    LEA, BP'| XO| [BP] 24 L,
    JMP, 27 (RL,)
    LEA, BP'| XO| [BP] 24 L,
    JMP, 16 (RL,)
    JMP, 4294967263 (RL,)
    LEA, BP'| XO| [BP] 24 L,
    JMP, 0 (RL,)
    POP|X, BX|
    POP|X, CX|
    POP|X, AX|
    POP|X, DX|
    PUSH|X, DX|
    PUSH|X, AX|
    PUSH|X, CX|
    PUSH|X, BX|
    PUSH|X, DX|
    PUSH|X, AX|

    You see that here the elimination of BP (return stack) has not succeeded.
    Three BRANCH/0BRANCH have disappeared though.

    Simple cases are more succesful:
    ------------------------------------
    : test2aa 4 >R 2 >R 1 R> 3 R> ;
    'test2aa SHOW-IT
    : test2aa
    4 >R 2 >R 1 R> 3 R>
    ;
    Report about return stack usage
    new report
    1 8 1 1
    0 9 1 1

    PUSHI|X, 1 IL,
    PUSHI|X, 2 IL,
    PUSHI|X, 3 IL,
    QN: MOVI, X| R| AX| 4 IL,
    QN: MOVI, X| R| CX| 2 IL,
    PUSHI|X, 4 IL,
    ------------------------------------
    But the optimiser doesn't detect that moving into registers AX CX can
    be eleminated. (only DSP RSP and HIP - present in SP BP DI - are live
    at the end of a definition.





    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to minforth on Sat Jun 28 17:46:49 2025
    minforth@gmx.net (minforth) writes:
    IIRC DO..LOOPs had been a hack for computers in the 60s.
    A rather ugly hack, born out of necessity, slow and
    often cumbersome to use.

    Could you elaborate on "a hack for computers in the 60s"?

    And while DO has an obvious shortcoming (partially addressed by ?DO),
    I have found that variations on ?DO..LOOP are quite helpful in keeping
    the number of items on the data stack manageable. They mean that I
    don't have to deal with the index and limit in the loop body, and that
    they are also out of the way, so I don't have to think about them in
    the loop body. And when I need the loop index, "I" gives it to me,
    like an automatically-defined local. To gain these advantages, I
    often prefer to massage the start and limit values more than
    otherwise. E.g., to walk through a cell array in the forward
    direction, I usually prefer

    ( x y addr u ) cells bounds ?do ( x1 y1 )
    ... i @ ...
    1 cells +loop

    rather than

    ( x y addr u ) 0 ?do ( x1 y1 addr )
    ... dup i th @ ...
    loop
    drop

    because I then can access x1 and y1 in the loop without ADDR being in
    the way.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)