Forum: >>> Magnum BBS <<<

Re: OOS approach revisited

From minforth@21:1/5 to LIT on Fri Jun 27 02:16:20 2025

On Thu, 26 Jun 2025 17:27:48 +0000, LIT wrote:

The saving come from rolling @ @ + ! into a single very specialized
function. But what about the loading of X Y and retrieving of Z which
are unavoidable in practice? Should that not be included in the test?

Let's find out then:

1 VARIABLE X
2 VARIABLE Y
3 VARIABLE Z

: TEST1 1000 0 DO 10000 0 DO
I DUP X ! Y ! X @ Y @ + Z ! Z @ DROP
LOOP LOOP ;
: TEST2 1000 0 DO 10000 0 DO
I DUP X ! Y ! X Y Z +> Z @ DROP
LOOP LOOP ;
TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 252 ok
TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 202 ok

: TEST1 1000 0 DO 10000 0 DO
I DUP X ! Y ! 1 X +! 1 Y +! X @ Y @ + Z ! Z @ DROP
LOOP LOOP ;
: TEST2 1000 0 DO 10000 0 DO
I DUP X ! Y ! X ++ Y ++ X Y Z +> Z @ DROP
LOOP LOOP ;
TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 346 ok
TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 258 ok

The difference is smaller - still it's significant.

Another test - using the "drawing a box" example
from "Thinking Forth" (and "simulated" LINE word):

0 VARIABLE TOP
0 VARIABLE LEFT
0 VARIABLE BOTTOM
0 VARIABLE RIGHT
: LINE 2DROP 2DROP ;

: BOX1 ( x1 y1 x2 y2) BOTTOM ! RIGHT ! TOP ! LEFT !
LEFT @ TOP @ RIGHT @ TOP @ LINE
RIGHT @ TOP @ RIGHT @ BOTTOM @ LINE
RIGHT @ BOTTOM @ LEFT @ BOTTOM @ LINE
LEFT @ BOTTOM @ LEFT @ TOP @ LINE ;

: BOX2 ( x1 y1 x2 y2) BOTTOM ! RIGHT ! TOP ! LEFT !
LEFT TOP RIGHT TOP LINE
RIGHT TOP RIGHT BOTTOM LINE
RIGHT BOTTOM LEFT BOTTOM LINE
LEFT BOTTOM LEFT TOP LINE ;

: TEST1 1000 0 DO 10000 0 DO I DUP 2DUP BOX1 LOOP LOOP ;
: TEST2 1000 0 DO 10000 0 DO I DUP 2DUP BOX2 LOOP LOOP ;

TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 890 ok
TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 653 ok

The difference is even more significant in case
of multiplication:

1 VARIABLE X
2 VARIABLE Y
3 VARIABLE Z

: TEST1 1000 0 DO 10000 0 DO
I DUP X ! Y ! X @ Y @ * Z ! Z @ DROP
LOOP LOOP ;
: TEST2 1000 0 DO 10000 0 DO
I DUP X ! Y ! X Y Z *> Z @ DROP
LOOP LOOP ;
TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 658 ok
TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 200 ok

But this time better implementation has also
its impact; fig-Forth's '*' is inefficient,
and I coded '*>' of course directly in ML,
simply using IMUL.

With so many DO..LOOPs involved, be careful not to
measure more looping time than the multiplications.

IIRC DO..LOOPs had been a hack for computers in the 60s.
A rather ugly hack, born out of necessity, slow and
often cumbersome to use. That it still persists in Forth
half a century later speaks for Forth's progressiveness.

--

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From minforth@21:1/5 to All on Fri Jun 27 11:49:11 2025

Am 27.06.2025 um 09:29 schrieb dxf:

On 27/06/2025 12:16 pm, minforth wrote:

...
IIRC DO..LOOPs had been a hack for computers in the 60s.
A rather ugly hack, born out of necessity, slow and
often cumbersome to use. That it still persists in Forth
half a century later speaks for Forth's progressiveness.

Testing FOR NEXT on my DTC system showed 15% speed increase over
DO LOOP. Putting 5 NOOPs (executes forth's address interpreter)
in the innermost loop brought it down to 6%. Not worth it IMO.

It really depends on how counted loops are implemented.
Most CPUs have operators for register-based count-down loops
that are blazingly fast.

If they can be used within Forth-based loop constructs
I would expect a greater speed increase than what you measured.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From albert@spenarnc.xs4all.nl@21:1/5 to LIT on Fri Jun 27 20:15:16 2025

In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
LIT <zbigniew2011@gmail.com> wrote:

It really depends on how counted loops are implemented.
Most CPUs have operators for register-based count-down loops
that are blazingly fast.

If they can be used within Forth-based loop constructs
I would expect a greater speed increase than what you measured.

In that old fig-Forth it's rather short and simple:

sqHeader '(LOOP)'
XLOOP dw $ + 2
mov BX,1
XLOO1: add [BP],BX
mov AX,[BP]
sub AX,[BP+2]
xor AX,BX
js BRAN1
add BP,4
inc SI
inc SI
jmp NEXT

It doesn't look that bad. Can it be
done even shorter?

My optimiser looks into the combination of DO and LOOP,
transfers the returns stack into registers after inlining
everything. It is near vfx performance.
All experimental, but yes there is much to be gained.

Groetjes Albert
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From minforth@21:1/5 to All on Fri Jun 27 22:35:32 2025

Am 27.06.2025 um 20:15 schrieb albert@spenarnc.xs4all.nl:

In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
LIT <zbigniew2011@gmail.com> wrote:

It really depends on how counted loops are implemented.
Most CPUs have operators for register-based count-down loops
that are blazingly fast.

If they can be used within Forth-based loop constructs
I would expect a greater speed increase than what you measured.

In that old fig-Forth it's rather short and simple:

sqHeader '(LOOP)'
XLOOP dw $ + 2
mov BX,1
XLOO1: add [BP],BX
mov AX,[BP]
sub AX,[BP+2]
xor AX,BX
js BRAN1
add BP,4
inc SI
inc SI
jmp NEXT

It doesn't look that bad. Can it be
done even shorter?

My optimiser looks into the combination of DO and LOOP,
transfers the returns stack into registers after inlining
everything. It is near vfx performance.
All experimental, but yes there is much to be gained.

Must be tricky to do UNLOOP in a register-based loop. ;-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Pelc@21:1/5 to minforth on Sat Jun 28 09:37:33 2025

On 27 Jun 2025 at 22:35:32 CEST, "minforth" <minforth@gmx.net> wrote:

Am 27.06.2025 um 20:15 schrieb albert@spenarnc.xs4all.nl:

In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
LIT <zbigniew2011@gmail.com> wrote:

It really depends on how counted loops are implemented.
Most CPUs have operators for register-based count-down loops
that are blazingly fast.

If they can be used within Forth-based loop constructs
I would expect a greater speed increase than what you measured.

In that old fig-Forth it's rather short and simple:

sqHeader '(LOOP)'
XLOOP dw $ + 2
mov BX,1
XLOO1: add [BP],BX
mov AX,[BP]
sub AX,[BP+2]
xor AX,BX
js BRAN1
add BP,4
inc SI
inc SI
jmp NEXT

It doesn't look that bad. Can it be
done even shorter?

My optimiser looks into the combination of DO and LOOP,
transfers the returns stack into registers after inlining
everything. It is near vfx performance.
All experimental, but yes there is much to be gained.

Must be tricky to do UNLOOP in a register-based loop. ;-)

Here are the code generators for VFX x64 LOOP and UNLOOP.
All the complexity is in the DO and ?DO code.

: c_loop \ mrk> drbid -- ; compile code for LOOP ; SFP094
c_shuffle reset-opt \ SFP097
a[ INC r14 ]a use-a \ update index
a[ INC r15 ]a use-a \ update limit-index-$8000.0000
a[ JNO ]a <ares use-a \ resolve backward branch
c_unloop \ remove DO ... LOOP state
>RES \ resolve forward branch
;

: c_unloop \ -- ; compile code for UNLOOP
c_shuffle reset-opt
a[ pop r14 \ restore old index
pop r15 \ restore old index-limit-xorbit63
pop rax \ discarded
]a use-a
;

Stephen
--
Stephen Pelc, stephen@vfxforth.com
Wodni & Pelc GmbH
Vienna, Austria
Tel: +44 (0)7803 903612, +34 649 662 974 http://www.vfxforth.com/downloads/VfxCommunity/
free VFX Forth downloads

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From albert@spenarnc.xs4all.nl@21:1/5 to minforth@gmx.net on Sat Jun 28 11:34:10 2025

In article <mc8dkkFeh4uU1@mid.individual.net>,
minforth <minforth@gmx.net> wrote:

Am 27.06.2025 um 20:15 schrieb albert@spenarnc.xs4all.nl:

In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
LIT <zbigniew2011@gmail.com> wrote:

It really depends on how counted loops are implemented.
Most CPUs have operators for register-based count-down loops
that are blazingly fast.

If they can be used within Forth-based loop constructs
I would expect a greater speed increase than what you measured.

In that old fig-Forth it's rather short and simple:

sqHeader '(LOOP)'
XLOOP dw $ + 2
mov BX,1
XLOO1: add [BP],BX
mov AX,[BP]
sub AX,[BP+2]
xor AX,BX
js BRAN1
add BP,4
inc SI
inc SI
jmp NEXT

It doesn't look that bad. Can it be
done even shorter?

My optimiser looks into the combination of DO and LOOP,
transfers the returns stack into registers after inlining
everything. It is near vfx performance.
All experimental, but yes there is much to be gained.

Must be tricky to do UNLOOP in a register-based loop. ;-)

Indeed. One of my testcase is:
---------------------------
\ Interfering LEAVEs and EXITs.
: (TESTL) DO
DUP IF LEAVE ELSE UNLOOP EXIT THEN SWAP
2DUP IF LEAVE ELSE UNLOOP EXIT THEN 2SWAP
LOOP ROT ;
: testL (TESTL) 2OVER ;
SEE (TESTL)
'testL SHOW-IT
-------------------------------------------------
The result after inlining is:

---------------------------------------------

: testL
0 >R SWAP >R >R DUP
0BRANCH [ 20 , ] ( between ? UNLOOP )
BRANCH [ B0 , ] ( between ? UNLOOP )
BRANCH [ 18 , ] ( between ? SWAP ) UNLOOP
BRANCH [ 98 , ] ( between ROT 2OVER ) SWAP 2DUP
0BRANCH [ 20 , ] ( between ? UNLOOP )
BRANCH [ 58 , ] ( between ? UNLOOP )
BRANCH [ 18 , ] ( between ? 2SWAP ) UNLOOP
BRANCH [ 40 , ] ( between ROT 2OVER ) 2SWAP 1 (+LOOP)
0BRANCH [ -D8 , ] ( between >R DUP ) UNLOOP ROT 2OVER
;

--------------------------------------------
then after some peepholing the assembly looks like
Report about return stack usage
new report
2 8 1 0
1 9 1 1
0 10 2 2

LEA, BP'| XO| [BP] 4294967272 L,
Q: MOVI, X| R| BX| 0 IL,
Q: MOV, X| F| BX'| XO| [BP] 16 L,
POP|X, DX|
POP|X, AX|
PUSH|X, DX|
Q: MOV, X| F| AX'| XO| [BP] 8 L,
POP|X, BX|
Q: MOV, X| F| BX'| XO| [BP] 0 L,
POP|X, AX|
PUSH|X, AX|
Q: TEST, X| AX'| R| AX|
J|X, Z| Y| 17 (RL,)
POP|X, DX|
POP|X, BX|
POP|X, AX|
PUSH|X, BX|
PUSH|X, DX|
PUSH|X, AX|
LEA, BP'| XO| [BP] 24 L,
JMP, 27 (RL,)
LEA, BP'| XO| [BP] 24 L,
JMP, 16 (RL,)
JMP, 4294967263 (RL,)
LEA, BP'| XO| [BP] 24 L,
JMP, 0 (RL,)
POP|X, BX|
POP|X, CX|
POP|X, AX|
POP|X, DX|
PUSH|X, DX|
PUSH|X, AX|
PUSH|X, CX|
PUSH|X, BX|
PUSH|X, DX|
PUSH|X, AX|

You see that here the elimination of BP (return stack) has not succeeded.
Three BRANCH/0BRANCH have disappeared though.

Simple cases are more succesful:
------------------------------------
: test2aa 4 >R 2 >R 1 R> 3 R> ;
'test2aa SHOW-IT
: test2aa
4 >R 2 >R 1 R> 3 R>
;
Report about return stack usage
new report
1 8 1 1
0 9 1 1

PUSHI|X, 1 IL,
PUSHI|X, 2 IL,
PUSHI|X, 3 IL,
QN: MOVI, X| R| AX| 4 IL,
QN: MOVI, X| R| CX| 2 IL,
PUSHI|X, 4 IL,
------------------------------------
But the optimiser doesn't detect that moving into registers AX CX can
be eleminated. (only DSP RSP and HIP - present in SP BP DI - are live
at the end of a definition.

--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to minforth on Sat Jun 28 17:46:49 2025

minforth@gmx.net (minforth) writes:

IIRC DO..LOOPs had been a hack for computers in the 60s.
A rather ugly hack, born out of necessity, slow and
often cumbersome to use.

Could you elaborate on "a hack for computers in the 60s"?

And while DO has an obvious shortcoming (partially addressed by ?DO),
I have found that variations on ?DO..LOOP are quite helpful in keeping
the number of items on the data stack manageable. They mean that I
don't have to deal with the index and limit in the loop body, and that
they are also out of the way, so I don't have to think about them in
the loop body. And when I need the loop index, "I" gives it to me,
like an automatically-defined local. To gain these advantages, I
often prefer to massage the start and limit values more than
otherwise. E.g., to walk through a cell array in the forward
direction, I usually prefer

( x y addr u ) cells bounds ?do ( x1 y1 )
... i @ ...
1 cells +loop

rather than

( x y addr u ) 0 ?do ( x1 y1 addr )
... dup i th @ ...
loop
drop

because I then can access x1 and y1 in the loop without ADDR being in
the way.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Csaba
  Wed Jul 2 16:56:31 2025
  from Budapest via SSH
- Dindjarin
  Wed Jul 2 16:51:20 2025
  from Mandalore via SSH
- Bob Worm
  Wed Jul 2 08:43:00 2025
  from Wales, Uk via Telnet
- Dindjarin
  Wed Jul 2 06:31:28 2025
  from Mandalore via SSH
- Dindjarin
  Wed Jul 2 06:16:35 2025
  from Mandalore via SSH
- Dindjarin
  Wed Jul 2 05:31:07 2025
  from Mandalore via SSH
- Dindjarin
  Wed Jul 2 05:23:25 2025
  from Mandalore via SSH
- Dindjarin
  Wed Jul 2 02:18:23 2025
  from Mandalore via SSH

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	504
Nodes:	16 (2 / 14)
Uptime:	237:30:49
Calls:	9,887
Calls today:	9
Files:	13,794
Messages:	6,207,246
Posted today:	3

Re: OOS approach revisited

Who's Online

Recent Visitors

System Info