The savings come from rolling @ @ + ! into a single, very specialized
function. But what about loading X and Y, and retrieving Z, which
are unavoidable in practice? Shouldn't that be included in the test?
Let's find out:
1 VARIABLE X
2 VARIABLE Y
3 VARIABLE Z
: TEST1 1000 0 DO 10000 0 DO
I DUP X ! Y ! X @ Y @ + Z ! Z @ DROP
LOOP LOOP ;
: TEST2 1000 0 DO 10000 0 DO
I DUP X ! Y ! X Y Z +> Z @ DROP
LOOP LOOP ;
TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 252 ok
TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 202 ok
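For reference, a high-level Forth equivalent of +> would be something
like the sketch below (assuming it takes three variable addresses, as
in TEST2 above; the actual word is a code primitive):
\ ( addr1 addr2 addr3 -- )  store [addr1] + [addr2] into addr3
: +>   >R  @ SWAP @ +  R> ! ;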
: TEST1 1000 0 DO 10000 0 DO
I DUP X ! Y ! 1 X +! 1 Y +! X @ Y @ + Z ! Z @ DROP
LOOP LOOP ;
: TEST2 1000 0 DO 10000 0 DO
I DUP X ! Y ! X ++ Y ++ X Y Z +> Z @ DROP
LOOP LOOP ;
TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 346 ok
TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 258 ok
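Likewise, ++ as used in TEST2 can be read as this sketch (assuming it
simply increments the variable whose address it is given):
\ ( addr -- )  add 1 to the cell at addr
: ++   1 SWAP +! ;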
The difference is smaller, but still significant.
Another test, using the "drawing a box" example
from "Thinking Forth" (with a "simulated" LINE word):
0 VARIABLE TOP
0 VARIABLE LEFT
0 VARIABLE BOTTOM
0 VARIABLE RIGHT
: LINE 2DROP 2DROP ;
: BOX1 ( x1 y1 x2 y2) BOTTOM ! RIGHT ! TOP ! LEFT !
LEFT @ TOP @ RIGHT @ TOP @ LINE
RIGHT @ TOP @ RIGHT @ BOTTOM @ LINE
RIGHT @ BOTTOM @ LEFT @ BOTTOM @ LINE
LEFT @ BOTTOM @ LEFT @ TOP @ LINE ;
: BOX2 ( x1 y1 x2 y2) BOTTOM ! RIGHT ! TOP ! LEFT !
LEFT TOP RIGHT TOP LINE
RIGHT TOP RIGHT BOTTOM LINE
RIGHT BOTTOM LEFT BOTTOM LINE
LEFT BOTTOM LEFT TOP LINE ;
: TEST1 1000 0 DO 10000 0 DO I DUP 2DUP BOX1 LOOP LOOP ;
: TEST2 1000 0 DO 10000 0 DO I DUP 2DUP BOX2 LOOP LOOP ;
TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 890 ok
TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 653 ok
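In standard Forth terms, the bare-name fetch in BOX2 could be
approximated with VALUEs; this is only a sketch under that assumption,
not how the system above actually does it:
0 VALUE TOP   0 VALUE LEFT   0 VALUE BOTTOM   0 VALUE RIGHT
: LINE ( x1 y1 x2 y2 -- ) 2DROP 2DROP ;
: BOX2 ( x1 y1 x2 y2 -- )  TO BOTTOM  TO RIGHT  TO TOP  TO LEFT
   LEFT TOP  RIGHT TOP  LINE
   RIGHT TOP  RIGHT BOTTOM  LINE
   RIGHT BOTTOM  LEFT BOTTOM  LINE
   LEFT BOTTOM  LEFT TOP  LINE ;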
The difference is even more significant in the case
of multiplication:
1 VARIABLE X
2 VARIABLE Y
3 VARIABLE Z
: TEST1 1000 0 DO 10000 0 DO
I DUP X ! Y ! X @ Y @ * Z ! Z @ DROP
LOOP LOOP ;
: TEST2 1000 0 DO 10000 0 DO
I DUP X ! Y ! X Y Z *> Z @ DROP
LOOP LOOP ;
TICKS TEST1 TICKS 2SWAP DMINUS D+ D. 658 ok
TICKS TEST2 TICKS 2SWAP DMINUS D+ D. 200 ok
But this time the better implementation also plays a part:
fig-Forth's '*' is inefficient, while I coded '*>'
directly in ML, simply using IMUL.
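A high-level sketch of *> for comparison (only to pin down the stack
effect; the real word is the ML primitive just mentioned):
\ ( addr1 addr2 addr3 -- )  store [addr1] * [addr2] into addr3
: *>   >R  @ SWAP @ *  R> ! ;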
On 27/06/2025 12:16 pm, minforth wrote:
...
IIRC, DO..LOOPs were a hack for computers in the 60s.
A rather ugly hack, born out of necessity, slow and
often cumbersome to use. That it still persists in Forth
half a century later speaks for Forth's progressiveness.
Testing FOR NEXT on my DTC system showed a 15% speed increase over
DO LOOP. Putting 5 NOOPs (each executes Forth's address interpreter)
in the innermost loop brought it down to 6%. Not worth it, IMO.
It really depends on how counted loops are implemented.
Most CPUs have operators for register-based count-down loops
that are blazingly fast.
If they can be used within Forth-based loop constructs,
I would expect a greater speed increase than what you measured.
In that old fig-Forth it's rather short and simple:
sqHeader '(LOOP)'
XLOOP   dw $ + 2        ; code field points to the code that follows
        mov BX,1        ; increment = 1
XLOO1:  add [BP],BX     ; add increment to the index (top of return stack)
        mov AX,[BP]     ; fetch the updated index
        sub AX,[BP+2]   ; index - limit
        xor AX,BX       ; flip the sign of the test for a negative increment
        js  BRAN1       ; still within range: take the in-line branch back
        add BP,4        ; done: drop index and limit from the return stack
        inc SI          ; skip the in-line branch offset
        inc SI
        jmp NEXT        ; on to the next word
It doesn't look that bad. Can it be
done even shorter?
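For what it's worth, the continuation test the code performs can be
written out in high-level Forth like this (a sketch of the same idea;
the real primitive also adds the increment first and handles the
in-line branch offset):
\ ( index limit increment -- continue? )
\ keep looping while the sign of (index-limit) XOR increment is negative
: LOOP-CONTINUES?   >R  -  R> XOR  0< ;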
In article <bc63996456fe967e5c66d17cbbeb21c2@www.novabbs.com>,
LIT <zbigniew2011@gmail.com> wrote:
...
My optimiser looks at the combination of DO and LOOP and
transfers the return stack into registers after inlining
everything. It is near VFX performance.
All experimental, but yes, there is much to be gained.
On 27.06.2025 at 20:15, albert@spenarnc.xs4all.nl wrote:
...
Must be tricky to do UNLOOP in a register-based loop. ;-)