At least one Forth system implements DOES> inefficiently, but I
suspect that it's not alone in that.
In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.
- anton
In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
<SNIP>
In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.
Is that equally valid for indirect threaded code?
In indirect threaded code the instruction and data cache
are more separated, e.g. in a simple Forth all the low level
code could fit in the I-cache, if I'm not mistaken.
On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
<SNIP>
In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.
Is that equally valid for indirect threaded code?
In indirect threaded code the instruction and data cache
are more separated, e.g. in a simple Forth all the low level
code could fit in the I-cache, if I'm not mistaken.
Let's check. In kForth-64, an indirect threaded code system,
.s
<empty>
ok
f.s
fs: <empty>
ok
ms@ b4 ms@ swap - .
4274 ok
ms@ b5 ms@ swap - .
3648 ok
So b5 appears to be more efficient than b4 (the version with DOES>).
--
Krishna
=== begin code ===
50000000 constant iterations
: faccum create 1 floats allot? 0.0e f!
does> dup f@ f+ fdup f! ;
: faccum-part2 ( F: r1 -- r2 ) ( a -- )
dup f@ f+ fdup f! ;
faccum x4 2.0e x4 fdrop
faccum y4 -4.0e y4 fdrop
: b4 0.0e iterations 0 do x4 y4 loop ;
: b5 0.0e iterations 0 do
[ ' x4 >body ] literal faccum-part2
[ ' y4 >body ] literal faccum-part2
loop ;
=== end code ===
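The interactive timing above (`ms@ b4 ms@ swap - .`) can be wrapped in a small helper so that both benchmarks are measured the same way. This is only a sketch: TIMED is a hypothetical name, not a kForth word, and it assumes that MS@ ( -- u ) returns a millisecond count as in the session above and that b4 and b5 from the listing are already loaded.

```forth
\ Sketch only: TIMED is a hypothetical helper, not part of kForth.
\ Assumes MS@ ( -- u ) returns a millisecond count, as used in the session above.
: timed ( xt -- ) ( F: -- r )
    ms@ >r        \ start time, kept on the return stack
    execute       \ run the benchmark; it leaves its sum on the FP stack
    ms@ r> -      \ elapsed milliseconds
    . ." ms" cr ;

' b4 timed fdrop  \ discard the accumulated float after timing
' b5 timed fdrop
```

Passing the benchmark as an execution token keeps the timing overhead identical for both words, so the difference between the two printed values reflects only the loop bodies.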
On 7/14/24 07:20, Krishna Myneni wrote:
Using perf to obtain performance counter results for b4 and b5:
B4
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
    -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
    kforth64 -e "include does-microbench.4th b4 f. cr bye"
-inf
Goodbye.
Performance counter stats for 'kforth64 -e include does-microbench.4th b4 f. cr bye':
14_381_951_937 cycles:u
26_206_810_946 instructions:u # 1.82 insn per cycle
58_563 L1-dcache-load-misses:u
14_742 L1-icache-load-misses:u
100_122_231 branch-misses:u
4.501011307 seconds time elapsed
4.477172000 seconds user
0.003967000 seconds sys
B5
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u \
    -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses \
    kforth64 -e "include does-microbench.4th b5 f. cr bye"
-inf
Goodbye.
Performance counter stats for 'kforth64 -e include does-microbench.4th b5 f. cr bye':
11_529_644_734 cycles:u
18_906_809_683 instructions:u # 1.64 insn per cycle
59_605 L1-dcache-load-misses:u
21_531 L1-icache-load-misses:u
100_109_360 branch-misses:u
3.616353010 seconds time elapsed
3.600206000 seconds user
0.004639000 seconds sys
It appears that the cache misses are low for both b4 and b5, but the
branch misses are very high on my system: about 100 million for
50 million loop iterations, or roughly two per iteration.
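To put the counter values on a per-iteration footing, they can be divided by the loop count. The following is a rough sketch, not part of the original benchmark: PER-ITER and PER-CALL are hypothetical helper names, integer division truncates the results, and 64-bit cells are assumed because the counts do not fit in 32 bits.

```forth
\ Sketch only: PER-ITER and PER-CALL are hypothetical helper names.
\ Assumes 64-bit cells; the perf counts are typed in without underscores.
50000000 constant iterations
: per-iter ( n -- ) iterations / . ;
: per-call ( n -- ) iterations 2* / . ;  \ two accumulator calls per iteration

cr 14381951937 per-iter   \ b4 cycles per iteration         -> 287
cr 26206810946 per-iter   \ b4 instructions per iteration   -> 524
cr 11529644734 per-iter   \ b5 cycles per iteration         -> 230
cr 18906809683 per-iter   \ b5 instructions per iteration   -> 378
cr   100122231 per-iter   \ b4 branch misses per iteration  -> 2

\ Net difference between b4 and b5, spread over the 100 million calls
cr 26206810946 18906809683 - per-call   \ instructions per call -> 73
cr 14381951937 11529644734 - per-call   \ cycles per call       -> 28
```

Note that the last two figures are only the net difference between the two loops: b5 still pays for pushing a literal and calling faccum-part2, so they are not the absolute cost of a DOES> word.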
FlashForth.
This post inspired me to remove all the DOLIT, DODOES, DOCREATE and
DOUSER stuff and to instead compile an inline code literal and a jump
to the common class code.
CREATE was modified to compile one inline literal, a RETURN and a NOP;
the literal points to the data area. Later the RETURN and NOP are
replaced by a JUMP to the code after DOES> or in USER.
The PIC18FX has a nice feature: there is an instruction to push a
literal to the parameter stack.
Mucho spectacular. ;-)
Thanks, Mike