• Implementing DOES>: How not to do it (and why not) and how to do it

    From Anton Ertl@21:1/5 to All on Thu Jul 11 14:06:02 2024
    At least one Forth system implements DOES> inefficiently, but I
    suspect that it's not alone in that. I have reported the issue to the
    system vendor, and hopefully they will fix it. Here I discuss the
    issue so that possibly other system implementors can avoid these
    pitfalls. I do not name the system here to protect the guilty, but
    make no additional effort at anonymizing it.

Here's a microbenchmark that puts a spotlight on the issue. I
started investigating it because the system performed badly on an
application benchmark that calls DOES>-defined words a lot, so this
microbenchmark is not just contrived.

    50000000 constant iterations

    : d1 create 0 , does> ;
    : d1-part2 ;

    d1 x1
    d1 y1

    : b1 iterations 0 do x1 dup ! y1 dup ! loop ;
    : c1 iterations 0 do x1 drop y1 drop loop ;

: b3
  iterations 0 do
    [ ' x1 >body ] literal d1-part2 dup !
    [ ' y1 >body ] literal d1-part2 dup !
  loop ;
: c3
  iterations 0 do
    [ ' x1 >body ] literal d1-part2 drop
    [ ' y1 >body ] literal d1-part2 drop
  loop ;

B1 and C1 use the DOES>-defined words in the way the system
implements them, while B3 and C3 use them in the way the system
should implement them (when COMPILE,ing the xts of X1 and Y1): put
the address of the body on the data stack as a literal, then call
the code behind DOES> (or, in this case, a colon definition that
contains the same code).
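As a minimal sketch of that strategy (hedged: DOES-CODE is a
hypothetical system-internal accessor from the xt of a DOES>-defined
word to an xt for the code behind its DOES>; no actual system's
internals are quoted here):

    : does-compile, ( xt -- )
      dup >body postpone literal \ compile the body address as a literal
      does-code compile, ;       \ compile a call to the DOES> code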

    Let's see the performance (all numbers from a Tiger Lake), first for
    C3/C1:

    perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses sf64 "include does-microbench.fs c3 bye"


    Performance counter stats for 'sf64 include does-microbench.fs c3 bye':

    402963130 cycles:u
    804105655 instructions:u # 2.00 insn per cycle
    83766 L1-dcache-load-misses
    1750283 L1-icache-load-misses
    36403 branch-misses

    0.114868603 seconds time elapsed

    The code for the X1 part of the loop here is:

    44EB26 -8 [RBP] RBP LEA 488D6DF8
    44EB2A RBX 0 [RBP] MOV 48895D00
    44EB2E 44E970 # EBX MOV BB70E94400
    44EB33 44E94D ( d1-part2 ) CALL E815FEFFFF
    44EB38 0 [RBP] RBX MOV 488B5D00
    44EB3C 8 [RBP] RBP LEA 488D6D08

and D1-PART2 is:

    44E94D RET C3

    This loop takes 8 cycles per iteration (about half of that for each
    simulated DOES>-defined word). Now for C1:


    3502930384 cycles:u
    903847649 instructions:u # 0.26 insn per cycle
    93579 L1-dcache-load-misses
    1813286 L1-icache-load-misses
    100033846 branch-misses

    0.846190766 seconds time elapsed

    This loop takes 70 cycles per iteration (i.e., almost 9 times slower)
    and has one branch misprediction per DOES>-defined word. Let's see
    why:

    The code in the loop for the X1 part is:

    44EA42 44E96B ( x1 ) CALL E824FFFFFF
    44EA47 0 [RBP] RBX MOV 488B5D00
    44EA4B 8 [RBP] RBP LEA 488D6D08

    It calls X1:

    44E96B 44E927 ( d1 +1C ) CALL E8B7FFFFFF

    which in turn calls this code:

    44E927 -8 [RBP] RBP LEA 488D6DF8
    44E92B RBX 0 [RBP] MOV 48895D00
    44E92F RBX POP 5B
    44E930 RET C3

Here the return address of the call at 44E96B is popped and used as
the body address of X1. However, the hardware return stack predicts
that the following RET returns to that return address; this is a
misprediction, because the RET actually returns to the return
address of the outer call (44EA47). You can see here that
mispredictions are expensive.
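Spelled out with the addresses from the listings above (the return
stack in question is the hardware predictor's, not Forth's):

    CALL 44E96B    pushes 44EA47 onto the hardware return stack
    CALL 44E927    pushes 44E970 (the body address) onto it
    RBX POP        pops 44E970 from the software stack only
    RET            predicted target: 44E970; actual target: 44EA47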

    Let's turn to B1. The code for the X1 part of the loop is:

    44E9CE 44E96B ( x1 ) CALL E898FFFFFF
    44E9D3 -8 [RBP] RBP LEA 488D6DF8
    44E9D7 RBX 0 [RBP] MOV 48895D00
    44E9DB 0 [RBP] RAX MOV 488B4500
    44E9DF RAX 0 [RBX] MOV 488903
    44E9E2 8 [RBP] RBX MOV 488B5D08
    44E9E6 10 [RBP] RBP LEA 488D6D10

    The results are:

    44758764805 cycles:u
    1303768694 instructions:u # 0.03 insn per cycle
    150154275 L1-dcache-load-misses
    202142121 L1-icache-load-misses
    100051173 branch-misses

    10.699859443 seconds time elapsed

    10.696626000 seconds user
    0.004000000 seconds sys

So in addition to the branch mispredictions that we also saw in C1,
we see 3 D-cache misses and 4 I-cache misses per iteration,
resulting in 895 cycles per iteration. Part of this cache ping-pong
occurs because the call at 44E96B is in the same cache line as the
stored-to data at 44E970. The store wants the cache line in the
D-cache, while the call to 44E96B wants it in the I-cache. This is
an issue I first pointed out in 1995, and Forth systems still have
not fixed it completely 29 years later.

    Let's see if the B3 variant escapes this problem:

    20590473606 cycles:u
    1204106872 instructions:u # 0.06 insn per cycle
    50145367 L1-dcache-load-misses
    101982668 L1-icache-load-misses
    49127 branch-misses

    4.932825183 seconds time elapsed

    4.926391000 seconds user
    0.003998000 seconds sys

It's better, but there is still one D-cache miss and 2 I-cache
misses per iteration. It seems that the distance between the
D1-PART2 code and the X1 data is not enough to completely avoid the
issue (but this is just a guess).

In any case, having the call right before the data, and executing it
on every call to a DOES>-defined word, is responsible for 2 I-cache
misses and 2 D-cache misses per iteration, so it should be avoided
by generating code like that in C3 and B3.

For dealing with the remaining cache-consistency problems, most
Forth systems have chosen to put padding between code and data,
increasing the padding in response to slowness reports. Another
alternative is to put the code in a memory region separate from the
data.
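A minimal sketch of the padding approach, assuming code and data
share one dictionary space and a 64-byte cache line (both
assumptions mine, not from any particular system):

    64 constant /cacheline  \ assumed line size on current x86 CPUs
    : cacheline-align ( -- ) \ pad data space to the next line boundary
      here /cacheline mod ?dup if /cacheline swap - allot then ;
    \ a defining word would execute CACHELINE-ALIGN between laying
    \ down the code and ALLOTting the data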

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2024: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Jul 13 15:31:38 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>At least one Forth system implements DOES> inefficiently, but I
>suspect that it's not alone in that.

    And indeed, a second system has the same problem; it shows up more
    rarely, because normally this system inlines does>-defined words, but
    when it does not, it performs badly.

Here's a microbenchmark where the second system does not inline the
DOES>-defined word:

    50000000 constant iterations
: faccum
  create 0e f,
  does> ( r1 -- r2 )
    dup f@ f+ fdup f! ;

: faccum-part2 ( r1 addr -- r2 )
  dup f@ f+ fdup f! ;

faccum x4 \ 2e x4 fdrop
faccum y4 \ -4e y4 fdrop

: b4 0e iterations 0 do x4 y4 loop ;
: b5 0e iterations 0 do
    [ ' x4 >body ] literal faccum-part2
    [ ' y4 >body ] literal faccum-part2
  loop ;


First, let's see what the Forth systems do by themselves (the B4
microbenchmark). Numbers are from a Skylake; I have replaced the
names of the two Forth systems with inefficient DOES>
implementations by A and B.

    [~/forth:150659] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b4 f. cr bye"
    0.

    Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b4 f. cr bye':

    948_628_907 cycles:u
    3_695_796_028 instructions:u # 3.90 insn per cycle
    1_154_670 L1-dcache-load-misses
    198_627 L1-icache-load-misses
    306_507 branch-misses

    0.245984689 seconds time elapsed

    0.244894000 seconds user
    0.000000000 seconds sys


    [~/forth:150660] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b4 f. cr bye"
    0.00000000


    Performance counter stats for 'A include does-microbench.fs b4 f. cr bye':

    38_769_505_700 cycles:u
    1_704_476_397 instructions:u # 0.04 insn per cycle
    178_288_238 L1-dcache-load-misses
    250_454_606 L1-icache-load-misses
    100_090_310 branch-misses

    9.719803719 seconds time elapsed

    9.715343000 seconds user
    0.000000000 seconds sys


    [~/forth:150661] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b4 f. cr bye"

    Including does-microbench.fs0.


    Performance counter stats for 'B include does-microbench.fs b4 f. cr bye':

    39_200_313_445 cycles:u
    1_413_936_888 instructions:u # 0.04 insn per cycle
    150_445_572 L1-dcache-load-misses
    209_127_540 L1-icache-load-misses
    100_128_427 branch-misses

    9.822342252 seconds time elapsed

    9.817016000 seconds user
    0.000000000 seconds sys

So both A and B fall into the cache-ping-pong and the return-stack
misprediction pitfalls in this case, resulting in a factor-40
slowdown compared to Gforth.

    Let's see how it works if we use the code I suggest (simulated in B5):

    [~/forth:150662] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b5 f. cr bye"
    0.

    Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b5 f. cr bye':

    943_277_009 cycles:u
    3_295_795_332 instructions:u # 3.49 insn per cycle
    1_147_107 L1-dcache-load-misses
    198_364 L1-icache-load-misses
    295_186 branch-misses

    0.247765182 seconds time elapsed

    0.242645000 seconds user
    0.004044000 seconds sys


    [~/forth:150663] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b5 f. cr bye"
    0.00000000


    Performance counter stats for 'A include does-microbench.fs b5 f. cr bye':

    23_587_381_659 cycles:u
    1_604_475_561 instructions:u # 0.07 insn per cycle
    100_111_296 L1-dcache-load-misses
    100_502_420 L1-icache-load-misses
    77_126 branch-misses

    6.055177414 seconds time elapsed

    6.055288000 seconds user
    0.000000000 seconds sys


    [~/forth:150664] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b5 f. cr bye"

    Including does-microbench.fs0.

    Performance counter stats for 'B include does-microbench.fs b5 f. cr bye':

    949_044_323 cycles:u
    1_313_933_897 instructions:u # 1.38 insn per cycle
    246_252 L1-dcache-load-misses
    105_517 L1-icache-load-misses
    61_449 branch-misses

    0.239750023 seconds time elapsed

    0.239811000 seconds user
    0.000000000 seconds sys

    This solves both problems for B, but A still suffers from
    cache ping-pong; I suspect that this is because there is not enough
    distance between the modified data and FACCUM-PART2 (or, less likely,
    not enough distance between the modified data and the loop in B5).

In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2024: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to Anton Ertl on Sun Jul 14 11:02:52 2024
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

>In any case, if you are a system implementor, you may want to check
>your DOES> implementation with a microbenchmark that stores into the
>does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Jul 14 08:00:17 2024
    On 7/14/24 07:20, Krishna Myneni wrote:
    On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    In any case, if you are a system implementor, you may want to check
    your DOES> implementation with a microbenchmark that stores into the
    does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Let's check. In kForth-64, an indirect threaded code system,

    .s
    <empty>
     ok
    f.s
    fs: <empty>
     ok
    ms@ b4 ms@ swap - .
    4274  ok
    ms@ b5 ms@ swap - .
    3648  ok

So b5 appears to be more efficient than b4 (the version with DOES>).


This discrepancy between b4 and b5 is due to the fact that, in
kForth, DOES> does not inline the code into the created word. So
there is room for improvement in the implementation of DOES> in
kForth. There are other pressing issues to deal with, such as making
the User's Manual more useful (the most important item) and adding
other standardized words, but improving DOES> is low-hanging fruit.

    --
    Krishna


    === begin code ===
// DOES> ( -- )
//
// Forth 2012
int CPP_does()
{
    // Allocate new opcode array
    byte* p = new byte[2*WSIZE+4];

    // Insert pfa of last word in dictionary
    p[0] = OP_ADDR;
    WordListEntry* pWord = *(pCompilationWL->end() - 1);
    *((long int*)(p+1)) = (long int) pWord->Pfa;

    // Insert current instruction ptr
    p[WSIZE+1] = OP_ADDR;
    *((long int*)(p+WSIZE+2)) = (long int)(GlobalIp + 1);

    p[2*WSIZE+2] = OP_EXECUTE_BC;
    p[2*WSIZE+3] = OP_RET;

    pWord->Cfa = (void*) p;
    pWord->WordCode = OP_DEFINITION;

    L_ret();
    return 0;
}
=== end code ===

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to albert@spenarnc.xs4all.nl on Sun Jul 14 07:20:01 2024
    On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    In any case, if you are a system implementor, you may want to check
    your DOES> implementation with a microbenchmark that stores into the
    does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Let's check. In kForth-64, an indirect threaded code system,

    .s
    <empty>
    ok
    f.s
    fs: <empty>
    ok
    ms@ b4 ms@ swap - .
    4274 ok
    ms@ b5 ms@ swap - .
    3648 ok

So b5 appears to be more efficient than b4 (the version with DOES>).

    --
    Krishna

    === begin code ===
    50000000 constant iterations

: faccum create 1 floats allot? 0.0e f!
  does> dup f@ f+ fdup f! ;

: faccum-part2 ( F: r1 -- r2 ) ( a -- )
  dup f@ f+ fdup f! ;

faccum x4  2.0e x4 fdrop
faccum y4 -4.0e y4 fdrop

: b4 0.0e iterations 0 do x4 y4 loop ;
: b5 0.0e iterations 0 do
    [ ' x4 >body ] literal faccum-part2
    [ ' y4 >body ] literal faccum-part2
  loop ;
    === end code ===

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to albert@spenarnc.xs4all.nl on Sun Jul 14 13:00:15 2024
    albert@spenarnc.xs4all.nl writes:
>In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
><SNIP>
>>In any case, if you are a system implementor, you may want to check
>>your DOES> implementation with a microbenchmark that stores into the
>>does-defined word in a case where that word is not inlined.
>
>Is that equally valid for indirect threaded code?
>In indirect threaded code the instruction and data cache
>are more separated, e.g. in a simple Forth all the low level
>code could fit in the I-cache, if I'm not mistaken.

It depends on how the system is implemented. In fig-Forth, the code
for the primitives is mixed with the data. In such a system, if you
have memory close to a primitive that is frequently written to, you
will see cache ping-pong. If the native code of the primitives is in
a separate area with enough distance to the data, there should be no
such issues for primitives.

    Also, there are at least two ways to implement DOES>-defined words:

    1) In addition to the code address of DODOES, have an extra cell
    pointing to the code after DOES> so that DODOES can find it. This
    is used in fig-Forth with <BUILDS (which reserves the extra cell in
    fig-Forth, whereas CREATE does not and cannot be used with DOES> in
    fig-Forth). In Gforth all words including CREATEd words have a
    two-cell code field since the beginning, and indirect threaded
    variants of Gforth (including the hybrid direct/indirect threaded
    approach we have used on all architectures since 2001) have used
    this. With the new header the details are a bit different, but the
    principle is the same.

2) The F83 way of implementing CREATE ... DOES> (others probably
used this way earlier, and it probably led to the introduction of
CREATE...DOES>, but F83 is a system for which I can find
documentation): There is only a single cell at the code field, and
it points to a native-code CALL DODOES that sits between the
(DOES>) and the threaded code for the source code behind the
DOES>. DODOES then pops the return address of the call, which is
the address of the threaded code to be called by DODOES, while the
data address is in W, as usual. Gforth used a similar approach for
direct-threaded implementations (until 2001), but IIRC without the
calling and popping. It seems that the DOES> implementations of
systems A and B were inspired by this approach.

Inside F83 <https://www.forth.org/OffeteStore/1003_InsideF83.pdf>
discusses this starting in the section "High Level Inner
Interpreter" on page 45, but IMO Ting uses confusing terminology
here: what I call the run-time routines of defining words, Ting
calls the "inner interpreter" (for me, the "inner interpreter" is
NEXT).

For the first way, neither cache ping-pong nor a particularly high
level of branch mispredictions is expected.

For the second way, I expect cache ping-pong if there is written
data close to the DOES>. I don't expect a particularly high rate of
branch mispredictions in indirect threaded code: while the hardware
return stack will get out of sync because of the call-pop usage,
indirect threaded code does not have returns that would mispredict
because of this lack of synchronization.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2024: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Jul 14 13:32:19 2024
    On 7/14/24 07:20, Krishna Myneni wrote:
    On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    In any case, if you are a system implementor, you may want to check
    your DOES> implementation with a microbenchmark that stores into the
    does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Let's check. In kForth-64, an indirect threaded code system,

    .s
    <empty>
     ok
    f.s
    fs: <empty>
     ok
    ms@ b4 ms@ swap - .
    4274  ok
    ms@ b5 ms@ swap - .
    3648  ok

So b5 appears to be more efficient than b4 (the version with DOES>).

    --
    Krishna

    === begin code ===
    50000000 constant iterations

    : faccum  create 1 floats allot? 0.0e f!
        does> dup f@ f+ fdup f! ;

    : faccum-part2 ( F: r1 -- r2 ) ( a -- )
        dup f@ f+ fdup f! ;

    faccum x4  2.0e x4 fdrop
    faccum y4 -4.0e y4 fdrop

    : b4 0.0e iterations 0 do x4 y4 loop ;
    : b5 0.0e iterations 0 do
        [ ' x4 >body ] literal faccum-part2
        [ ' y4 >body ] literal faccum-part2
      loop ;
    === end code ===





Using perf to measure the B4 and B5 microbenchmarks:

    B4

    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include does-microbench.4th b4 f. cr bye"
    -inf
    Goodbye.

    Performance counter stats for 'kforth64 -e include does-microbench.4th
    b4 f. cr bye':

14_381_951_937 cycles:u
26_206_810_946 instructions:u # 1.82 insn per cycle
58_563 L1-dcache-load-misses:u
14_742 L1-icache-load-misses:u
100_122_231 branch-misses:u

4.501011307 seconds time elapsed

4.477172000 seconds user
0.003967000 seconds sys


    B5

    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include does-microbench.4th b5 f. cr bye"
    -inf
    Goodbye.

    Performance counter stats for 'kforth64 -e include does-microbench.4th
    b5 f. cr bye':

11_529_644_734 cycles:u
18_906_809_683 instructions:u # 1.64 insn per cycle
59_605 L1-dcache-load-misses:u
21_531 L1-icache-load-misses:u
100_109_360 branch-misses:u

3.616353010 seconds time elapsed

3.600206000 seconds user
0.004639000 seconds sys


It appears that the cache misses are fairly low for both b4 and b5,
but the branch misses are very high on my system.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Jul 14 14:28:33 2024
    On 7/14/24 13:32, Krishna Myneni wrote:
    On 7/14/24 07:20, Krishna Myneni wrote:
    On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    In any case, if you are a system implementor, you may want to check
    your DOES> implementation with a microbenchmark that stores into the
    does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Let's check. In kForth-64, an indirect threaded code system,

    .s
    <empty>
      ok
    f.s
    fs: <empty>
      ok
    ms@ b4 ms@ swap - .
    4274  ok
    ms@ b5 ms@ swap - .
    3648  ok

So b5 appears to be more efficient than b4 (the version with DOES>).

    --
    Krishna

    === begin code ===
    50000000 constant iterations

    : faccum  create 1 floats allot? 0.0e f!
         does> dup f@ f+ fdup f! ;

    : faccum-part2 ( F: r1 -- r2 ) ( a -- )
         dup f@ f+ fdup f! ;

    faccum x4  2.0e x4 fdrop
    faccum y4 -4.0e y4 fdrop

    : b4 0.0e iterations 0 do x4 y4 loop ;
    : b5 0.0e iterations 0 do
         [ ' x4 >body ] literal faccum-part2
         [ ' y4 >body ] literal faccum-part2
       loop ;
    === end code ===





    Using perf to obtain the microbenchmarks for B4 and B5,

    B4

    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include does-microbench.4th b4 f. cr bye"
    -inf
    Goodbye.

     Performance counter stats for 'kforth64 -e include does-microbench.4th
    b4 f. cr bye':

           14_381_951_937      cycles:u
           26_206_810_946      instructions:u     #    1.82  insn per cycle
                 58_563        L1-dcache-load-misses:u
                 14_742        L1-icache-load-misses:u
             100_122_231       branch-misses:u

           4.501011307 seconds time elapsed

           4.477172000 seconds user
           0.003967000 seconds sys


    B5

    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include does-microbench.4th b5 f. cr bye"
    -inf
    Goodbye.

     Performance counter stats for 'kforth64 -e include does-microbench.4th
    b5 f. cr bye':

           11_529_644_734      cycles:u
           18_906_809_683      instructions:u      #    1.64  insn per cycle
                 59_605        L1-dcache-load-misses:u
                 21_531        L1-icache-load-misses:u
             100_109_360       branch-misses:u

           3.616353010 seconds time elapsed

           3.600206000 seconds user
           0.004639000 seconds sys


    It appears that the cache misses are fairly small for both b4 and b5,
    but the branch misses are very high in my system.



The prior microbenchmarks were run on an old AMD A10-9600P @ 2.95
GHz. On a newer system with an Intel Core i5-8400 @ 2.8 GHz, there
were very few branch misses.

    B4
    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include faccum.4th b4 f. cr bye"
    0
    Goodbye.

    Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr
    bye':

7_847_499_582 cycles:u
26_206_205_780 instructions:u # 3.34 insn per cycle
67_785 L1-dcache-load-misses:u
65_391 L1-icache-load-misses:u
38_308 branch-misses:u

2.014078890 seconds time elapsed

2.010013000 seconds user
0.000999000 seconds sys

    B5
    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include faccum.4th b5 f. cr bye"
    0
    Goodbye.

    Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr
    bye':

5_314_718_609 cycles:u
18_906_206_178 instructions:u # 3.56 insn per cycle
64_150 L1-dcache-load-misses:u
44_818 L1-icache-load-misses:u
29_941 branch-misses:u

1.372367863 seconds time elapsed

1.367289000 seconds user
0.002989000 seconds sys


The efficiency difference is due entirely to the number of
instructions executed for B4 and B5.

    --
    KM

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to FFmike on Sun Aug 4 13:24:42 2024
    In article <c2588b8c811fd3ae75d3976c3a927fc3@www.novabbs.com>,
    FFmike <oh2aun@gmail.com> wrote:
>FlashForth.

>This post inspired me to remove all the DOLIT DODOES DOCREATE and DOUSER
>stuff and to instead compile an inline code literal and a jump to the
>common class code.

>CREATE was modified to compile one inline literal, a RETURN and a NOP;
>the literal points to the data area. Later the RETURN and NOP are
>replaced by a JUMP to the code after DOES> or in USER.
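If I understand the description correctly, the scheme is (sketch
only; names illustrative):

    after CREATE foo:        after DOES> patches foo:
      LIT  <data-addr>         LIT  <data-addr>
      RETURN                   JUMP <code-after-DOES>
      NOP                      (RETURN and NOP overwritten)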

Afterwards a second DOES> is required to change the behaviour.
Is that possible with this setup?
<SNIP>

>The PIC18FX has a nice feature: there is an instruction to push a
>literal to the parameter stack.
Mucho spectacular. ;-)
    <SNIP>


>Thanks, Mike

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)