• Implementing DOES>: How not to do it (and why not) and how to do it

    From Anton Ertl@21:1/5 to All on Thu Jul 11 14:06:02 2024
    At least one Forth system implements DOES> inefficiently, but I
    suspect that it's not alone in that. I have reported the issue to the
    system vendor, and hopefully they will fix it. Here I discuss the
    issue so that possibly other system implementors can avoid these
    pitfalls. I do not name the system here to protect the guilty, but
    make no additional effort at anonymizing it.

Here's a microbenchmark that puts a spotlight on the issue. I
started investigating it because the system performed badly on an
application benchmark that calls DOES>-defined words a lot, so this
microbenchmark is not just contrived.

    50000000 constant iterations

    : d1 create 0 , does> ;
    : d1-part2 ;

    d1 x1
    d1 y1

    : b1 iterations 0 do x1 dup ! y1 dup ! loop ;
    : c1 iterations 0 do x1 drop y1 drop loop ;

: b3
  iterations 0 do
    [ ' x1 >body ] literal d1-part2 dup !
    [ ' y1 >body ] literal d1-part2 dup !
  loop ;
: c3
  iterations 0 do
    [ ' x1 >body ] literal d1-part2 drop
    [ ' y1 >body ] literal d1-part2 drop
  loop ;

B1 and C1 use the DOES>-defined words in the way the system
implements them, while B3 and C3 use them in the way the system
should implement them (when COMPILE,ing the xts of X1 and Y1): put
the address of the body on the data stack as a literal, then call
the code behind DOES> (or, in this case, a colon definition that
contains the same code).
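As a minimal sketch of that strategy (hedged: DOES-CODE is a
hypothetical system-internal accessor from the xt of a DOES>-defined
word to an xt for the code behind its DOES>; no actual system's
internals are quoted here):

    : does-compile, ( xt -- )
      dup >body postpone literal \ compile the body address as a literal
      does-code compile, ;       \ compile a call to the DOES> code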

    Let's see the performance (all numbers from a Tiger Lake), first for
    C3/C1:

    perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses sf64 "include does-microbench.fs c3 bye"


    Performance counter stats for 'sf64 include does-microbench.fs c3 bye':

    402963130 cycles:u
    804105655 instructions:u # 2.00 insn per cycle
    83766 L1-dcache-load-misses
    1750283 L1-icache-load-misses
    36403 branch-misses

    0.114868603 seconds time elapsed

    The code for the X1 part of the loop here is:

    44EB26 -8 [RBP] RBP LEA 488D6DF8
    44EB2A RBX 0 [RBP] MOV 48895D00
    44EB2E 44E970 # EBX MOV BB70E94400
    44EB33 44E94D ( d1-part2 ) CALL E815FEFFFF
    44EB38 0 [RBP] RBX MOV 488B5D00
    44EB3C 8 [RBP] RBP LEA 488D6D08

and D1-PART2 is:

    44E94D RET C3

    This loop takes 8 cycles per iteration (about half of that for each
    simulated DOES>-defined word). Now for C1:


    3502930384 cycles:u
    903847649 instructions:u # 0.26 insn per cycle
    93579 L1-dcache-load-misses
    1813286 L1-icache-load-misses
    100033846 branch-misses

    0.846190766 seconds time elapsed

    This loop takes 70 cycles per iteration (i.e., almost 9 times slower)
    and has one branch misprediction per DOES>-defined word. Let's see
    why:

    The code in the loop for the X1 part is:

    44EA42 44E96B ( x1 ) CALL E824FFFFFF
    44EA47 0 [RBP] RBX MOV 488B5D00
    44EA4B 8 [RBP] RBP LEA 488D6D08

    It calls X1:

    44E96B 44E927 ( d1 +1C ) CALL E8B7FFFFFF

    which in turn calls this code:

    44E927 -8 [RBP] RBP LEA 488D6DF8
    44E92B RBX 0 [RBP] MOV 48895D00
    44E92F RBX POP 5B
    44E930 RET C3

Here the return address of the call at 44E96B is popped and used as
the body address of X1. However, the hardware return stack predicts
that the following RET returns to that return address; this is a
misprediction, because the RET actually returns to the return
address of the outer call (44EA47). You can see here that
mispredictions are expensive.
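Spelled out with the addresses from the listings above (the return
stack in question is the hardware predictor's, not Forth's):

    CALL 44E96B    pushes 44EA47 onto the hardware return stack
    CALL 44E927    pushes 44E970 (the body address) onto it
    RBX POP        pops 44E970 from the software stack only
    RET            predicted target: 44E970; actual target: 44EA47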

    Let's turn to B1. The code for the X1 part of the loop is:

    44E9CE 44E96B ( x1 ) CALL E898FFFFFF
    44E9D3 -8 [RBP] RBP LEA 488D6DF8
    44E9D7 RBX 0 [RBP] MOV 48895D00
    44E9DB 0 [RBP] RAX MOV 488B4500
    44E9DF RAX 0 [RBX] MOV 488903
    44E9E2 8 [RBP] RBX MOV 488B5D08
    44E9E6 10 [RBP] RBP LEA 488D6D10

    The results are:

    44758764805 cycles:u
    1303768694 instructions:u # 0.03 insn per cycle
    150154275 L1-dcache-load-misses
    202142121 L1-icache-load-misses
    100051173 branch-misses

    10.699859443 seconds time elapsed

    10.696626000 seconds user
    0.004000000 seconds sys

So in addition to the branch mispredictions that we also saw in C1,
we see 3 D-cache misses and 4 I-cache misses per iteration,
resulting in 895 cycles per iteration. Part of this cache ping-pong
occurs because the call at 44E96B is in the same cache line as the
stored-to data at 44E970. The store wants the cache line in the
D-cache, while the call to 44E96B wants it in the I-cache. This is
an issue I first pointed out in 1995, and Forth systems still have
not fixed it completely 29 years later.

    Let's see if the B3 variant escapes this problem:

    20590473606 cycles:u
    1204106872 instructions:u # 0.06 insn per cycle
    50145367 L1-dcache-load-misses
    101982668 L1-icache-load-misses
    49127 branch-misses

    4.932825183 seconds time elapsed

    4.926391000 seconds user
    0.003998000 seconds sys

It's better, but there is still one D-cache miss and 2 I-cache
misses per iteration. It seems that the distance between the
D1-PART2 code and the X1 data is not enough to completely avoid the
issue (but this is just a guess).

In any case, having the call right before the data, and executing it
on every call to a DOES>-defined word, is responsible for 2 I-cache
misses and 2 D-cache misses per iteration, so it should be avoided
by generating code like that in C3 and B3.

For dealing with the remaining cache-consistency problems, most
Forth systems have chosen to put padding between code and data,
increasing the padding in response to slowness reports. Another
alternative is to put the code in a memory region separate from the
data.
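A minimal sketch of the padding approach, assuming code and data
share one dictionary space and a 64-byte cache line (both
assumptions mine, not from any particular system):

    64 constant /cacheline  \ assumed line size on current x86 CPUs
    : cacheline-align ( -- ) \ pad data space to the next line boundary
      here /cacheline mod ?dup if /cacheline swap - allot then ;
    \ a defining word would execute CACHELINE-ALIGN between laying
    \ down the code and ALLOTting the data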

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2024: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Jul 13 15:31:38 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>At least one Forth system implements DOES> inefficiently, but I
>suspect that it's not alone in that.

    And indeed, a second system has the same problem; it shows up more
    rarely, because normally this system inlines does>-defined words, but
    when it does not, it performs badly.

Here's a microbenchmark where the second system does not inline the
DOES>-defined word:

    50000000 constant iterations
: faccum
  create 0e f,
  does> ( r1 -- r2 )
    dup f@ f+ fdup f! ;

: faccum-part2 ( r1 addr -- r2 )
  dup f@ f+ fdup f! ;

faccum x4 \ 2e x4 fdrop
faccum y4 \ -4e y4 fdrop

: b4 0e iterations 0 do x4 y4 loop ;
: b5 0e iterations 0 do
    [ ' x4 >body ] literal faccum-part2
    [ ' y4 >body ] literal faccum-part2
  loop ;


First, let's see what the Forth systems do by themselves (the B4
microbenchmark). Numbers are from a Skylake; I have replaced the
names of the two Forth systems with inefficient DOES>
implementations by A and B.

    [~/forth:150659] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b4 f. cr bye"
    0.

    Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b4 f. cr bye':

    948_628_907 cycles:u
    3_695_796_028 instructions:u # 3.90 insn per cycle
    1_154_670 L1-dcache-load-misses
    198_627 L1-icache-load-misses
    306_507 branch-misses

    0.245984689 seconds time elapsed

    0.244894000 seconds user
    0.000000000 seconds sys


    [~/forth:150660] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b4 f. cr bye"
    0.00000000


    Performance counter stats for 'A include does-microbench.fs b4 f. cr bye':

    38_769_505_700 cycles:u
    1_704_476_397 instructions:u # 0.04 insn per cycle
    178_288_238 L1-dcache-load-misses
    250_454_606 L1-icache-load-misses
    100_090_310 branch-misses

    9.719803719 seconds time elapsed

    9.715343000 seconds user
    0.000000000 seconds sys


    [~/forth:150661] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b4 f. cr bye"

    Including does-microbench.fs0.


    Performance counter stats for 'B include does-microbench.fs b4 f. cr bye':

    39_200_313_445 cycles:u
    1_413_936_888 instructions:u # 0.04 insn per cycle
    150_445_572 L1-dcache-load-misses
    209_127_540 L1-icache-load-misses
    100_128_427 branch-misses

    9.822342252 seconds time elapsed

    9.817016000 seconds user
    0.000000000 seconds sys

So both A and B fall into the cache-ping-pong and the return-stack
misprediction pitfalls in this case, resulting in a factor-40
slowdown compared to Gforth.

    Let's see how it works if we use the code I suggest (simulated in B5):

    [~/forth:150662] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b5 f. cr bye"
    0.

    Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b5 f. cr bye':

    943_277_009 cycles:u
    3_295_795_332 instructions:u # 3.49 insn per cycle
    1_147_107 L1-dcache-load-misses
    198_364 L1-icache-load-misses
    295_186 branch-misses

    0.247765182 seconds time elapsed

    0.242645000 seconds user
    0.004044000 seconds sys


    [~/forth:150663] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b5 f. cr bye"
    0.00000000


    Performance counter stats for 'A include does-microbench.fs b5 f. cr bye':

    23_587_381_659 cycles:u
    1_604_475_561 instructions:u # 0.07 insn per cycle
    100_111_296 L1-dcache-load-misses
    100_502_420 L1-icache-load-misses
    77_126 branch-misses

    6.055177414 seconds time elapsed

    6.055288000 seconds user
    0.000000000 seconds sys


    [~/forth:150664] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b5 f. cr bye"

    Including does-microbench.fs0.

    Performance counter stats for 'B include does-microbench.fs b5 f. cr bye':

    949_044_323 cycles:u
    1_313_933_897 instructions:u # 1.38 insn per cycle
    246_252 L1-dcache-load-misses
    105_517 L1-icache-load-misses
    61_449 branch-misses

    0.239750023 seconds time elapsed

    0.239811000 seconds user
    0.000000000 seconds sys

    This solves both problems for B, but A still suffers from
    cache ping-pong; I suspect that this is because there is not enough
    distance between the modified data and FACCUM-PART2 (or, less likely,
    not enough distance between the modified data and the loop in B5).

In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2024: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to Anton Ertl on Sun Jul 14 11:02:52 2024
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

>In any case, if you are a system implementor, you may want to check
>your DOES> implementation with a microbenchmark that stores into the
>does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Jul 14 08:00:17 2024
    On 7/14/24 07:20, Krishna Myneni wrote:
    On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    In any case, if you are a system implementor, you may want to check
    your DOES> implementation with a microbenchmark that stores into the
    does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Let's check. In kForth-64, an indirect threaded code system,

    .s
    <empty>
     ok
    f.s
    fs: <empty>
     ok
    ms@ b4 ms@ swap - .
    4274  ok
    ms@ b5 ms@ swap - .
    3648  ok

So b5 appears to be more efficient than b4 (the version with DOES>).


This discrepancy between b4 and b5 is due to the fact that, in
kForth, DOES> does not inline the code into the created word. So
there is room for improvement in the implementation of DOES> in
kForth. There are other pressing issues to deal with, such as making
the User's Manual more useful (the most important item) and adding
other standardized words, but improving DOES> is low-hanging fruit.

    --
    Krishna


    === begin code ===
// DOES> ( -- )
//
// Forth 2012
int CPP_does()
{
    // Allocate new opcode array
    byte* p = new byte[2*WSIZE+4];

    // Insert pfa of last word in dictionary
    p[0] = OP_ADDR;
    WordListEntry* pWord = *(pCompilationWL->end() - 1);
    *((long int*)(p+1)) = (long int) pWord->Pfa;

    // Insert current instruction ptr
    p[WSIZE+1] = OP_ADDR;
    *((long int*)(p+WSIZE+2)) = (long int)(GlobalIp + 1);

    p[2*WSIZE+2] = OP_EXECUTE_BC;
    p[2*WSIZE+3] = OP_RET;

    pWord->Cfa = (void*) p;
    pWord->WordCode = OP_DEFINITION;

    L_ret();
    return 0;
}
=== end code ===

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to albert@spenarnc.xs4all.nl on Sun Jul 14 07:20:01 2024
    On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    In any case, if you are a system implementor, you may want to check
    your DOES> implementation with a microbenchmark that stores into the
    does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Let's check. In kForth-64, an indirect threaded code system,

    .s
    <empty>
    ok
    f.s
    fs: <empty>
    ok
    ms@ b4 ms@ swap - .
    4274 ok
    ms@ b5 ms@ swap - .
    3648 ok

So b5 appears to be more efficient than b4 (the version with DOES>).

    --
    Krishna

    === begin code ===
    50000000 constant iterations

: faccum create 1 floats allot? 0.0e f!
  does> dup f@ f+ fdup f! ;

: faccum-part2 ( F: r1 -- r2 ) ( a -- )
  dup f@ f+ fdup f! ;

faccum x4  2.0e x4 fdrop
faccum y4 -4.0e y4 fdrop

: b4 0.0e iterations 0 do x4 y4 loop ;
: b5 0.0e iterations 0 do
    [ ' x4 >body ] literal faccum-part2
    [ ' y4 >body ] literal faccum-part2
  loop ;
    === end code ===

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to albert@spenarnc.xs4all.nl on Sun Jul 14 13:00:15 2024
    albert@spenarnc.xs4all.nl writes:
>In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
><SNIP>
>>In any case, if you are a system implementor, you may want to check
>>your DOES> implementation with a microbenchmark that stores into the
>>does-defined word in a case where that word is not inlined.
>
>Is that equally valid for indirect threaded code?
>In indirect threaded code the instruction and data cache
>are more separated, e.g. in a simple Forth all the low level
>code could fit in the I-cache, if I'm not mistaken.

It depends on how the system is implemented. In fig-Forth, the code
for the primitives is mixed with the data. In such a system, if you
have memory close to a primitive that is frequently written to, you
will see cache ping-pong. If the native code of the primitives is in
a separate area with enough distance to the data, there should be no
such issues for primitives.

    Also, there are at least two ways to implement DOES>-defined words:

    1) In addition to the code address of DODOES, have an extra cell
    pointing to the code after DOES> so that DODOES can find it. This
    is used in fig-Forth with <BUILDS (which reserves the extra cell in
    fig-Forth, whereas CREATE does not and cannot be used with DOES> in
    fig-Forth). In Gforth all words including CREATEd words have a
    two-cell code field since the beginning, and indirect threaded
    variants of Gforth (including the hybrid direct/indirect threaded
    approach we have used on all architectures since 2001) have used
    this. With the new header the details are a bit different, but the
    principle is the same.

2) The F83 way of implementing CREATE ... DOES> (others probably
used this way earlier, and it probably led to the introduction of
CREATE...DOES>, but F83 is a system for which I can find
documentation): There is only a single cell at the code field, and
it points to a native-code CALL DODOES that sits between the
(DOES>) and the threaded code for the source code behind the
DOES>. DODOES then pops the return address of the call, which is
the address of the threaded code to be called by DODOES, while the
data address is in W, as usual. Gforth used a similar approach for
direct-threaded implementations (until 2001), but IIRC without the
calling and popping. It seems that the DOES> implementations of
systems A and B were inspired by this approach.

Inside F83 <https://www.forth.org/OffeteStore/1003_InsideF83.pdf>
discusses this starting in the section "High Level Inner
Interpreter" on page 45, but IMO Ting uses confusing terminology
here: what I call the run-time routines of defining words, Ting
calls the "inner interpreter" (for me, the "inner interpreter" is
NEXT).

For the first way, neither cache ping-pong nor a particularly high
level of branch mispredictions is expected.

For the second way, I expect cache ping-pong if there is written
data close to the DOES>. I don't expect a particularly high rate of
branch mispredictions in indirect threaded code: while the hardware
return stack will get out of sync because of the call-pop usage,
indirect threaded code does not have returns that would mispredict
because of this lack of synchronization.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2024: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Jul 14 13:32:19 2024
    On 7/14/24 07:20, Krishna Myneni wrote:
    On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    In any case, if you are a system implementor, you may want to check
    your DOES> implementation with a microbenchmark that stores into the
    does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Let's check. In kForth-64, an indirect threaded code system,

    .s
    <empty>
     ok
    f.s
    fs: <empty>
     ok
    ms@ b4 ms@ swap - .
    4274  ok
    ms@ b5 ms@ swap - .
    3648  ok

So b5 appears to be more efficient than b4 (the version with DOES>).

    --
    Krishna

    === begin code ===
    50000000 constant iterations

    : faccum  create 1 floats allot? 0.0e f!
        does> dup f@ f+ fdup f! ;

    : faccum-part2 ( F: r1 -- r2 ) ( a -- )
        dup f@ f+ fdup f! ;

    faccum x4  2.0e x4 fdrop
    faccum y4 -4.0e y4 fdrop

    : b4 0.0e iterations 0 do x4 y4 loop ;
    : b5 0.0e iterations 0 do
        [ ' x4 >body ] literal faccum-part2
        [ ' y4 >body ] literal faccum-part2
      loop ;
    === end code ===





Using perf to measure the B4 and B5 microbenchmarks:

    B4

    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include does-microbench.4th b4 f. cr bye"
    -inf
    Goodbye.

    Performance counter stats for 'kforth64 -e include does-microbench.4th
    b4 f. cr bye':

14_381_951_937 cycles:u
26_206_810_946 instructions:u # 1.82 insn per cycle
58_563 L1-dcache-load-misses:u
14_742 L1-icache-load-misses:u
100_122_231 branch-misses:u

4.501011307 seconds time elapsed

4.477172000 seconds user
0.003967000 seconds sys


    B5

    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include does-microbench.4th b5 f. cr bye"
    -inf
    Goodbye.

    Performance counter stats for 'kforth64 -e include does-microbench.4th
    b5 f. cr bye':

11_529_644_734 cycles:u
18_906_809_683 instructions:u # 1.64 insn per cycle
59_605 L1-dcache-load-misses:u
21_531 L1-icache-load-misses:u
100_109_360 branch-misses:u

3.616353010 seconds time elapsed

3.600206000 seconds user
0.004639000 seconds sys


It appears that the cache misses are fairly low for both b4 and b5,
but the branch misses are very high on my system.

    --
    Krishna

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Krishna Myneni@21:1/5 to Krishna Myneni on Sun Jul 14 14:28:33 2024
    On 7/14/24 13:32, Krishna Myneni wrote:
    On 7/14/24 07:20, Krishna Myneni wrote:
    On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
    In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

    In any case, if you are a system implementor, you may want to check
    your DOES> implementation with a microbenchmark that stores into the
    does-defined word in a case where that word is not inlined.

    Is that equally valid for indirect threaded code?
    In indirect threaded code the instruction and data cache
    are more separated, e.g. in a simple Forth all the low level
    code could fit in the I-cache, if I'm not mistaken.



    Let's check. In kForth-64, an indirect threaded code system,

    .s
    <empty>
      ok
    f.s
    fs: <empty>
      ok
    ms@ b4 ms@ swap - .
    4274  ok
    ms@ b5 ms@ swap - .
    3648  ok

So b5 appears to be more efficient than b4 (the version with DOES>).

    --
    Krishna

    === begin code ===
    50000000 constant iterations

    : faccum  create 1 floats allot? 0.0e f!
         does> dup f@ f+ fdup f! ;

    : faccum-part2 ( F: r1 -- r2 ) ( a -- )
         dup f@ f+ fdup f! ;

    faccum x4  2.0e x4 fdrop
    faccum y4 -4.0e y4 fdrop

    : b4 0.0e iterations 0 do x4 y4 loop ;
    : b5 0.0e iterations 0 do
         [ ' x4 >body ] literal faccum-part2
         [ ' y4 >body ] literal faccum-part2
       loop ;
    === end code ===





    Using perf to obtain the microbenchmarks for B4 and B5,

    B4

    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include does-microbench.4th b4 f. cr bye"
    -inf
    Goodbye.

     Performance counter stats for 'kforth64 -e include does-microbench.4th
    b4 f. cr bye':

           14_381_951_937      cycles:u
           26_206_810_946      instructions:u     #    1.82  insn per cycle
                 58_563        L1-dcache-load-misses:u
                 14_742        L1-icache-load-misses:u
             100_122_231       branch-misses:u

           4.501011307 seconds time elapsed

           4.477172000 seconds user
           0.003967000 seconds sys


    B5

    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include does-microbench.4th b5 f. cr bye"
    -inf
    Goodbye.

     Performance counter stats for 'kforth64 -e include does-microbench.4th
    b5 f. cr bye':

           11_529_644_734      cycles:u
           18_906_809_683      instructions:u      #    1.64  insn per cycle
                 59_605        L1-dcache-load-misses:u
                 21_531        L1-icache-load-misses:u
             100_109_360       branch-misses:u

           3.616353010 seconds time elapsed

           3.600206000 seconds user
           0.004639000 seconds sys


    It appears that the cache misses are fairly small for both b4 and b5,
    but the branch misses are very high in my system.



The prior microbenchmarks were run on an old AMD A10-9600P @ 2.95
GHz. On a newer system with an Intel Core i5-8400 @ 2.8 GHz, there
were very few branch misses.

    B4
    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include faccum.4th b4 f. cr bye"
    0
    Goodbye.

    Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr
    bye':

7_847_499_582 cycles:u
26_206_205_780 instructions:u # 3.34 insn per cycle
67_785 L1-dcache-load-misses:u
65_391 L1-icache-load-misses:u
38_308 branch-misses:u

2.014078890 seconds time elapsed

2.010013000 seconds user
0.000999000 seconds sys

    B5
    $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
    -e "include faccum.4th b5 f. cr bye"
    0
    Goodbye.

    Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr
    bye':

5_314_718_609 cycles:u
18_906_206_178 instructions:u # 3.56 insn per cycle
64_150 L1-dcache-load-misses:u
44_818 L1-icache-load-misses:u
29_941 branch-misses:u

1.372367863 seconds time elapsed

1.367289000 seconds user
0.002989000 seconds sys


The efficiency difference is due entirely to the number of
instructions executed for B4 and B5.

    --
    KM

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to FFmike on Sun Aug 4 13:24:42 2024
    In article <c2588b8c811fd3ae75d3976c3a927fc3@www.novabbs.com>,
    FFmike <oh2aun@gmail.com> wrote:
>FlashForth.

>This post inspired me to remove all the DOLIT DODOES DOCREATE and DOUSER
>stuff and to instead compile an inline code literal and a jump to the
>common class code.

>CREATE was modified to compile one inline literal, a RETURN and a NOP;
>the literal points to the data area. Later the RETURN and NOP are
>replaced by a JUMP to the code after DOES> or in USER.
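If I understand the description correctly, the scheme is (sketch
only; names illustrative):

    after CREATE foo:        after DOES> patches foo:
      LIT  <data-addr>         LIT  <data-addr>
      RETURN                   JUMP <code-after-DOES>
      NOP                      (RETURN and NOP overwritten)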

Afterwards a second DOES> is required to change the behaviour.
Is that possible with this setup?
<SNIP>

>The PIC18FX has a nice feature: there is an instruction to push a
>literal to the parameter stack.
Mucho spectacular. ;-)
    <SNIP>


>Thanks, Mike

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)