• LOOP (was: OOS approach revisited)

    From Anton Ertl@21:1/5 to minforth on Sat Jun 28 10:23:51 2025
    minforth <minforth@gmx.net> writes:
    Most CPUs have operators for register-based count-down loops
    that are blazingly fast.

    Which "operators" do you have in mind, and what do you mean with
    "blazingly fast".

    Anyway, we have discussed this repeatedly, e.g., in <2022Feb13.231208@mips.complang.tuwien.ac.at> I wrote in reply to your
    posting <f4b89e0b-2ded-4b18-8dc1-bba6dcda47bbn@googlegroups.com>, and
    cited earlier discussions in the topic.

    |"minf...@arcor.de" <minforth@arcor.de> writes:
    [...]
    F.ex. match NEXT efficiently to x_86 processor LOOP instruction (counter in=
    _CX register)
    and you'll happily count down from 5 to 1.
    |
    |Yes, but why would one do this? As we have established in an earlier |discussion (see below), the LOOP instruction is typically not faster
    |than a sequence of simpler instructions:
    |
    |<2018Jun6.184616@mips.complang.tuwien.ac.at>:
    ||minforth@arcor.de writes:
    FOR..NEXT matches easily with the x86 LOOP instruction and ECX as counter. ||>Should do speedy enough. ;-)
    ||
    ||Have you measured it? I have
    ||<2017Mar14.183125@mips.complang.tuwien.ac.at> ||<2017Mar15.141411@mips.complang.tuwien.ac.at> and compared the
    ||following loops:
    ||
    ||.L5: .L5:
    || subq $1, %rax loop .L5
    || jne .L5
    ||
    ||I found that for these loops Sandy Bridge, Haswell, and Skylake take
    ||~4 cycles per iteration using LOOP, and 1-2 cycles per iteration when
    ||using jne.
    |
    |<2018Jun7.141731@mips.complang.tuwien.ac.at>:
    ||cycles for 1000 iterations
    || K10 Excavator Zen
    ||Phenom II Athlon X4 845 Ryzen 1600X
    || 3021 1314 1051 loop
    || 2020 1484 1051 sub; jne
    || 2026 1489 1053 add; cmp; jne
    |
    |There is no performance advantage on modern AMD and Intel CPUs for the |instruction LOOP over a good implementation of the Forth word LOOP (as
    |in the third example).

    If they can be used within Forth-based loop constructs
    I would expect a greater speed increase than what you measured.

    You obviously ignore repeated refutations of your claims of superior performance for LOOP-instruction-based counted loops. Maybe you
    should implement and measure such a counted loop yourself and compare
    it to the LOOP word on SwiftForth and VFX Forth.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to Anton Ertl on Sat Jun 28 14:26:08 2025
    In article <2025Jun28.122351@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    minforth <minforth@gmx.net> writes:
    Most CPUs have operators for register-based count-down loops
    that are blazingly fast.

    Which "operators" do you have in mind, and what do you mean with
    "blazingly fast".

    Anyway, we have discussed this repeatedly, e.g., in ><2022Feb13.231208@mips.complang.tuwien.ac.at> I wrote in reply to your >posting <f4b89e0b-2ded-4b18-8dc1-bba6dcda47bbn@googlegroups.com>, and
    cited earlier discussions in the topic.

    |"minf...@arcor.de" <minforth@arcor.de> writes:
    [...]
    F.ex. match NEXT efficiently to x_86 processor LOOP instruction (counter in= >|> _CX register)
    and you'll happily count down from 5 to 1.
    |
    |Yes, but why would one do this? As we have established in an earlier >|discussion (see below), the LOOP instruction is typically not faster
    |than a sequence of simpler instructions:
    |
    |<2018Jun6.184616@mips.complang.tuwien.ac.at>:
    ||minforth@arcor.de writes:
    FOR..NEXT matches easily with the x86 LOOP instruction and ECX as counter. >||>Should do speedy enough. ;-)
    ||
    ||Have you measured it? I have >||<2017Mar14.183125@mips.complang.tuwien.ac.at> >||<2017Mar15.141411@mips.complang.tuwien.ac.at> and compared the
    ||following loops:
    ||
    ||.L5: .L5:
    || subq $1, %rax loop .L5
    || jne .L5
    ||
    ||I found that for these loops Sandy Bridge, Haswell, and Skylake take
    ||~4 cycles per iteration using LOOP, and 1-2 cycles per iteration when >||using jne.
    |
    |<2018Jun7.141731@mips.complang.tuwien.ac.at>:
    ||cycles for 1000 iterations
    || K10 Excavator Zen
    ||Phenom II Athlon X4 845 Ryzen 1600X
    || 3021 1314 1051 loop
    || 2020 1484 1051 sub; jne
    || 2026 1489 1053 add; cmp; jne
    |
    |There is no performance advantage on modern AMD and Intel CPUs for the >|instruction LOOP over a good implementation of the Forth word LOOP (as
    |in the third example).

    If they can be used within Forth-based loop constructs
    I would expect a greater speed increase than what you measured.

    You obviously ignore repeated refutations of your claims of superior >performance for LOOP-instruction-based counted loops. Maybe you
    should implement and measure such a counted loop yourself and compare
    it to the LOOP word on SwiftForth and VFX Forth.

    It is good to remember how the severe CISC instruction came about.
    Early Intel processors 8086 were severely cramped in memory space.
    While the 16 bit 68000 has a generous 32 bit space, they were restricted
    to 16 bit. They added a segmented memory and 1 byte version for two byte instructions until finally the 386 arrived.
    Compiler writers were not inclined to gain (or even loss) a moderate speed advantage to save a few bytes.
    There is more to it.
    The first thing e.g. I had to do in my optimiser to expand 1 byte
    instructions to not explode the possibilities I had to consider.


    - anton

    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Jun 28 16:04:07 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    You obviously ignore repeated refutations of your claims of superior >performance for LOOP-instruction-based counted loops. Maybe you
    should implement and measure such a counted loop yourself and compare
    it to the LOOP word on SwiftForth and VFX Forth.

    Actually it is possible to implement the LOOP word such that it uses
    the LOOP instruction by modifying DO/?DO, I, and J to go along with
    them.

    E.g., VFX generates the following code for LOOP

    ( 0050A25D 49FFC6 ) INC R14
    ( 0050A260 49FFC7 ) INC R15
    ( 0050A263 71F8 ) JNO 0050A25D

    Here R14 contains the index (I), and R15 contains
    (index-limit) xor 2^63. This value is conditioned such that INC
    will set the overflow flag when index reaches limit.

    The benefit of the overflow-flag approach is only relevant for +LOOP.
    We will ignore that for now, but return to it later.

    Instead, you could keep limit-index in RCX. Then you could implement
    a VFX-style LOOP as follows:

    inc r14
    loop <target>

    The LOOP instruction will decrement RCX until it reaches 0, i.e.,
    until index equals limit.

    Ok, this still needs the additional INC instruction that you may want
    to avoid. So let's look at SwiftForth. Here LOOP compiles to

    4519C5 R14 INC 49FFC6
    4519C8 4519BC JNO 0F81EEFFFFFF

    So here R14 (not R15) contains (index-limit) xor 2^63. We will
    discuss I later.

    You could instead have limit-index in RCX, and then let the LOOP word
    generate

    loop <target>

    Look, Ma, LOOP implemented with LOOP!

    Now what about I? SwiftForth implements I by keeping limit xor 2^63
    in R15. Then I is R15+R14 (and SwiftForth's I is implemented to
    produce that computation).

    For our LOOP-instruction-base LOOP, we could keep limit in, say, R15.
    Then I is R15-RCX.

    Now what about +LOOP ? In the usual case the increment is a constant.
    For a positive increment, you can implement +LOOP as

    add rcx, increment
    jnc <target>

    For a negative increment, you can implement +LOOP as

    add rcx, increment
    jc <target>

    Only in the case of an increment that is unknown at compile time, you
    have to resort to something with jno, maybe

    mov r14, rcx
    add rcx, increment
    btc r14, 63
    add r14, increment
    jno <target>

    You can't have everything:-)

    Anyway, given that LOOP is slower than what SwiftForth uses for the
    LOOP word now on several CPUs and not faster on almost all others,
    nobody interested in performance will go there. But if there ever is
    a performance advantage to using the LOOP instruction, we have this
    option. So no need to resort to FOR ... NEXT even in that case.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)