Forum: >>> Magnum BBS <<<

LOOP (was: OOS approach revisited)

From Anton Ertl@21:1/5 to minforth on Sat Jun 28 10:23:51 2025

minforth <minforth@gmx.net> writes:

Most CPUs have operators for register-based count-down loops
that are blazingly fast.

Which "operators" do you have in mind, and what do you mean with
"blazingly fast".

Anyway, we have discussed this repeatedly, e.g., in <2022Feb13.231208@mips.complang.tuwien.ac.at> I wrote in reply to your
posting <f4b89e0b-2ded-4b18-8dc1-bba6dcda47bbn@googlegroups.com>, and
cited earlier discussions in the topic.

|"minf...@arcor.de" <minforth@arcor.de> writes:
[...]

F.ex. match NEXT efficiently to x_86 processor LOOP instruction (counter in=
_CX register)
and you'll happily count down from 5 to 1.

|
|Yes, but why would one do this? As we have established in an earlier |discussion (see below), the LOOP instruction is typically not faster
|than a sequence of simpler instructions:
|
|<2018Jun6.184616@mips.complang.tuwien.ac.at>:
||minforth@arcor.de writes:

FOR..NEXT matches easily with the x86 LOOP instruction and ECX as counter. ||>Should do speedy enough. ;-)

||
||Have you measured it? I have
||<2017Mar14.183125@mips.complang.tuwien.ac.at> ||<2017Mar15.141411@mips.complang.tuwien.ac.at> and compared the
||following loops:
||
||.L5: .L5:
|| subq $1, %rax loop .L5
|| jne .L5
||
||I found that for these loops Sandy Bridge, Haswell, and Skylake take
||~4 cycles per iteration using LOOP, and 1-2 cycles per iteration when
||using jne.
|
|<2018Jun7.141731@mips.complang.tuwien.ac.at>:
||cycles for 1000 iterations
|| K10 Excavator Zen
||Phenom II Athlon X4 845 Ryzen 1600X
|| 3021 1314 1051 loop
|| 2020 1484 1051 sub; jne
|| 2026 1489 1053 add; cmp; jne
|
|There is no performance advantage on modern AMD and Intel CPUs for the |instruction LOOP over a good implementation of the Forth word LOOP (as
|in the third example).

If they can be used within Forth-based loop constructs
I would expect a greater speed increase than what you measured.

You obviously ignore repeated refutations of your claims of superior performance for LOOP-instruction-based counted loops. Maybe you
should implement and measure such a counted loop yourself and compare
it to the LOOP word on SwiftForth and VFX Forth.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From albert@spenarnc.xs4all.nl@21:1/5 to Anton Ertl on Sat Jun 28 14:26:08 2025

In article <2025Jun28.122351@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

minforth <minforth@gmx.net> writes:

Most CPUs have operators for register-based count-down loops
that are blazingly fast.

Which "operators" do you have in mind, and what do you mean with
"blazingly fast".

Anyway, we have discussed this repeatedly, e.g., in ><2022Feb13.231208@mips.complang.tuwien.ac.at> I wrote in reply to your >posting <f4b89e0b-2ded-4b18-8dc1-bba6dcda47bbn@googlegroups.com>, and
cited earlier discussions in the topic.

|"minf...@arcor.de" <minforth@arcor.de> writes:
[...]

F.ex. match NEXT efficiently to x_86 processor LOOP instruction (counter in= >|> _CX register)
and you'll happily count down from 5 to 1.

|
|Yes, but why would one do this? As we have established in an earlier >|discussion (see below), the LOOP instruction is typically not faster
|than a sequence of simpler instructions:
|
|<2018Jun6.184616@mips.complang.tuwien.ac.at>:
||minforth@arcor.de writes:

FOR..NEXT matches easily with the x86 LOOP instruction and ECX as counter. >||>Should do speedy enough. ;-)

||
||Have you measured it? I have >||<2017Mar14.183125@mips.complang.tuwien.ac.at> >||<2017Mar15.141411@mips.complang.tuwien.ac.at> and compared the
||following loops:
||
||.L5: .L5:
|| subq $1, %rax loop .L5
|| jne .L5
||
||I found that for these loops Sandy Bridge, Haswell, and Skylake take
||~4 cycles per iteration using LOOP, and 1-2 cycles per iteration when >||using jne.
|
|<2018Jun7.141731@mips.complang.tuwien.ac.at>:
||cycles for 1000 iterations
|| K10 Excavator Zen
||Phenom II Athlon X4 845 Ryzen 1600X
|| 3021 1314 1051 loop
|| 2020 1484 1051 sub; jne
|| 2026 1489 1053 add; cmp; jne
|
|There is no performance advantage on modern AMD and Intel CPUs for the >|instruction LOOP over a good implementation of the Forth word LOOP (as
|in the third example).

If they can be used within Forth-based loop constructs
I would expect a greater speed increase than what you measured.

You obviously ignore repeated refutations of your claims of superior >performance for LOOP-instruction-based counted loops. Maybe you
should implement and measure such a counted loop yourself and compare
it to the LOOP word on SwiftForth and VFX Forth.

It is good to remember how the severe CISC instruction came about.
Early Intel processors 8086 were severely cramped in memory space.
While the 16 bit 68000 has a generous 32 bit space, they were restricted
to 16 bit. They added a segmented memory and 1 byte version for two byte instructions until finally the 386 arrived.
Compiler writers were not inclined to gain (or even loss) a moderate speed advantage to save a few bytes.
There is more to it.
The first thing e.g. I had to do in my optimiser to expand 1 byte
instructions to not explode the possibilities I had to consider.

- anton

Groetjes Albert
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Anton Ertl on Sat Jun 28 16:04:07 2025

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

You obviously ignore repeated refutations of your claims of superior >performance for LOOP-instruction-based counted loops. Maybe you
should implement and measure such a counted loop yourself and compare
it to the LOOP word on SwiftForth and VFX Forth.

Actually it is possible to implement the LOOP word such that it uses
the LOOP instruction by modifying DO/?DO, I, and J to go along with
them.

E.g., VFX generates the following code for LOOP

( 0050A25D 49FFC6 ) INC R14
( 0050A260 49FFC7 ) INC R15
( 0050A263 71F8 ) JNO 0050A25D

Here R14 contains the index (I), and R15 contains
(index-limit) xor 2^63. This value is conditioned such that INC
will set the overflow flag when index reaches limit.

The benefit of the overflow-flag approach is only relevant for +LOOP.
We will ignore that for now, but return to it later.

Instead, you could keep limit-index in RCX. Then you could implement
a VFX-style LOOP as follows:

inc r14
loop <target>

The LOOP instruction will decrement RCX until it reaches 0, i.e.,
until index equals limit.

Ok, this still needs the additional INC instruction that you may want
to avoid. So let's look at SwiftForth. Here LOOP compiles to

4519C5 R14 INC 49FFC6
4519C8 4519BC JNO 0F81EEFFFFFF

So here R14 (not R15) contains (index-limit) xor 2^63. We will
discuss I later.

You could instead have limit-index in RCX, and then let the LOOP word
generate

loop <target>

Look, Ma, LOOP implemented with LOOP!

Now what about I? SwiftForth implements I by keeping limit xor 2^63
in R15. Then I is R15+R14 (and SwiftForth's I is implemented to
produce that computation).

For our LOOP-instruction-base LOOP, we could keep limit in, say, R15.
Then I is R15-RCX.

Now what about +LOOP ? In the usual case the increment is a constant.
For a positive increment, you can implement +LOOP as

add rcx, increment
jnc <target>

For a negative increment, you can implement +LOOP as

add rcx, increment
jc <target>

Only in the case of an increment that is unknown at compile time, you
have to resort to something with jno, maybe

mov r14, rcx
add rcx, increment
btc r14, 63
add r14, increment
jno <target>

You can't have everything:-)

Anyway, given that LOOP is slower than what SwiftForth uses for the
LOOP word now on several CPUs and not faster on almost all others,
nobody interested in performance will go there. But if there ever is
a performance advantage to using the LOOP instruction, we have this
option. So no need to resort to FOR ... NEXT even in that case.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Bob Worm
  Tue Jul 1 14:38:17 2025
  from Wales, Uk via Telnet
- Plume
  Tue Jul 1 13:20:01 2025
  from Uk via SSH
- Gwylbert
  Tue Jul 1 10:35:03 2025
  from Sydney, Nsw via Telnet
- Centurion
  Tue Jul 1 07:44:07 2025
  from Berea, Ohio via Telnet
- Plume
  Tue Jul 1 03:59:33 2025
  from Uk via Raw
- Plume
  Tue Jul 1 03:36:58 2025
  from Uk via Raw
- Bob Worm
  Mon Jun 30 21:46:33 2025
  from Wales, Uk via Telnet
- Bob Worm
  Mon Jun 30 16:56:01 2025
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	502
Nodes:	16 (2 / 14)
Uptime:	216:41:22
Calls:	9,878
Calls today:	6
Files:	13,791
Messages:	6,205,686

LOOP (was: OOS approach revisited)

Who's Online

Recent Visitors

System Info