This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
For reasons I won't go into I've been writing some code to evaluate
memory performance on my AMD Ryzen 5 3400G.
It says in the stuff I've found that each core has an 8-way set
associative L1 data cache of 128k (and an L1 instruction cache); an L2
cache of 512k, also set associative; and there's an L3 cache of 4MB.
To measure the performance I have three nested loops.
The innermost one goes around a loop incrementing a series of 64 bit
memory locations. The length of the series is set by the outermost loop.
The middle one repeats the innermost loop so that the number of memory accesses is constant regardless of the series length.
The outermost one sets the series length. It starts at 1, and doubles it
each time.
I _thought_ what would happen is that as I increase the length of the
series after a while the data won't fit in the cache, and I'll see a
sudden slowdown.
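For concreteness, here is a minimal sketch of the loop structure described above - my reconstruction, not the posted code, and the total access count and size range are guesses. The volatile pointer is there so the optimiser cannot collapse the middle loop into a single addition:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t total = std::size_t(1) << 27;   // accesses per size (assumed)
    for (std::size_t len = 1; len <= (std::size_t(1) << 24); len *= 2) {  // outer: series length
        std::vector<std::uint64_t> buf(len, 0);
        volatile std::uint64_t* p = buf.data();       // force real loads/stores
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t rep = 0; rep < total / len; ++rep)  // middle: constant total accesses
            for (std::size_t i = 0; i < len; ++i)            // inner: increment the series
                p[i] = p[i] + 1;
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        std::cout << "Size " << len << " gave "
                  << double(total) * sizeof(std::uint64_t) / s << " bytes/second.\n";
    }
}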
What I actually see is:
With a series length of 56 to 128 bytes I get the highest speed.
With a series length of 500B to 1.5MB, I get a consistent speed of about
2/3 the highest speed.
Once the series length exceeds 1.5MB the speed drops, and is consistent
from then on. That I can see is main memory speed, and is about 40% of
the highest.
OK so far.
But...
Series length 8B is about the same as the 56 to 128 speed. Series length
16B is a bit less. Series length 32 is a lot less. Not as slow as main memory, but not much more than half the peak speed. My next step up is
the peak speed. Series length 144 to 448 is slower still - slower in
fact than the main memory speed.
WTF?
I can post the code (C++, but not very complex) if that would help.
Vir Campestris <vir.campestris@invalid.invalid> wrote:
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
<snip>
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?
What instruction sequences are being generated for the 8/16/32/64 byte
loops? I'm wondering if the compiler is using different instructions,
eg using MMX, SSE, AVX to do the operations. Maybe they are having
different caching behaviour?
It would help if you could tell us the compiler and platform you're using, including version.
Theo
Vir Campestris <vir.campestris@invalid.invalid> wrote:
<snip>
Can we see the code ??
Can you present a table of the timing results ??
On 20/12/2023 18:08, Theo wrote:
Vir Campestris <vir.campestris@invalid.invalid> wrote:
<snip>
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Vir Campestris <vir.campestris@invalid.invalid> writes:
<snip>
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
On 21/12/2023 14:56, Scott Lurndal wrote:
<snip>
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one
back.
And
On 20/12/2023 18:58, MitchAlsup wrote:
Can we see the code ??
Can you present a table of the timing results ??
I've run this with more detailed increments on the line size, but here
are my results for powers of 2.
Size 1 gave 3.82242e+09 bytes/second.
Size 2 gave 3.80533e+09 bytes/second.
Size 4 gave 2.68017e+09 bytes/second.
Size 8 gave 2.33751e+09 bytes/second.
Size 16 gave 2.18424e+09 bytes/second.
Size 32 gave 2.10243e+09 bytes/second.
Size 64 gave 1.99371e+09 bytes/second.
Size 128 gave 1.98475e+09 bytes/second.
Size 256 gave 2.01653e+09 bytes/second.
Size 512 gave 2.00884e+09 bytes/second.
Size 1024 gave 2.02713e+09 bytes/second.
Size 2048 gave 2.01803e+09 bytes/second.
Size 4096 gave 3.26472e+09 bytes/second.
Size 8192 gave 3.85126e+09 bytes/second.
Size 16384 gave 3.85377e+09 bytes/second.
Size 32768 gave 3.85293e+09 bytes/second.
Size 65536 gave 2.06793e+09 bytes/second.
Size 131072 gave 2.06845e+09 bytes/second.
The run does continue beyond this, but the results are roughly stable for larger sizes.
The code I have put in a signature block; there's no point in risking
someone posting it again. I've commented it, but no doubt not in all the right places! I'd be interested to know what results other people get.
Thanks
Andy
On Thu, 21 Dec 2023 15:32:10 +0000
Vir Campestris <vir.campestris@invalid.invalid> wrote:
<snip>
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one
back.
And
DIMMs have 64-bit data buses. Both these days and previous millennium.
Now, these days DDR5 DIMM splits 64-bit data bus into pair of
independent 32-bit channels, but the total is still 64 bits.
That's a physical perspective. From a logical perspective, DDR3 and DDR4
DIMMs exchange data with the controller in 512-bit chunks. On a DDR5
DIMM each channel exchanges data with the controller in 512-bit chunks.
From a signaling perspective, it is still possible (at least on non-ECC
gear) to tell the memory to write just 8 bits out of 512 and to ignore
the rest. In PC-class hardware this ability is used very rarely, if
used at all.
Michael S wrote:
On Thu, 21 Dec 2023 15:32:10 +0000
Vir Campestris <vir.campestris@invalid.invalid> wrote:
<snip>
DIMMs have 64-bit data buses. Both these days and previous
millennium. Now, these days DDR5 DIMM splits 64-bit data bus into
pair of independent 32-bit channels, but the total is still 64
bits.
All DDR DIMMs have 64-bit busses (200 pins on the DIMM).
Some SDR (pre 2000) DIMMs had 32-bit busses,
some ancient laptop
memory carriers had 32-bit busses.
That's a physical perspective. From a logical perspective, DDR3 and
DDR4 DIMMs exchange data with the controller in 512-bit chunks. On a
DDR5 DIMM each channel exchanges data with the controller in 512-bit
chunks.
Note:: 512 bits is 64 bytes.
From a signaling perspective, it is still possible (at least on
non-ECC gear) to tell the memory to write just 8 bits out of 512 and
to ignore the rest. In PC-class hardware this ability is used very
rarely, if used at all.
Never justified when CPU uses a cache. So, only unCacheable (sized)
requests can use this. AMD memory controllers hid this from the DRAMs
so we at least had the opportunity to recompute ECC on ECC carrying
DIMMs. MC would ask DRC for the line of data, check (and repair) ECC
then write the unCacheable data, and place the written data in the
outbound memory queue with its new ECC.
On Thu, 21 Dec 2023 18:36:42 +0000
mitchalsup@aol.com (MitchAlsup) wrote:
<snip>
All DDR DIMMs have 64-bit busses (200 pins on the DIMM).
Some SDR (pre 2000) DIMMs had 32-bit busses,
Are you sure? My impression always was that the word DIMM was invented
to clearly separate 64-bit DIMMs from 32-bit SIMMs.
some ancient laptop
memory carriers had 32-bit busses.
But were they called just DIMM or SO-DIMM?
The latter indeed had a 72-pin variant with a 32-bit bus.
That's a physical perspective. From a logical perspective, DDR3 and
DDR4 DIMMs exchange data with the controller in 512-bit chunks. On a
DDR5 DIMM each channel exchanges data with the controller in 512-bit
chunks.
Note:: 512 bits is 64 bytes.
Which *not* co-incidentally matches the cache line size of the majority
of x86 CPUs.
I've run this with more detailed increments on the line size, but here
are my results for powers of 2.
{ A
Size 1 gave 3.82242e+09 bytes/second.
Size 2 gave 3.80533e+09 bytes/second.
Size 4 gave 2.68017e+09 bytes/second.
Size 8 gave 2.33751e+09 bytes/second.
Size 16 gave 2.18424e+09 bytes/second.
Size 32 gave 2.10243e+09 bytes/second.
Size 64 gave 1.99371e+09 bytes/second.
Size 128 gave 1.98475e+09 bytes/second.
Size 256 gave 2.01653e+09 bytes/second.
Size 512 gave 2.00884e+09 bytes/second.
Size 1024 gave 2.02713e+09 bytes/second.
Size 2048 gave 2.01803e+09 bytes/second.
} A
{ B
Size 4096 gave 3.26472e+09 bytes/second.
Size 8192 gave 3.85126e+09 bytes/second.
Size 16384 gave 3.85377e+09 bytes/second.
Size 32768 gave 3.85293e+09 bytes/second.
Size 65536 gave 2.06793e+09 bytes/second.
Size 131072 gave 2.06845e+09 bytes/second.
} B
A) Here we have a classical sequence pipelines often encounter, where
a simple loop starts off fast (4 bytes/cycle) and progressively slows
down by a factor of 2 (to 2 bytes per cycle) over some interval of
complexity (size). What is important is that factor of 2: something
that took 1 cycle early on starts taking 2 cycles later on.
B) Here we have a second classical sequence, where the performance at
some boundary (4096) reverts back to 1-cycle pipeline performance,
only to degrade again (in basically the same sequence as A). {{Side
note: size=4096 has "flutter" in the stairstep. size={8192..32768}
has peak performance--probably something to do with sets in the cache,
and 4096 is the size of a page (TLB effects)}}.
I suspect the flutter has something to do with your buffer crossing a
page boundary.
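One way to test that hypothesis, sketched under the assumption of C++17's std::aligned_alloc (the posted code's allocation isn't shown here): force the buffer onto a page boundary, then rerun with the start deliberately offset and see whether the flutter moves.

#include <cstdint>
#include <cstdlib>
#include <new>

// Allocate n_words 64-bit words starting exactly on a 4 KiB page.
std::uint64_t* alloc_page_aligned(std::size_t n_words)
{
    // aligned_alloc requires the size to be a multiple of the alignment
    std::size_t bytes = ((n_words * 8 + 4095) / 4096) * 4096;
    void* p = std::aligned_alloc(4096, bytes);
    if (!p) throw std::bad_alloc();
    return static_cast<std::uint64_t*>(p);
}

// Time the same loop over alloc_page_aligned(len), then again over a
// deliberately offset start (e.g. base + 1) and compare. Release the
// buffer with std::free().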
I'd be interested to see the results from anyone else's computer.
$ g++ -o cache cache.c
$ ./cache
Size 1 gave 4.13643e+09 bytes/second.
Size 2 gave 4.79971e+09 bytes/second.
Size 4 gave 4.87535e+09 bytes/second.
Size 8 gave 4.8321e+09 bytes/second.
Size 16 gave 4.71703e+09 bytes/second.
Size 32 gave 3.89488e+09 bytes/second.
Size 64 gave 4.02976e+09 bytes/second.
Size 128 gave 4.15832e+09 bytes/second.
Size 256 gave 4.19562e+09 bytes/second.
Size 512 gave 4.08511e+09 bytes/second.
Size 1024 gave 4.0796e+09 bytes/second.
Size 2048 gave 4.11983e+09 bytes/second.
Size 4096 gave 4.06869e+09 bytes/second.
Size 8192 gave 4.06807e+09 bytes/second.
Size 16384 gave 4.06217e+09 bytes/second.
Size 32768 gave 4.06067e+09 bytes/second.
Size 65536 gave 4.04791e+09 bytes/second.
Size 131072 gave 4.06143e+09 bytes/second.
Size 262144 gave 4.04301e+09 bytes/second.
Size 524288 gave 4.03872e+09 bytes/second.
Size 1048576 gave 3.97715e+09 bytes/second.
Size 2097152 gave 3.97609e+09 bytes/second.
Size 4194304 gave 3.98361e+09 bytes/second.
Size 8388608 gave 3.98617e+09 bytes/second.
Size 16777216 gave 3.98376e+09 bytes/second.
Size 33554432 gave 3.98504e+09 bytes/second.
Size 67108864 gave 3.98726e+09 bytes/second.
Size 134217728 gave 3.99495e+09 bytes/second.
Theo wrote:...
$ g++ -o cache cache.c
$ ./cache
{ A
Size 1 gave 4.13643e+09 bytes/second.
Size 2 gave 4.79971e+09 bytes/second.
Size 4 gave 4.87535e+09 bytes/second.
Size 8 gave 4.8321e+09 bytes/second.
Size 16 gave 4.71703e+09 bytes/second.
} A
{ B
Size 32 gave 3.89488e+09 bytes/second.
...
Size 134217728 gave 3.99495e+09 bytes/second.
} B
The B group are essentially 4.0 with noise.
The A group are essentially 4.8 with a startup flutter.
A 20% down or 25% up change at 32::64.
In comp.arch MitchAlsup <mitchalsup@aol.com> wrote:
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
{ A
Size 1 gave 6.22043e+09 bytes/second.
Size 2 gave 6.35674e+09 bytes/second.
Size 4 gave 4.14766e+09 bytes/second.
Size 8 gave 5.4239e+09 bytes/second.
Size 16 gave 6.01113e+09 bytes/second.
Size 32 gave 7.75976e+09 bytes/second.
Size 64 gave 8.7972e+09 bytes/second.
} A
{ B
Size 128 gave 9.71523e+09 bytes/second.
Size 256 gave 9.91644e+09 bytes/second.
Size 512 gave 1.00179e+10 bytes/second.
Size 1024 gave 1.0065e+10 bytes/second.
Size 2048 gave 9.78508e+09 bytes/second.
Size 4096 gave 9.76764e+09 bytes/second.
Size 8192 gave 9.86537e+09 bytes/second.
Size 16384 gave 9.90053e+09 bytes/second.
Size 32768 gave 9.91552e+09 bytes/second.
Size 65536 gave 9.84556e+09 bytes/second.
Size 131072 gave 9.78442e+09 bytes/second.
Size 262144 gave 9.80282e+09 bytes/second.
Size 524288 gave 9.81447e+09 bytes/second.
Size 1048576 gave 9.81981e+09 bytes/second.
Size 2097152 gave 9.81456e+09 bytes/second.
} B
{ C
Size 4194304 gave 9.70057e+09 bytes/second.
Size 8388608 gave 9.55507e+09 bytes/second.
Size 16777216 gave 9.44032e+09 bytes/second.
Size 33554432 gave 9.33896e+09 bytes/second.
Size 67108864 gave 9.28529e+09 bytes/second.
Size 134217728 gave 9.25213e+09 bytes/second.
} C
which seems to have more flutter in group A. I get similar results in a
VM running on an AMD Ryzen 9 5950X (32) @ 3.393GHz.
Theo
On 22/12/2023 21:58, Theo wrote:
<snip>
which seems to have more flutter in group A. I get similar results in a
VM running on an AMD Ryzen 9 5950X (32) @ 3.393GHz.
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?
Vir Campestris <vir.campestris@invalid.invalid> schrieb:
On 22/12/2023 21:58, Theo wrote:
<snip>
which seems to have more flutter in group A. I get similar results in a
VM running on an AMD Ryzen 9 5950X (32) @ 3.393GHz.
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?
That would put it in an executable named "fast" :-)
-march=native -mtune=native might make more of a difference, depending
on how good the compiler's model of your hardware is.
On 23/12/2023 23:50, Thomas Koenig wrote:
Vir Campestris <vir.campestris@invalid.invalid> schrieb:
On 22/12/2023 21:58, Theo wrote:
<snip>
which seems to have more flutter in group A. I get similar results in a
VM running on an AMD Ryzen 9 5950X (32) @ 3.393GHz.
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?
That would put it in an executable named "fast" :-)
Some programs are really fussy about capitalisation!
-march=native -mtune=native might make more of a difference, depending
on how good the compiler's model of your hardware is.
For gcc on modern x86-64 processors (IME at least), the difference
between -O0 and -O1 is often large, but the difference between -O1 and
higher levels usually makes far less difference. The processors
themselves do a good job of things like instruction scheduling and
register renaming at runtime, and are designed to be good for running
weakly optimised code. I find it makes a bigger difference on
processors that don't do as much at run-time, such as microcontroller
cores.
But the "-march=native" can make a very big difference, especially if it means the compiler can use SIMD or other advanced instructions. (You
don't need "-mtune" if you have "march", as it is implied by "-march" -
you only need both if you want to make a build that will run on many
x86-64 variants but is optimised for one particular one.)Â And the "-march=native" benefits go well with some of the higher optimisations -
thus it is good to combine "-march=native" with "-Ofast" or "-O2".
In practice, things also vary dramatically according to the type of
program.
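For concreteness, the sort of invocation being discussed would look like this (standard gcc flags - and note the capital O, which was the joke above):

$ g++ -Ofast -march=native -o cache cache.cpp
$ ./cache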
On 25/12/2023 14:56, David Brown wrote:
On 23/12/2023 23:50, Thomas Koenig wrote:
Vir Campestris <vir.campestris@invalid.invalid> schrieb:
On 22/12/2023 21:58, Theo wrote:
<snip>
which seems to have more flutter in group A. I get similar results in a
VM running on an AMD Ryzen 9 5950X (32) @ 3.393GHz.
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?
That would put it in an executable named "fast" :-)
YKWIM :P
<snip>
mtune=native _does_ make a difference. Somewhat to my surprise it makes
it _slower_ - the biggest difference being about 5% with a size of 128k UINT64.
I checked O0 (really slow) O1 (a lot faster) O3 (quite a bit faster too) Ofast (mostly slightly faster than O3 when I use a capital O) and O3 march=native.
The pipeline thing made me try something different - instead of
incrementing a value I used std::copy to copy each word to the next one.
That rose to a peak rate of about 64MB/S with a size of 32k, dropped
sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k. It
then dropped sharply to 10MB/S for all larger sizes.
A much more sensible result!
Andy
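As a point of reference, a sketch of that copy variant (my reconstruction, not the posted code). One caveat: std::copy into a destination that overlaps the source to its right is formally undefined behaviour, so std::copy_backward is the safe spelling of "copy each word to the next one":

#include <algorithm>
#include <cstdint>
#include <vector>

void copy_pass(std::vector<std::uint64_t>& buf)
{
    if (buf.size() < 2) return;
    // shift buf[0..n-2] into buf[1..n-1]; buf[0] is left unchanged
    std::copy_backward(buf.begin(), buf.end() - 1, buf.end());
}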
On Wed, 27 Dec 2023 14:20:08 +0000
Vir Campestris <vir.campestris@invalid.invalid> wrote:
That rose to a peak rate of about 64MB/S with a size of 32k, dropped
sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k.
It then dropped sharply to 10MB/S for all larger sizes.
A much more sensible result!
GB rather than MB, hopefully
Andy
Can you tell us what you are trying to achieve?
Is it a microbenchmark intended to improve your understanding of Zen1
internals?
Or part of the code that you want to run as fast as possible?
If the latter, what exactly do you want to be done by the code?
It is a microbenchmark.
I was trying to understand cache performance on my system.
Over in comp.lang.c++ you'll find a thread about the Sieve of Eratosthenes.
I and another poster are trying to optimise it. Not for any good reason
of course... but I was just curious about some of the results I was
getting.
Vir Campestris wrote:
The code I have put in a signature block; there's no point in risking
someone posting it again. I've commented it, but no doubt not in all the
right places! I'd be interested to know what results other people get.
Built and run under Win11 with g++ 13.2.0
on a laptop with
i7-11370H CPU (3.3GHz base, 4.28GHz turbo)
L1 d-cache 48kB per core
L2 cache 1.2MB
L3 cache 12MB
16GB memory
Size 1 gave 5.12047e+09 bytes/second.
Size 2 gave 5.76467e+09 bytes/second.
Size 4 gave 5.78979e+09 bytes/second.
Size 8 gave 5.6642e+09 bytes/second.
Size 16 gave 5.55303e+09 bytes/second.
Size 32 gave 4.4151e+09 bytes/second.
Size 64 gave 4.67783e+09 bytes/second.
Size 128 gave 4.85115e+09 bytes/second.
Size 256 gave 4.88147e+09 bytes/second.
Size 512 gave 4.5214e+09 bytes/second.
Size 1024 gave 4.64794e+09 bytes/second.
Size 2048 gave 4.69149e+09 bytes/second.
Size 4096 gave 4.69416e+09 bytes/second.
Size 8192 gave 4.73425e+09 bytes/second.
Size 16384 gave 4.76353e+09 bytes/second.
Size 32768 gave 4.70864e+09 bytes/second.
Size 65536 gave 4.75244e+09 bytes/second.
Size 131072 gave 4.76452e+09 bytes/second.
Size 262144 gave 4.69839e+09 bytes/second.
Size 524288 gave 4.71094e+09 bytes/second.
Size 1048576 gave 4.62922e+09 bytes/second.
Size 2097152 gave 4.55435e+09 bytes/second.
Size 4194304 gave 4.58315e+09 bytes/second.
Size 8388608 gave 4.55917e+09 bytes/second.
Size 16777216 gave 4.60661e+09 bytes/second.
Size 33554432 gave 4.63509e+09 bytes/second.
Size 67108864 gave 4.62349e+09 bytes/second.
Size 134217728 gave 4.62406e+09 bytes/second.
and (perversely?) with -Ofast
Andy Burns wrote:
Size 1 gave 5.12047e+09 bytes/second.
Size 2 gave 5.76467e+09 bytes/second.
Size 4 gave 5.78979e+09 bytes/second.
Size 8 gave 5.6642e+09 bytes/second.
Size 16 gave 5.55303e+09 bytes/second.
Size 32 gave 4.4151e+09 bytes/second.
Size 64 gave 4.67783e+09 bytes/second.
Size 128 gave 4.85115e+09 bytes/second.
Size 256 gave 4.88147e+09 bytes/second.
Size 512 gave 4.5214e+09 bytes/second.
Size 1024 gave 4.64794e+09 bytes/second.
Size 2048 gave 4.69149e+09 bytes/second.
Size 4096 gave 4.69416e+09 bytes/second.
Size 8192 gave 4.73425e+09 bytes/second.
Size 16384 gave 4.76353e+09 bytes/second.
Size 32768 gave 4.70864e+09 bytes/second.
Size 65536 gave 4.75244e+09 bytes/second.
Size 131072 gave 4.76452e+09 bytes/second.
Size 262144 gave 4.69839e+09 bytes/second.
Size 524288 gave 4.71094e+09 bytes/second.
Size 1048576 gave 4.62922e+09 bytes/second.
Size 2097152 gave 4.55435e+09 bytes/second.
Size 4194304 gave 4.58315e+09 bytes/second.
Size 8388608 gave 4.55917e+09 bytes/second.
Size 16777216 gave 4.60661e+09 bytes/second.
Size 33554432 gave 4.63509e+09 bytes/second.
Size 67108864 gave 4.62349e+09 bytes/second.
Size 134217728 gave 4.62406e+09 bytes/second.
and (perversely?) with -Ofast
{ A
Size 1 gave 4.64966e+09 bytes/second.
Size 2 gave 9.39481e+09 bytes/second.
Size 4 gave 1.59656e+10 bytes/second.
Size 8 gave 2.6789e+10 bytes/second.
Size 16 gave 4.42999e+10 bytes/second.
} A
{ B
Size 32 gave 4.51076e+10 bytes/second.
Size 64 gave 4.90103e+10 bytes/second.
Size 128 gave 4.08414e+10 bytes/second.
Size 256 gave 4.36978e+10 bytes/second.
Size 512 gave 4.54119e+10 bytes/second.
Size 1024 gave 4.50594e+10 bytes/second.
Size 2048 gave 4.71622e+10 bytes/second.
Size 4096 gave 4.60261e+10 bytes/second.
} B
{ C
Size 8192 gave 3.84764e+10 bytes/second.
Size 16384 gave 3.90878e+10 bytes/second.
Size 32768 gave 3.76718e+10 bytes/second.
Size 65536 gave 3.87339e+10 bytes/second.
Size 131072 gave 3.89058e+10 bytes/second.
Size 262144 gave 3.85662e+10 bytes/second.
Size 524288 gave 4.19209e+10 bytes/second.
} C
{ D
Size 1048576 gave 3.00962e+10 bytes/second.
Size 2097152 gave 1.37616e+10 bytes/second.
Size 4194304 gave 1.4136e+10 bytes/second.
Size 8388608 gave 1.39413e+10 bytes/second.
Size 16777216 gave 1.35747e+10 bytes/second.
Size 33554432 gave 1.37286e+10 bytes/second.
Size 67108864 gave 1.38163e+10 bytes/second.
Size 134217728 gave 1.29987e+10 bytes/second.
} D
Vir Campestris wrote:
It is a microbenchmark.
I was trying to understand cache performance on my system.
Over in comp.lang.c++ you'll find a thread about Sieve or Eratosthenes.
I and another poster are trying to optimise it. Not for any good
reason of course... but I was just curious about some of the results I
was getting.
A few years ago this guy popped up on youtube, saying "I was the author
of Task Manager on Windows NT". I ignored the recommendation for quite a
while, thinking he just wanted people to blow smoke up his arse;
eventually I watched, and he seems a decent guy ... anyway, he did a
"drag racing" series using the sieve prime algorithm:
<https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>
Others have contributed in all manner of languages
<https://github.com/PlummersSoftwareLLC/Primes>
long rfib(long x)
{ return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
On 12/30/2023 4:05 PM, MitchAlsup wrote:
BGB wrote:
  long rfib(long x)
    { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
At first I thought the following looked pretty good:
rfib:
    ENTER  R29,R0,#0         // just 3 preserved registers.
    CMP    R2,R1,#2
    PGE    R2,TT
    MOV    R1,#1             // return 1
    EXIT   R29,R0,#0
    SUB    R30,R1,#2
    SUB    R1,R1,#1
    CALL   rfib              // rfib(n-1)
    MOV    R29,R1
    MOV    R1,R30
    CALL   rfib              // rfib(n-2)
    ADD    R1,R1,R29
    EXIT   R29,R0,#0         // return rfib(n-1)+rfib(n-2)
Theoretically, could be done in my ISA something like:
rfib:
ADD -24, SP | MOV LR, R1
MOV 1, R2 | CMPGE 2, R4
BF .L0
MOV.X R12, (SP, 0)
MOV R4, R12 | MOV.Q R1, (SP, 16)
ADD R12, -1, R4
BSR rfib
MOV R2, R13 | ADD R12, -2, R4
BSR rfib
ADD R2, R13, R2
MOV.Q (SP, 16), R1
MOV.X (SP, 0), R12
.L0:
ADD 24, SP
JMP R1
But, my compiler isn't going to pull off anything like this.
Checking, actual BGBCC output:
rfib:
MOV LR, R1
BSR __prolog_0000_0000020000FC
ADD -208, SP
MOV RQ4, RQ13
// tk_shell_chksvar.c:234 { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
CMPQGE 2, RQ13
BT .L00802A62
MOV 1, RQ10
BRA .L00802A63
..L00802A62:
SUB RQ13, 1, RQ14
MOV RQ14, RQ4
BSR rfib
MOV RQ2, RQ12
SUB RQ13, 2, RQ14
MOV RQ14, RQ4
BSR rfib
MOV RQ2, RQ11
ADD RQ12, RQ11, RQ14
MOV RQ14, RQ10
..L00802A63:
MOV RQ10, RQ2
..L00C108CE:
ADD 208, SP
BRA __epilog_0000_0000020000FC
.balign 4
( Yes, this is more MOVs than "technically necessary", but eliminating
them from the compiler output is "easier said than done" given how some
parts of the compiler work. RQn/RDn are technically equivalent to Rn,
but the compiler distinguishes them, as early on the idea was that the
assembler might distinguish instructions based on type, like
EAX/RAX/AX/... on x86-64, but it didn't quite happen this way. )
And, fetching the prolog and epilog:
__prolog_0000_0000020000FC:
ADD -48, SP
MOV.Q R1, (SP, 40)
MOV.X R12, (SP, 16)
MOV.Q R14, (SP, 32)
MOV.X R10, (SP, 0)
RTS
__epilog_0000_0000020000FC:
MOV.Q (SP, 40), R1
MOV.X (SP, 0), R10
MOV.X (SP, 16), R12
MOV.Q (SP, 32), R14
ADD 48, SP
JMP R1
Or, a little closer to the final machine-code (BJX2 Baseline):
rfib: //@03E27C
4911 STC R1, R1
..reloc __prolog_0000_0000020000FC 0F/RELW20_BJX
F000_D000 BSR (PC, 0)
F2F1_D330 ADD -208, SP
18D4 MOV R4, R13
F2DA_C802 CMPQGE 2, R13
..reloc .L00802A62 0F/RELW20_BJX
E000_C000 BT (PC, 0)
DA01 MOV 1, R10
..reloc .L00802A63 0F/RELW20_BJX
F000_C000 BRA (PC, 0)
3000 NOP
..L00802A62: //@03E298
F2ED_11FF ADD R13, -1, R14
F04E_1089 MOV R14, R4
..reloc rfib 0F/RELW20_BJX
F000_D000 BSR (PC, 0)
F4C2_1089 MOV R2, R12 |
F2ED_11FE ADD R13, -2, R14
F04E_1089 MOV R14, R4
..reloc rfib 0F/RELW20_BJX
F000_D000 BSR (PC, 0)
F0B2_1089 MOV R2, R11
F0EC_10B0 ADD R12, R11, R14
F0AE_1089 MOV R14, R10
..L00802A63: //@03E2C0
182A MOV R10, R2
..L00C108CE: //@03E2C2
F2F0_D0D0 ADD 208, SP
..reloc __epilog_0000_0000020000FC 0F/RELW20_BJX
F000_C000 BRA (PC, 0)
3030 BRK
But then I moved save and restore to the rfib() calls, making the last
rung on the call tree less expensive by 6 memory references each, and by
saving/restoring only 2 registers rather than 3 at the cost of a MOV, so
each non-leaf level in the call tree saves/restores only 4 registers
instead of 6::
rfib:
    SUB    R2,R1,#2
    PNLT   R2,TT             // (n-2) < 0
    MOV    R1,#1             // return 1
    RET
    ENTER  R30,R0,#0         // just 2 preserved registers.
    MOV    R30,R2
    SUB    R1,R1,#1
    CALL   rfib              // rfib(n-1)
    MOV    R2,R1
    MOV    R1,R30
    MOV    R30,R2
    CALL   rfib              // rfib(n-2)
    ADD    R1,R1,R30
    EXIT   R30,R0,#0         // return rfib(n-1)+rfib(n-2)
But then I realized that rfib(n-2) has already been computed by
rfib(n-1). Since the size of the redundant element on the stack is
constant, rfib(n-1) makes a location available to rfib(n-2), so only 1
call is required instead of 2::
rfib:
    ENTER  R0,R0,#8
    STD    #0,[SP+8]          // setup rfib[n-2]
    CALL   rfib1              // recurse through rfib1
    EXIT   R0,R0,#8
rfib1:
    SUB    R2,R1,#2
    PNLT   R2,TT             // (n-2) < 0
    MOV    R1,#1             // return 1
    RET
    ENTER  R0,R0,#8          // no preserved registers just return
address
    SUB    R1,R1,#1
    CALL   rfib1             // rfib(n-1)
    LDD    R2,[SP]           // save rfib(n-2)
    ADD    R1,R1,R30
    STD    R1,[SP+16]        // restore rfib(n-2)
    EXIT   R0,R0,#8          // return rfib(n-1)+rfib(n-2)
....
If you cache the intermediate results of the calculations, this entirely defeats its use as a benchmark...
Say, because it drops from ~ O(2^n) to ~ O(n)...
Granted, There are plenty of other much more efficient ways of
calculating Fibonacci numbers...
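For what it's worth, a minimal memoized variant of rfib (a sketch of mine, not code from anyone's post, assuming a 64-bit long) shows how little it takes to get the O(2^n)-to-O(n) collapse BGB describes:

long mfib(long x)
{
    static long memo[92];                    // indices 0..91; rfib(91) is the
                                             // last value fitting a signed 64-bit long
    if (x < 2) return 1;                     // same base case as rfib above
    if (x < 92 && memo[x]) return memo[x];   // cached: O(n) total work
    long r = mfib(x - 1) + mfib(x - 2);
    if (x < 92) memo[x] = r;
    return r;
}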
Granted, There are plenty of other much more efficient ways of
calculating Fibonacci numbers...
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the naive
double recursive one.
On 31/12/2023 09:12, Pancho wrote:
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you need
to compute it often enough that efficiency might become relevant just
put the results in a table and use that. Given that fib goes over 10^12
after about 60 terms it needn't be a very big table.
The naive doubly recursive solution has the huge merit of being
human-readable.
Unlike, say:
 double fib( int n )
 {
     return (pow( (1+sqrt(5.0))/2, n+1 )
             - pow( (1-sqrt(5.0))/2, n+1 ))
         / sqrt(5.0);
 }
... which is great for fibs of big values, but not fast for small ones.
On 31/12/2023 15:45, Daniel James wrote:
On 31/12/2023 09:12, Pancho wrote:
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you
need to compute it often enough that efficiency might become relevant
just put the results in a table and use that. Given that fib goes over
10^12 after about 60 terms it needn't be a very big table.
Memoization is effectively a lookup table, but without the thinking. In
my case, avoiding thinking massively improves code reliability.
The naive doubly recursive solution has the huge merit of being
human-readable.
Yes, but you'd be surprised how quickly double recursion goes bad. I
haven't tested recently, but off the top of my head, the maximum
function call depth would be something like 300, so fib(10) would be a
stack overflow. fib(60) would need a stack function call depth ~ 1e18.
Unlike, say:
  double fib( int n )
  {
      return (pow( (1+sqrt(5.0))/2, n+1 )
              - pow( (1-sqrt(5.0))/2, n+1 ))
          / sqrt(5.0);
  }
... which is great for fibs of big values, but not fast for small ones.
Why isn't it fast for small fibs?
I seem to recall, if statements are more costly than floating point
function calls, like exponential.
Pancho wrote:
Unlike, say:
  double fib( int n )
  {
      return (pow( (1+sqrt(5.0))/2, n+1 )
              - pow( (1-sqrt(5.0))/2, n+1 ))
          / sqrt(5.0);
  }
... which is great for fibs of big values, but not fast for small ones.
Why isn't it fast for small fibs?
I seem to recall, if statements are more costly than floating point
function calls, like exponential.
The recursive fib() function has just a single if () statement, so worst
case you'll get one or two branch mispredicts, while the 2^n recursive
calls (for fib(10 or 11)) will probably each take about as long.
An iterative O(n) version will take 2-4 clock cycles per iteration, so
for n less than about 40, you'd only have ~100 clock cycles to evaluate
the two pow() calls. (I'm assuming the compiler or programmer have
pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as
1/sqrt(5), so that only the two pow() calls remain!)
You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.
Terje
On 31/12/2023 17:03, Terje Mathisen wrote:
The recursive fob() function has just a single if () statement, so worst
case you'll get one or two branch mispredicts, while the 2^n recursive
calls (for fib(10 or 11)) will probably each take about as long.
How many cycles for a branch mispredict? From real life, I have seen if
statements be more expensive than pow, but it was a long time ago, and
possibly code specific. It surprised me at the time, but ever since I
tend to ignore the cost of functions like pow().
On 30/12/2023 23:19, BGB wrote:
Granted, There are plenty of other much more efficient ways of
calculating Fibonacci numbers...
Yes. In order of effort:
1) Memoize the recursive function.
2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
recursive, which a smart compiler could theoretically convert to an
iterative loop (see the sketch after this list).
3) Write an iterative loop yourself.
4) Remember college maths, and solve the recurrence relation to give a
very simple closed form solution.
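A sketch of option 2 (my illustration, not Pancho's code), using the thread's fib(0) = fib(1) = 1 convention: carrying the last two values as accumulators makes the recursion a genuine tail call, which compilers can typically turn into a loop at -O2:

static unsigned long fib_acc(int n, unsigned long a, unsigned long b)
{
    // a = fib(k), b = fib(k-1); each tail call advances k by one
    return n == 0 ? a : fib_acc(n - 1, a + b, a);
}

unsigned long fib(int n)
{
    return n < 2 ? 1UL : fib_acc(n - 1, 1UL, 1UL);
}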
Pancho <Pancho.Jones@proton.me> writes:
On 30/12/2023 23:19, BGB wrote:
Granted, There are plenty of other much more efficient ways of
calculating Fibonacci numbers...
Yes. In order of effort.
1) Memoize the recursive function.
2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
recursive. Which a smart compiler could theoretically convert to an
iterative loop.
3) Write an iterative loop yourself.
4) Remember college maths, and solve the recurrence relation to give a
very simple closed form solution.
All of these approaches have drawbacks of one sort or another.
For (1), we end up writing more code and using more memory.
Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.
For (2), compilers are not so good at dealing with tail
recursion when the return value is a pair of results
rather than a single result.
For (3), iterative code takes more mental effort to see
that it works than does a simple recursive version.
Daniel James wrote:
Unlike, say:
  double fib( int n )
  {
      return (pow( (1+sqrt(5.0))/2, n+1 )
              - pow( (1-sqrt(5.0))/2, n+1 ))
          / sqrt(5.0);
  }
Looks fast to me::
fib:
    ADD   R2,R1,#1
    POWN   R3,#(1+sqrt(5.0))/2,R2
    POWN   R4,#(1-sqrt(5.0))/2,R2
    FADD   R2,R3,R4
    FDIV   R1,R2,#sqrt(5)
    RET
Daniel James <daniel@me.invalid> writes:
However, the (signed) 64-bit integer result overflows for fib(n) where
n > 45, which is more of a problem.
fib( 92 ) is still less than 1UL << 63 (or fib(93) for a 64-bit
unsigned type).
And fib( 186 ) still fits in a 128-bit unsigned type.
Thomas Koenig wrote:
#include <stdio.h>
unsigned long int fib (int n)
{
 unsigned long f1, f2, fn;
 if (n < 2)
   return 1;
 f1 = 1;
 f2 = 1;
do { // we know n >=2 so at least 1 path through the loop is performed.
  fn = f2 + f1;
   f1 = f2;
   f2 = fn;
   n--;
 } while( n > 2 );
 return fn;
}
Saves 1 cycle and at least 1 instruction of code footprint.
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
<snip>
For (3), iterative code takes more mental effort to see
that it works than does a simple recursive version.
I don't think that
#include <stdio.h>
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
is particularly intellectually challenging.
Daniel James <daniel@me.invalid> writes:
On 31/12/2023 09:12, Pancho wrote:
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you
need to compute it often enough that efficiency might become relevant
just put the results in a table and use that. Given that fib goes over
10^12 after about 60 terms it needn't be a very big table.
When the size of the table is more than ten times the size of
the code, and the code is reasonably fast, using a table is a
poor choice. Fibonacci functions don't have to be slow.
The naive doubly recursive solution has the huge merit of being
human-readable.
But unacceptably slow.
Unlike, say:
double fib( int n )
{
return (pow( (1+sqrt(5.0))/2, n+1 )
- pow( (1-sqrt(5.0))/2, n+1 ))
/ sqrt(5.0);
}
... which is great for fibs of big values, but not fast for small ones.
Unacceptable for fibs of big values, because the values computed
are wrong.
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
<snip>
For (3), iterative code takes more mental effort to see
that it works than does a simple recursive version.
I don't think that
#include <stdio.h>
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
is particularly intellectually challenging.
I'm sure there are other people who feel the same way.
Even so, when a short, functional, recursive version
is available, it is often easier to write, easier to
read, and simpler to show that it produces correct
results, in which case doing that is a better choice,
even if an imperative version is not particularly
intellectually challenging.
Incidentally your code produces wrong values, as
fib(0) is 0 and fib(1) is 1.
On 31/12/2023 16:40, Pancho wrote:
The naive doubly recursive solution has the huge merit of being
human-readable.
Yes, but you'd be surprised how quickly double recursion goes bad. I
haven't tested recently, but off the top of my head, the maximum
function call depth would be something like 300, so fib(10) would be
a stack overflow. fib(60) would need a stack function call depth ~
1e18.
On 64-bit gcc here on Debian (so, 64-bit ints) I can call fib(50)
without blowing the stack, but it takes minutes (on a Ryzen 7).
However, the (signed) 64-bit integer result overflows for fib(n) where
n > 45, which is more of a problem.
Unlike, say:
  double fib( int n )
  {
      return (pow( (1+sqrt(5.0))/2, n+1 )
              - pow( (1-sqrt(5.0))/2, n+1 ))
          / sqrt(5.0);
  }
... which is great for fibs of big values, but not fast for small
ones.
Why isn't it fast for small fibs?
Gut feeling. I haven't timed it, and it may be.
I mean ... obviously it will be as fast for small ones as for large, but
for small ones it won't beat an iterative calculation and for very small
ones I doubt it will beat the naive recursive method.
My point, though, was that clear code is often more important than fast execution, and while the naive recursive method for fibs gets horribly
slow for large values even it is acceptable for small ones.
I seem to recall, if statements are more costly than floating point
function calls, like exponential.
if statements are costly if they invalidate instruction prefetch and/or
jump to code that isn't in the cache ... but optimizers and CPUs are
pretty good at branch prediction and cacheing.
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Pancho <Pancho.Jones@proton.me> writes:
On 30/12/2023 23:19, BGB wrote:
Granted, There are plenty of other much more efficient ways of
calculating Fibonacci numbers...
Yes. In order of effort.
1) Memoize the recursive function.
2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
recursive. Which a smart compiler could theoretically convert to an
iterative loop.
3) Write an iterative loop yourself.
4) Remember college maths, and solve the recurrence relation to give a >>>> very simple closed form solution.
All of these approaches have drawbacks of one sort or another.
For (1), we end up writing more code and using more memory.
Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of the.
For (2), compilers are not so good at dealing with tail
recursion when the return value is a pair of results
rather than a single result.
For (3), iterative code takes more mental effort to see
that it works than does a simple recursive version.
I don't think that
#include <stdio.h>
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
is particularly intellecually challenging.
I'm sure there are other people who feel the same way.
Even so, when a short, functional, recursive version
is available, it is often easier to write, easier to
read, and simpler to show that it produces correct
results, in which case doing that is a better choice,
even if an imperative version is not particularly
intellectually challenging.
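For concreteness, a minimal sketch of approach (2) from the list above,
carrying the last two values as parameters (the mirror image of
returning them as a tuple) so that the recursive call is a tail call.
The helper name fib_acc is made up for illustration, and the fib(0) = 0,
fib(1) = 1 convention is used:

  #include <stdio.h>

  /* invariant: fib_acc(n, fib(k), fib(k+1)) == fib(k+n) */
  static unsigned long fib_acc(int n, unsigned long a, unsigned long b)
  {
      if (n == 0)
          return a;
      return fib_acc(n - 1, b, a + b);  /* tail call; gcc -O2 typically
                                           turns this into a loop */
  }

  unsigned long fib(int n)
  {
      return fib_acc(n, 0, 1);
  }

  int main(void)
  {
      for (int i = 0; i < 10; i++)
          printf("%lu ", fib(i));       /* 0 1 1 2 3 5 8 13 21 34 */
      putchar('\n');
      return 0;
  }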
Tim Rentsch wrote:[...]
Thomas Koenig <tkoenig@netcologne.de> writes:
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
Incidentally your code produces wrong values, as
fib(0) is 0 and fib(1) is 1.
So return n for (n<2)?
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Pancho <Pancho.Jones@proton.me> writes:
On 30/12/2023 23:19, BGB wrote:
Granted, there are plenty of other much more efficient ways of
calculating Fibonacci numbers...
Yes. In order of effort.
1) Memoize the recursive function.
2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
recursive. Which a smart compiler could theoretically convert to an
iterative loop.
3) Write an iterative loop yourself.
4) Remember college maths, and solve the recurrence relation to give a
very simple closed form solution.
All of these approaches have drawbacks of one sort or another.
For (1), we end up writing more code and using more memory.
Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.
For (2), compilers are not so good at dealing with tail
recursion when the return value is a pair of results
rather than a single result.
For (3), iterative code takes more mental effort to see
that it works than does a simple recursive version.
I don't think that
#include <stdio.h>
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
is particularly intellectually challenging.
I'm sure there are other people who feel the same way.
Even so, when a short, functional, recursive version
is available, it is often easier to write, easier to
read, and simpler to show that it produces correct
results, in which case doing that is a better choice,
even if an imperative version is not particularly
intellectually challenging.
Unless, of course, it runs in exponential time, as the
primitive version with the recurrence does.
Compilers often do not do a good job of replacing recursion
with iteration (is there a verb for that? Maybe it can be called iterationizing?), so it is something to consider if an algorithm
is at all time critical or stack space is problematic.
MitchAlsup <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
#include <stdio.h>
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
do { // we know n >=2 so at least 1 path through the loop is performed.
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
} while( n > 2 );
(This should be n >= 2)
Tim Rentsch wrote:
Daniel James <daniel@me.invalid> writes:
On 31/12/2023 09:12, Pancho wrote:
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only 'bad' solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you
need to compute it often enough that efficiency might become relevant
just put the results in a table and use that. Given that fib goes over
10^12 after about 60 terms it needn't be a very big table.
When the size of the table is more than ten times the size of
the code, and the code is reasonably fast, using a table is a
poor choice. Fibonacci functions don't have to be slow.
The naive doubly recursive solution has the huge merit of being
human-readable.
But unacceptably slow.
Unlike, say:
double fib( int n )
{
return (pow( (1+sqrt(5.0))/2, n+1 )
- pow( (1-sqrt(5.0))/2, n+1 ))
/ sqrt(5.0);
}
... which is great for fibs of big values, but not fast for small ones.
Unacceptable for fibs of big values, because the values computed
are wrong.
So either you move to quad or arbitrary precision (which will of course
be somewhat slow), or you use a hybrid technique, with lookup tables for
smallish n, an iterative loop for somewhat larger n, and your O(log(n))
approach only for very large n?
Working with 256-bit uints, each addition will take 4-8 clock cycles,
and it overflows well before n reaches 400, right? In that case a worst
case iterative solution is less than 3000 clock cycles or about a
microsecond away.
Terje
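For the record, the overflow bound Terje is estimating follows directly
from Binet's formula; a quick back-of-the-envelope (my arithmetic, not
from the thread):

  F_n \approx \varphi^n / \sqrt{5}, \qquad \log_2 \varphi \approx 0.6942,

so a 256-bit unsigned overflows once

  n \gtrsim \frac{256 + \tfrac{1}{2}\log_2 5}{\log_2 \varphi}
    \approx \frac{257.2}{0.6942} \approx 370,

comfortably "well before n reaches 400".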
MitchAlsup wrote:
Thomas Koenig wrote:
#include <stdio.h>
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
do { // we know n >=2 so at least 1 path through the loop is performed.
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
} while( n > 2 );
return fn;
}
Saves 1 cycle and at least 1 instruction of code footprint.
I have written asm code for this, and noted that on x86 where reg
MOVes used to cost a cycle, it was better to unroll and get rid of the reg-reg moves.
I think the following should work:
u64 fib(unsigned n)    // u64: an unsigned 64-bit integer type
{
  u64 a, b;
  if (n-- < 2) return 1;
  a = 1; b = 1;
  a += (n & 1);        // absorb an odd leftover step
  n >>= 1;
  while (n--) {        // each pass advances two positions
    b += a;
    a += b;
  }
  return a;
}
On 01/01/2024 15:00, Daniel James wrote:
double fib( int n )
{
return (pow( (1+sqrt(5.0))/2, n+1 )
- pow( (1-sqrt(5.0))/2, n+1 ))
/ sqrt(5.0);
}
... which is great for fibs of big values, but not fast for small
ones.
From a mathematical viewpoint, the closed form solution gives clearer behaviour.
Tim Rentsch wrote:
Daniel James <daniel@me.invalid> writes:
However, the (signed) 64-bit integer result overflows for fib(n) where
n > 45, which is more of a problem.
fib( 92 ) is still less than 1UL << 63 (or fib(93) for a 64-bit
unsigned type).
And fib( 186 ) still fits in a 128-bit unsigned type.
If calculating fib(n) for n in the fixed 0..186 range is actually
important to you, then you would obviously allocate 16*187 (about 3KB)
bytes of table space and do a simple lookup instead.
Tim Rentsch wrote:
Daniel James <daniel@me.invalid> writes:
On 31/12/2023 09:12, Pancho wrote:
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only 'bad' solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you
need to compute it often enough that efficiency might become relevant
just put the results in a table and use that. Given that fib goes over
10^12 after about 60 terms it needn't be a very big table.
When the size of the table is more than ten times the size of
the code, and the code is reasonably fast, using a table is a
poor choice. Fibonacci functions don't have to be slow.
The naive doubly recursive solution has the huge merit of being
human-readable.
But unacceptably slow.
Unlike, say:
double fib( int n )
{
return (pow( (1+sqrt(5.0))/2, n+1 )
- pow( (1-sqrt(5.0))/2, n+1 ))
/ sqrt(5.0);
}
... which is great for fibs of big values, but not fast for small ones.
Unacceptable for fibs of big values, because the values computed
are wrong.
So either you move to quad or arbitrary precision (which will of
course be somewhat slow),
or you use a hybrid technique, with lookup
tables for smallish n, iterative loop for somewhat larger and your
O(log(n)) approach only for very large n?
Working with 256-bit uints, each addition will take 4-8 clock cycles,
and it overflows well before n reaches 400, right? In that case a
worst case iterative solution is less than 3000 clock cycles or about
a microsecond away.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Incidentally your code produces wrong values, as
fib(0) is 0 and fib(1) is 1.
So return n for (n<2)?
In situations like this one, usually I would favor something
more like this:
if( n < 8 ) return 0xd8532110u >> n*4 & 15;
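To see where a constant like that comes from: fib(0)..fib(7) are 0, 1,
1, 2, 3, 5, 8, 13, all of which fit in four bits, so they pack into one
32-bit word, low nibble first. A throwaway generator (my sketch, not
Tim's):

  #include <stdio.h>

  int main(void)
  {
      unsigned long fib[8] = { 0, 1, 1, 2, 3, 5, 8, 13 };
      unsigned long packed = 0;

      /* value n lands in bits [4n, 4n+3], so  packed >> n*4 & 15
         recovers it */
      for (int n = 7; n >= 0; n--)
          packed = packed << 4 | fib[n];

      printf("0x%lx\n", packed);   /* prints 0xd8532110 */
      return 0;
  }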
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Pancho <Pancho.Jones@proton.me> writes:
On 30/12/2023 23:19, BGB wrote:
Granted, there are plenty of other much more efficient ways of
calculating Fibonacci numbers...
Yes. In order of effort.
1) Memoize the recursive function.
2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
recursive. Which a smart compiler could theoretically convert to an
iterative loop.
3) Write an iterative loop yourself.
4) Remember college maths, and solve the recurrence relation to give a
very simple closed form solution.
All of these approaches have drawbacks of one sort or another.
For (1), we end up writing more code and using more memory.
Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.
For (2), compilers are not so good at dealing with tail
recursion when the return value is a pair of results
rather than a single result.
For (3), iterative code takes more mental effort to see
that it works than does a simple recursive version.
I don't think that
#include <stdio.h>
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
is particularly intellectually challenging.
I'm sure there are other people who feel the same way.
Even so, when a short, functional, recursive version
is available, it is often easier to write, easier to
read, and simpler to show that it produces correct
results, in which case doing that is a better choice,
even if an imperative version is not particularly
intellectually challenging.
Unless, of course, it runs in exponential time, as the
primitive version with the recurrence does.
Sure. Implicit in my comment is that the alternatives being
considered are algorithmically comparable. Otherwise it's
comparing apples and oranges.
Compilers often do not do a good job of replacing recursion
with iteration (is there a verb for that? Maybe it can be called
iterationizing?), so it is something to consider if an algorithm
is at all time critical or stack space is problematic.
Of course. I take it as given that recursive functions that use
an actual recursive call are unacceptable, except perhaps in
special cases, as for example quicksort. I tend to use functional
recursion and mutual recursion a fair amount, but not in cases
where an actual call is done in the recursive cycle.
By the way a term that may satisfy your question is "tail call
optimization".
MitchAlsup <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
#include <stdio.h>
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
do { // we know n >=2 so at least 1 path through the loop is performed.
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
} while( n > 2 );
(This should be n >= 2)
return fn;
}
Saves 1 cycle and at least 1 instruction of code footprint.
Fortunately, this is one of the transformations which compilers
do automatically these days. For both versions, the code for
fib generated by a recent gcc on x86_64 with -O3 is
fib:
.LFB11:
.cfi_startproc
movl $1, %edx
cmpl $1, %edi
jle .L1
movl $1, %eax
movl $1, %ecx
jmp .L3
.p2align 4,,10
.p2align 3
.L5:
movq %rdx, %rax
.L3:
subl $1, %edi
leaq (%rcx,%rax), %rdx
movq %rax, %rcx
cmpl $1, %edi
jne .L5
.L1:
movq %rdx, %rax
ret
which looks pretty good.
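For anyone who wants to reproduce the listing, it is just (assuming the
function lives in fib.c):

  gcc -O3 -S -o - fib.c

gcc's -S stops after compilation proper and emits the assembly instead
of an object file.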
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:[...]
Thomas Koenig <tkoenig@netcologne.de> writes:
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
[...]
Incidentally your code produces wrong values, as
fib(0) is 0 and fib(1) is 1.
So return n for (n<2)?
In situations like this one, usually I would favor something
more like this:
if( n < 8 ) return 0xd8532110u >> n*4 & 15;
Something like
if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>> (60-n*6) & 63;
Terje Mathisen wrote:
Something like
  if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
(60-n*6) & 63;
unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
144, 233 };
unsigned short half[11] = { 377, 610, 987, 1597, 2584, 4181, 6765,
10946, 17711, 28657, 46368 };
fib( unsigned n )
{
   if( n <= 13 ) return bytes[n];
   if( n <= 24 ) return half[n-14];
   return fib( n-1 ) + fib( n-2 );
}
Very nice except for not using the unroll-by-two, started with the two
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Incidentally your code produces wrong values, as
fib(0) is 0 and fib(1) is 1.
So return n for (n<2)?
In situations like this one, usually I would favor something
more like this:
if( n < 8 ) return 0xd8532110u >> n*4 & 15;
Which is::
CMP R2,R1,#8
PLT R2,TTT
LEA R2,[IP,R1<<2,0xF00000000-.]
SLL R1,0xd8532110,R2
RET
....
Terje Mathisen wrote:
Something like
if (n < 11) return
((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>> (60-n*6) & 63;
unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
144, 233 };
unsigned short half[11] = { 377, 610, 987, 1597, 2584, 4181, 6765,
10946, 17711, 28657, 46368 };
fib( unsigned n )
{
if( n <= 13 ) return bytes[n];
if( n <= 24 ) return half[n-14];
return fib( n-1 ) + fib( n-2 );
}
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
[...]
Incidentally your code produces wrong values, as
fib(0) is 0 and fib(1) is 1.
So return n for (n<2)?
In situations like this one, usually I would favor something
more like this:
if( n < 8 ) return 0xd8532110u >> n*4 & 15;
Nice!
Could be extended a little bit more with a lookup table, or to 16
values with a SIMD in-register lookup operation.
Using a 64-bit
constant allows 11 values to fit inside 6-bit sub-containers, as long
as they are stored in reverse order so that the bottom value (0) can
survive without the top two bits.
Something like
if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>> (60-n*6) & 63;
Anyway, this is code golfing and not a realistic scenario where you
would definitely use a direct lookup table for all the possible
values.
unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
144, 233 };
unsigned short half[11] = { 377, 610, 987, 1597, 2584, 4181, 6765,
10946, 17711, 28657, 46368 };
fib( unsigned n )
{
if( n <= 13 ) return bytes[n];           // fib(0) .. fib(13)
if( n <= 24 ) return half[n-14];         // fib(14) .. fib(24)
uint64_t a = half[10], b = half[9];      // a = fib(24), b = fib(23)
n -= 24;
if( n & 1 ) { a += b; b = a - b; n--; }  // odd leftover: one single step
for( ; n; n -= 2 )
{
b += a;        // next position
a += b;        // and the one after: two per pass
}
return a;
}
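A quick scaffold for checking the hand-off points in a hybrid like the
one above -- the last bytes[] entry, the half[] boundaries, and the
first loop-computed values (my test code; compare its output against
the hybrid's):

  #include <stdio.h>
  #include <stdint.h>

  /* plain single-step reference, fib(0) = 0 convention */
  static uint64_t fib_ref(unsigned n)
  {
      uint64_t a = 0, b = 1;
      while (n--) { uint64_t t = a + b; a = b; b = t; }
      return a;
  }

  int main(void)
  {
      unsigned probes[] = { 13, 14, 24, 25, 26, 50 };
      for (unsigned i = 0; i < sizeof probes / sizeof *probes; i++)
          printf("fib(%u) = %llu\n", probes[i],
                 (unsigned long long)fib_ref(probes[i]));
      return 0;
  }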
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
unsigned long int fib (int n)
{
unsigned long f1, f2, fn;
if (n < 2)
return 1;
f1 = 1;
f2 = 1;
while (n >= 2) {
fn = f2 + f1;
f1 = f2;
f2 = fn;
n--;
}
return fn;
}
[...]
Incidentally your code produces wrong values, as
fib(0) is 0 and fib(1) is 1.
So return n for (n<2)?
In situations like this one, usually I would favor something
more like this:
if( n < 8 ) return 0xd8532110u >> n*4 & 15;
Nice!
Could be extended a little bit more with a lookup table, or to 16
values with a SIMD in-register lookup operation.
No reason to provide more values. It turns out these initial 8
values are just what the doctor ordered as a lead in to my latest
iterative version. In earlier versions putting in this special
case hurt more than it helped.
Using a 64-bit
constant allows 11 values to fit inside 6-bit sub-containers, as long
as they are stored in reverse order so that the bottom value (0) can
survive without the top two bits.
Something like
if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
(60-n*6) & 63;
Probably more clever than useful. If I were doing this most likely
I would write the constant in octal:
0001010203051015254267u >> (60-6*n) & 63
Note by the way that you left off the value 55, which puts the
values 21 and 34 in the wrong places.
Anyway, this is code golfing and not a realistic scenario where you
would definitely use a direct lookup table for all the possible
values.
A full lookup table is too doggone big. Also part of the point
of using the constant shift method is not to access data memory,
as that has other costs (and which are unlikely to show up in
speed measurements). The computational form is plenty fast
enough (worst case time between 7 and 8 nanoseconds in my tests).
This technique of packing several small values into a constant
and shifting to get a specific result is useful in lots of
situations. For example, if we are writing a function to test
for primality, it can start with
if( n < 64 ) return 0240404242024042424254 >> n & 1;
for a fast test on small integers.
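The constant can be generated rather than typed in; a throwaway sieve
over 0..63 (my sketch) that prints the same mask in octal:

  #include <stdio.h>

  int main(void)
  {
      unsigned long long mask = 0;

      /* set bit n iff n is prime, by trial division */
      for (int n = 2; n < 64; n++) {
          int prime = 1;
          for (int d = 2; d * d <= n; d++)
              if (n % d == 0) { prime = 0; break; }
          if (prime)
              mask |= 1ULL << n;
      }
      printf("0%llo\n", mask);   /* prints 0240404242024042424254 */
      return 0;
  }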
Since Tim is using a 4-way unroll, I believe he would be loading 4
values from a table of at least 7 entries to handle loop alignment.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Daniel James <daniel@me.invalid> writes:
On 31/12/2023 09:12, Pancho wrote:
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only 'bad' solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you
need to compute it often enough that efficiency might become relevant
just put the results in a table and use that. Given that fib goes over
10^12 after about 60 terms it needn't be a very big table.
When the size of the table is more than ten times the size of
the code, and the code is reasonably fast, using a table is a
poor choice. Fibonacci functions don't have to be slow.
The naive doubly recursive solution has the huge merit of being
human-readable.
But unacceptably slow.
Unlike, say:
double fib( int n )
{
return (pow( (1+sqrt(5.0))/2, n+1 )
- pow( (1-sqrt(5.0))/2, n+1 ))
/ sqrt(5.0);
}
... which is great for fibs of big values, but not fast for small ones.
Unacceptable for fibs of big values, because the values computed
are wrong.
So either you move to quad or arbitrary precision (which will of
course be somewhat slow),
To me, what makes floating point unattractive in this case is
great care must be taken to ensure correct values are produced.
Add to that the increased time cost, and using floating point
just looks like a lot more downside than upside.
or you use a hybrid technique, with lookup
tables for smallish n, iterative loop for somewhat larger and your
O(log(n)) approach only for very large n?
Oh, you remembered I wrote an O(log(n)) fibonacci function. That
is nice. :)
Based on what I know now, my inclination is to use the constant-shift technique (as seen in another post) for n < 8, the O(log(n)) method
for results that will be more than 64 bits, and a small-constant-linear method in between those two regions. (Incidentally, the intermediate
method is something I had already written, and might be seen as a generalization of your two-fold speedup code. You might want to take
a look at enhancing the two-fold speedup idea so it does at most one of those, and then after that gets a four-fold speedup. The best
performing code I have for 64 bits goes two steps farther, so all the iterations after the "startup transient" are 16-fold speedup.)
Working with 256-bit uints, each addition will take 4-8 clock cycles,
and it overflows well before n reaches 400, right? In that case a
worst case iterative solution is less than 3000 clock cycles or about
a microsecond away.
By test, my O(log(n)) code runs in 25-30 nanoseconds for 128-bit
results. At a guess, 256-bit results would run about 5 to 10
times slower than that.
By the way, your two-fold speedup code runs about 1.6 times as
fast as using simple iteration.
Tim Rentsch wrote:[...]
This technique of packing several small values into a constant
and shifting to get a specific result is useful in lots of
situations. For example, if we are writing a function to test
for primality, it can start with
if( n < 64 ) return 0240404242024042424254 >> n & 1;
for a fast test on small integers.
[...]
Your 64-bit bitmap here is (inversely) related to the modulo 30
packing I'm using in my sieve, where the 8 possible prime
locations in each block of 30 numbers will fit in a byte.
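For anyone following along, the packing works because above 5 every
prime is congruent mod 30 to one of the eight residues coprime to 30,
so a sieve to N needs about N/30 bytes. A minimal sketch of the byte
layout (my code, not Terje's sieve):

  #include <stdio.h>

  /* every prime > 5 falls on one of these residues, so each block
     of 30 numbers packs into one byte */
  static const unsigned wheel[8] = { 1, 7, 11, 13, 17, 19, 23, 29 };

  /* bit i of block[b] stands for the number 30*b + wheel[i] */
  static int test_bit(const unsigned char *block, unsigned n)
  {
      for (int i = 0; i < 8; i++)
          if (n % 30 == wheel[i])
              return (block[n / 30] >> i) & 1;
      return 0;   /* multiples of 2, 3 or 5 never land on the wheel */
  }

  int main(void)
  {
      unsigned char block[2] = { 0 };
      block[1] |= 1 << 0;   /* mark 31 = 30*1 + wheel[0] */
      printf("%d %d\n", test_bit(block, 31), test_bit(block, 32)); /* 1 0 */
      return 0;
  }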
Terje Mathisen wrote:
Since Tim is using a 4-way unroll, I believe he would be loading 4
values from a table of at least 7 entries to handle loop alignment.
I took another look at this, and it looks pretty nice to use 4
starting entries (a..d in increasing order), then a two-cycle
recurrence:
a = c+d
b = c+2*d
c = a+b
d = a+2*b
which every LEA-capable compiler around could convert into
lea rax,[rcx+rdx]
lea rbx,[rcx+2*rdx]
lea rcx,[rax+rbx]
lea rdx,[rax+2*rbx]
The pairs of instructions can run at the same time, so this can run in
two cycles as long as back-to-back LEA can run in subsequent cycles.
To run even faster than this I seem to need actual multiplications,
with small fibonacci numbers as the multipliers:
f(n+1) = f(n)+f(n-1)
f(n+2) = 2*f(n)+f(n-1)
f(n+3) = 3*f(n)+2*f(n-1)
f(n+4) = 5*f(n)+3*f(n-1)
...
f(n+8) = 34*f(n)+21*f(n-1)
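Spelled out as C, the two-cycle recurrence above might look like this
(my sketch; it only handles multiples of 4, and a real version would
peel off 0-3 leading steps first):

  #include <stdio.h>
  #include <stdint.h>

  static uint64_t fib4(unsigned n)    /* requires n >= 4 and n % 4 == 0 */
  {
      uint64_t a, b, c = 3, d = 5;    /* c = fib(4), d = fib(5) */
      for (unsigned k = 4; k < n; k += 4) {
          a = c + d;                  /* fib(k+2) -- one LEA */
          b = c + 2*d;                /* fib(k+3) -- one LEA */
          c = a + b;                  /* fib(k+4) -- one LEA */
          d = a + 2*b;                /* fib(k+5) -- one LEA */
      }
      return c;                       /* fib(n) */
  }

  int main(void)
  {
      printf("%llu %llu\n",
             (unsigned long long)fib4(8),
             (unsigned long long)fib4(12));   /* 21 144 */
      return 0;
  }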
Tim Rentsch wrote:
By test, my O(log(n)) code runs in 25-30 nanoseconds for 128-bit results.
That is impressive!
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:[...]
This technique of packing several small values into a constant
and shifting to get a specific result is useful in lots of
situations. For example, if we are writing a function to test
for primality, it can start with
if( n < 64 ) return 0240404242024042424254 >>n &1;
for a fast test on small integers.
[...]
Your 64-bit bitmap here is (inversely) related to the modulo 30
packing I'm using in my sieve, where the 8 possible prime
locations in each block of 30 numbers will fit in a byte.
I wrote one of these about 20 years ago, and recently was motivated
to write one again. I'm curious - have you tried running your
sieve for all primes less than a trillion (1e12)?
To run even faster than this I seem to need actual multiplications, with small fibonacci numbers as the multipliers:
f(n+1) = f(n)+f(n-1)
f(n+2) = 2*f(n)+f(n-1)
f(n+3) = 3*f(n)+2*f(n-1)
f(n+4) = 5*f(n)+3*f(n-1)
....
f(n+8) = 34*f(n)+21*f(n-1)
Terje Mathisen wrote:
To run even faster than this I seem to need actual multiplications, with
small fibonacci numbers as the multipliers:
f(n+1) = f(n)+f(n-1)
f(n+2) = 2*f(n)+f(n-1)
f(n+3) = 3*f(n)+2*f(n-1)
f(n+4) = 5*f(n)+3*f(n-1)
....
f(n+8) = 34*f(n)+21*f(n-1)
Combining::
unsigned short table[19] = { 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
144, 233, 377, 610, 987, 1597, 2584 };
fib( unsigned n )
{
i = n & 15;
a = table[i ];
b = table[i+1];
for( n -= i; n ; n -= 16 )
{
t = a*table[15] + b*table[16];   // step of 16: multipliers fib(15),
b = a*table[16] + b*table[17];   // fib(16) and fib(17)
a = t;
}
return a;
}
Knocking off 16 fibs per loop.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Terje Mathisen wrote:
Since Tim is using a 4-way unroll, I believe he would be loading
4 values from a table of at least 7 entries to handle loop
alignment.
I took another look at this, and it looks pretty nice to use 4
starting entries (a..d in increasing order), then a two-cycle
recurrence:
a = c+d
b = c+2*d
c = a+b
d = a+2*b
which every LEA-capable compiler around could convert into
lea rax,[rcx+rdx]
lea rbx,[rcx+2*rdx]
lea rcx,[rax+rbx]
lea rdx,[rax+2*rbx]
The pairs of instructions can run at the same time, so this can run
in two cycles as long as back-to-back LEA can run in subsequent
cycles.
To run even faster than this I seem to need actual
multiplications, with small fibonacci numbers as the multipliers:
f(n+1) = f(n)+f(n-1)
f(n+2) = 2*f(n)+f(n-1)
f(n+3) = 3*f(n)+2*f(n-1)
f(n+4) = 5*f(n)+3*f(n-1)
...
f(n+8) = 34*f(n)+21*f(n-1)
As much as I am tempted to post the actual code, let me start with
something more like a hint, which I fully expect you can carry
forward to do the whole deal.
(forward by 0)   a       b
(forward by 1)   b       a+b
(forward by 2)   a+b     a+2b
(forward by 3)   a+2b    2a+3b
(forward by 4)   2a+3b   3a+5b
(forward by 5)   3a+5b   5a+8b
(forward by 6)   5a+8b   8a+13b
and so forth.
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Terje Mathisen wrote:
Since Tim is using a 4-way unroll, I believe he would be loading
4 values from a table of at least 7 entries to handle loop
alignment.
I took another look at this, and it looks pretty nice to use 4
starting entries (a..d in increasing order), then a two-cycle
recurrence:
a = c+d
b = c+2*d
c = a+b
d = a+2*b
which every LEA-capable compiler around could convert into
lea rax,[rcx+rdx]
lea rbx,[rcx+2*rdx]
lea rcx,[rax+rbx]
lea rdx,[rax+2*rbx]
The pairs of instructions can run at the same time, so this can run
in two cycles as long as back-to-back LEA can run in subsequent
cycles.
To run even faster than this I seem to need actual
multiplications, with small fibonacci numbers as the multipliers:
f(n+1) = f(n)+f(n-1)
f(n+2) = 2*f(n)+f(n-1)
f(n+3) = 3*f(n)+2*f(n-1)
f(n+4) = 5*f(n)+3*f(n-1)
...
f(n+8) = 34*f(n)+21*f(n-1)
As much as I am tempted to post the actual code, let me start with
something more like a hint, which I fully expect you can carry
forward to do the whole deal.
(forward by 0)   a       b
(forward by 1)   b       a+b
(forward by 2)   a+b     a+2b
(forward by 3)   a+2b    2a+3b
(forward by 4)   2a+3b   3a+5b
(forward by 5)   3a+5b   5a+8b
(forward by 6)   5a+8b   8a+13b
and so forth.
Isn't that exactly the same as what I posted just above, i.e. that
subsequent levels require pairs of multiplications by smaller fib
numbers?
All these muls are of course independent, but since we only have a
very small number of multipliers in current CPUS, we cannot expect to
do more than a pair of them in each cycle (all with latency 3-5).
Doing it with adds instead means that we have to generate a,2*a,3*a,5*a,8*a,13*a etc.
We can use LEA and shifts to handle a*(2,3,5,8) and b ditto, so that
gets us to the next 5 fib numbers all done in parallel and at latency
1 or 2:
All these can be done in parallel on a 5-wide CPU:
a5 = LEA(a,a*4);
b3 = LEA(b,b*2); b5 = LEA(b,b*4);
f(n) = a+b;
f(n+1) = LEA(b+a*2);
These next operations can run in the second cycle, so we've generated
5 new fib() numbers in two cycles:
f(n+2) = f(n)+f(n+1);
f(n+3) = a5+b3;
f(n+4) = LEA(b5+a*8);
In the same cycle we can generate a couple more aggregate multiples:
a13 = LEA(a5,a*8); b13=LEA(b5,b*8);
and then we get to spew out the 6th fib value in the third cycle
f(n+5) = LEA(a13,b*8);
but I get the feeling that it is better to stay at 4 or 5 new numbers
in each step?
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Terje Mathisen wrote:
Since Tim is using a 4-way unroll, I believe he would be loading
4 values from a table of at least 7 entries to handle loop
alignment.
I took another look at this, and it looks pretty nice to use 4
starting entries (a..d in increasing order), then a two-cycle
recurrence:
a = c+d
b = c+2*d
c = a+b
d = a+2*b
which every LEA-capable compiler around could convert into
lea rax,[rcx+rdx]
lea rbx,[rcx+2*rdx]
lea rcx,[rax+rbx]
lea rdx,[rax+2*rbx]
The pairs of instructions can run at the same time, so this can run
in two cycles as long as back-to-back LEA can run in subsequent
cycles.
To run even faster than this I seem to need actual
multiplications, with small fibonacci numbers as the multipliers:
f(n+1) = f(n)+f(n-1)
f(n+2) = 2*f(n)+f(n-1)
f(n+3) = 3*f(n)+2*f(n-1)
f(n+4) = 5*f(n)+3*f(n-1)
...
f(n+8) = 34*f(n)+21*f(n-1)
As much as I am tempted to post the actual code, let me start with
something more like a hint, which I fully expect you can carry
forward to do the whole deal.
(forward by 0)   a       b
(forward by 1)   b       a+b
(forward by 2)   a+b     a+2b
(forward by 3)   a+2b    2a+3b
(forward by 4)   2a+3b   3a+5b
(forward by 5)   3a+5b   5a+8b
(forward by 6)   5a+8b   8a+13b
and so forth.
Isn't that exactly the same as what I posted just above, i.e. that
subsequent levels require pairs of multiplications by smaller fib
numbers?
I didn't really follow what you were saying, but I'm pretty
sure the two aren't the same, because I keep only two values,
a and b, no matter how many levels of x-fold-speedup is done.
All these muls are of course independent, but since we only have a
very small number of multipliers in current CPUS, we cannot expect to
do more than a pair of them in each cycle (all with latency 3-5).
Doing it with adds instead means that we have to generate
a,2*a,3*a,5*a,8*a,13*a etc.
We can use LEA and shifts to handle a*(2,3,5,8) and b ditto, so that
gets us to the next 5 fib numbers all done in parallel and at latency
1 or 2:
All these can be done in parallel on a 5-wide CPU:
a5 = LEA(a,a*4);
b3 = LEA(b,b*2); b5 = LEA(b,b*4);
f(n) = a+b;
f(n+1) = LEA(b+a*2);
These next operations can run in the second cycle, so we've generated
5 new fib() numbers in two cycles:
f(n+2) = f(n)+f(n+1);
f(n+3) = a5+b3;
f(n+4) = LEA(b5+a*8);
In the same cycle we can generate a couple more aggregate multiples:
a13 = LEA(a5,a*8); b13=LEA(b5,b*8);
and then we get to spew out the 6th fib value in the third cycle
f(n+5) = LEA(a13,b*8);
but I get the feeling that it is better to stay at 4 or 5 new numbers
in each step?
In the code that follows, b is the current fibonacci number, and a
is the previous fibonacci number.
typedef unsigned long long U64;   // assumed 64-bit unsigned type
U64
fib_example( unsigned n ){
U64 b = n&1, a = b^1;
if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
U64 na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}
n >>= 4;
while( n-- ){
U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}
return b;
}
Note that fib(8) is 21, and fib(16) is 987.
It should be easy to see how to modify this code to extend it one
level and get a 32-fold speedup for the last stage, which does in
fact give a significant performance improvement.
Also I have no doubt that you can see ways to make the code run
even faster. I'm interested to see what you come up with, with
the stipulation that no lookup tables be used.
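One way to take up that challenge is to first pin the routine against a
plain iterative reference over the whole range where the result fits in
64 bits (my harness; it assumes fib_example is compiled into the same
program):

  #include <stdio.h>

  typedef unsigned long long U64;
  extern U64 fib_example(unsigned n);   /* the routine quoted above */

  /* single-step reference, fib(0) = 0 convention */
  static U64 fib_ref(unsigned n)
  {
      U64 a = 0, b = 1;
      while (n--) { U64 t = a + b; a = b; b = t; }
      return a;
  }

  int main(void)
  {
      for (unsigned n = 0; n <= 93; n++)   /* fib(93) is the last to fit */
          if (fib_example(n) != fib_ref(n)) {
              printf("mismatch at n = %u\n", n);
              return 1;
          }
      printf("all ok\n");
      return 0;
  }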
MitchAlsup wrote:
Terje Mathisen wrote:
To run even faster than this I seem to need actual multiplications,
with small fibonacci numbers as the multipliers:
f(n+1) = f(n)+f(n-1)
f(n+2) = 2*f(n)+f(n-1)
f(n+3) = 3*f(n)+2*f(n-1)
f(n+4) = 5*f(n)+3*f(n-1)
....
f(n+8) = 34*f(n)+21*f(n-1)
Combining::
unsigned short table[19] = { 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
144, 233, 377, 610, 987, 1597, 2584 };
fib( unsigned n )
{
i = n & 15;
a = table[i ];
b = table[i+1];
for( n -= i; n ; n -= 16 )
{
t = a*table[15] + b*table[16];   // step of 16: multipliers fib(15),
b = a*table[16] + b*table[17];   // fib(16) and fib(17)
a = t;
On 1/2/2024 2:08 AM, Tim Rentsch wrote:
Daniel James <daniel@me.invalid> writes:
On 31/12/2023 09:12, Pancho wrote:
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only 'bad' solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you
need to compute it often enough that efficiency might become relevant
just put the results in a table and use that. Given that fib goes over
10^12 after about 60 terms it needn't be a very big table.
When the size of the table is more than ten times the size of
the code, and the code is reasonably fast, using a table is a
poor choice. Fibonacci functions don't have to be slow.
The naive doubly recursive solution has the huge merit of being
human-readable.
But unacceptably slow.
But, say, if one wanted something less slow, say (untested):
long fasterfib(int n)
{
long c0, c1, c2;
if(n<2)
return(1);
c0=1;
c1=1;
c2=c0+c1;
while(n>2)
{
c0=c1;
c1=c2;
c2=c0+c1;
n--;
}
return(c2);
}
Tim Rentsch wrote:
In the code that follows, b is the current fibonacci number, and a
is the previous fibonacci number.
U64
fib_example( unsigned n ){
U64 b = n&1, a = b^1;
if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
U64 na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}
n >>= 4;
while( n-- ){
U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}
return b;
}
Note that fib(8) is 21, and fib(16) is 987.
You are skipping intermediate numbers!
Still very nice idea. :-)
I also vaguely recall reading at some point that even
recursion that was "easy" to convert to tail recursion was
not handled by gcc (also many years ago).
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
I also vaguely recall reading at some point that even
recursion that was "easy" to convert to tail recursion was
not handled by gcc (also many years ago).
What looks like a tail-call to a Lisp or Scheme programmer actually is
not when it comes to C. That's because in C calling conventions, the
caller is responsible for cleaning up the stack, and this aspect of
the calling conventions is necessitated by the combination of varargs
and the optionality of function declarations.
So when in a C program you have
...
return foo(...);
}
in assembly language this is:
call foo
add $96, %rsp ;or something like that
return
and tail-call elimination of that is not possible in general.
gcc has something called sib-call elimination ("sib" for sibling), and AFAIK
this includes tail-recursion elimination, but I expect that even that
does not work at all times, e.g., because a pointer is passed to the
called function, and the C compiler cannot prove that it does not
point to something that is on the stack frame of the caller.
- anton
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
I also vaguely recall reading at some point that even
recursion that was "easy" to convert to tail recursion was
not handled by gcc (also many years ago).
What looks like a tail-call to a Lisp or Scheme programmer actually is
not when it comes to C. That's because in C calling conventions, the
caller is responsible for cleaning up the stack, and this aspect of
the calling conventions is necessitated by the combination of varargs
and the optionality of function declarations.
So when in a C program you have
...
return foo(...);
}
in assembly language this is:
call foo
add $96, %rsp ;or something like that
return
and tail-call elimination of that is not possible in general. gcc has >something called sib-call elimination ("sib" for sibling), and AFAIK
this includes tail-recursion elimination, but I expect that even that
does not work at all times, e.g., because a pointer is passed to the
called function, and the C compiler cannot prove that it does not
point to something that is on the stack frame of the caller.
- anton
On Tue, 09 Jan 2024 07:41:51 GMT, anton@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:
So when in a C program you have
...
return foo(...);
}
in assembly language this is:
call foo
add $96, %rsp ;or something like that
return
and tail-call elimination of that is not possible in general. gcc has
something called sib-call elimination ("sib" for sibling), and AFAIK
this includes tail-recursion elimination, but I expect that even that
does not work at all times, e.g., because a pointer is passed to the
called function, and the C compiler cannot prove that it does not
point to something that is on the stack frame of the caller.
- anton
You can tail-call and continuation-pass in C with the
Cheney-on-the-MTA technique.
https://plover.com/~mjd/misc/hbaker-archive/CheneyMTA.html
TLDR; the idea is to bookmark the stack with setjmp(), allow it to
grow until some high-water mark is passed, then longjmp() to release
it all at once.
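A minimal sketch of the bookmark-and-release part (mine; a real
Cheney-on-the-MTA system would also evacuate live data to the heap
before the longjmp, and would resume the computation rather than just
returning). Compile without optimization, or the compiler's own
sibling-call optimization may turn the recursion into a loop so the
stack never grows:

  #include <setjmp.h>
  #include <stdio.h>

  static jmp_buf bookmark;
  static char *stack_base;

  /* crude high-water check: distance from main()'s frame */
  static int past_high_water(void)
  {
      char here;
      long used = stack_base - &here;   /* stack grows down on most targets */
      if (used < 0) used = -used;
      return used > 64 * 1024;          /* arbitrary 64 KB water mark */
  }

  static void step(unsigned long n)
  {
      if (past_high_water()) {
          printf("high water at n = %lu, releasing stack\n", n);
          longjmp(bookmark, 1);         /* frees every frame at once */
      }
      step(n + 1);   /* a "tail call" that, in C, still pushes a frame */
  }

  int main(void)
  {
      char base;
      stack_base = &base;
      if (setjmp(bookmark))             /* the bookmark */
          printf("back at the bookmark\n");
      else
          step(0);
      return 0;
  }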