• Re: AMD CPU funny

    From Theo@21:1/5 to Vir Campestris on Wed Dec 20 18:08:47 2023
    XPost: uk.comp.homebuilt

    Vir Campestris <vir.campestris@invalid.invalid> wrote:
    This is not the right group for this - but I don't know where is.
    Suggestions on a postcard please...

    I'm crossposting this to comp.arch, where they may have some ideas.

    For reasons I won't go into I've been writing some code to evaluate
    memory performance on my AMD Ryzen 5 3400G.

    It says in the stuff I've found that each core has an 8-way set
    associative L1 data cache of 128k (and an L1 instruction cache); an L2
    cache of 512k, also set associative; and there's an L3 cache of 4MB.

    To measure the performance I have three nested loops.

    The innermost one goes around a loop incrementing a series of 64 bit
    memory locations. The length of the series is set by the outermost loop.

    The middle one repeats the innermost loop so that the number of memory accesses is constant regardless of the series length.

    The outermost one sets the series length. It starts at 1, and doubles it
    each time.

    I _thought_ what would happen is that as I increase the length of the
    series after a while the data won't fit in the cache, and I'll see a
    sudden slowdown.

    What I actually see is:

    With a series length of 56 to 128 bytes I get the highest speed.

    With a series length of 500B to 1.5MB, I get a consistent speed of about
    2/3 the highest speed.

Once the series length exceeds 1.5MB the speed drops, and is consistent
from then on. That, as far as I can tell, is main memory speed, and is
about 40% of the highest.

    OK so far.

    But...
Series length 8B is about the same as the 56 to 128 speed. Series length
16B is a bit less. Series length 32B is a lot less. Not as slow as main
memory, but not much more than half the peak speed. My next step up is
the peak speed. Series length 144 to 448 is slower still - slower in
fact than the main memory speed.

    WTF?

    I can post the code (C++, but not very complex) if that would help.

    For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
    bit word transferred?

    What instruction sequences are being generated for the 8/16/32/64 byte
    loops? I'm wondering if the compiler is using different instructions,
    eg using MMX, SSE, AVX to do the operations. Maybe they are having
    different caching behaviour?
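
One way to check, assuming the GNU toolchain (the file name cache.cpp
here is illustrative, not necessarily the OP's):

$ g++ -O2 -S -fverbose-asm cache.cpp -o cache.s
$ grep -cE 'xmm|ymm|zmm' cache.s

A non-zero count from the grep suggests SSE or AVX registers are in use.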

    It would help if you could tell us the compiler and platform you're using, including version.

    Theo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Theo on Wed Dec 20 18:58:35 2023
    XPost: uk.comp.homebuilt

    Theo wrote:

Vir Campestris <vir.campestris@invalid.invalid> wrote:
<snip>
I can post the code (C++, but not very complex) if that would help.
<snip>


    Can we see the code ??

    Can you present a table of the timing results ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Theo on Thu Dec 21 14:11:12 2023
    XPost: uk.comp.homebuilt

    On 20/12/2023 18:08, Theo wrote:
    Vir Campestris <vir.campestris@invalid.invalid> wrote:
    This is not the right group for this - but I don't know where is.
    Suggestions on a postcard please...

    I'm crossposting this to comp.arch, where they may have some ideas.

    <snip>

    For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
    bit word transferred?

    Yes. My system has a 64 bit CPU and 64 bit main memory.

    What instruction sequences are being generated for the 8/16/32/64 byte
    loops? I'm wondering if the compiler is using different instructions,
    eg using MMX, SSE, AVX to do the operations. Maybe they are having
    different caching behaviour?

It runs the same loop each time, but with different values for
the loop sizes.

    It would help if you could tell us the compiler and platform you're using, including version.


    g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    Copyright (C) 2021 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    Which of course tells you I'm running Ubuntu!

    On 20/12/2023 18:58, MitchAlsup wrote:

    Can we see the code ??

    Can you present a table of the timing results ??

    I've run this with more detailed increments on the line size, but here
    are my results for powers of 2.

    Size 1 gave 3.82242e+09 bytes/second.
    Size 2 gave 3.80533e+09 bytes/second.
    Size 4 gave 2.68017e+09 bytes/second.
    Size 8 gave 2.33751e+09 bytes/second.
    Size 16 gave 2.18424e+09 bytes/second.
    Size 32 gave 2.10243e+09 bytes/second.
    Size 64 gave 1.99371e+09 bytes/second.
    Size 128 gave 1.98475e+09 bytes/second.
    Size 256 gave 2.01653e+09 bytes/second.
    Size 512 gave 2.00884e+09 bytes/second.
    Size 1024 gave 2.02713e+09 bytes/second.
    Size 2048 gave 2.01803e+09 bytes/second.
    Size 4096 gave 3.26472e+09 bytes/second.
    Size 8192 gave 3.85126e+09 bytes/second.
    Size 16384 gave 3.85377e+09 bytes/second.
    Size 32768 gave 3.85293e+09 bytes/second.
    Size 65536 gave 2.06793e+09 bytes/second.
    Size 131072 gave 2.06845e+09 bytes/second.


The run continues beyond this, but the results are roughly stable for larger sizes.

    The code I have put in a signature block; there's no point in risking
    someone posting it again. I've commented it, but no doubt not in all the
    right places! I'd be interested to know what results other people get.

    Thanks
    Andy
    --
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    // If your computer is much slower or faster than mine
    // you may need to adjust this value.
    constexpr size_t NextCount = 1 << 28;

    std::vector<uint64_t> CacheStore(NextCount, 0);

    // Get a raw pointer to the vector.
    // On my machine (Ubuntu, g++) this improves
    // performance. Using vector's operator[]
    // results in a function call.
    uint64_t *CachePtr = &CacheStore[0];

    // SetSize is the count of the uint64_t items to be tested.
    // I assume that when this is too big for a cache the data
    // will overflow to the next level.
    // Each loop it doubles in size. I've run it with smaller
    // increments too, and the behaviour is still confusing.
    for (size_t SetSize = 1; SetSize < NextCount; SetSize <<= 1)
    {
        size_t loopcount = 0;
        size_t j = NextCount / SetSize;
        auto start = std::chrono::steady_clock::now();

        // The outer loop repeats enough times so that the data
        // written by the inner loops of various sizes is
        // approximately constant.
        for (size_t k = 0; k < j; ++k)
        {
            // The inner loop modifies data within a set of words.
            for (size_t l = 0; l < SetSize; ++l)
            {
                // Read-modify-write some data. Comment this out
                // to confirm that the looping is not the cause
                // of the anomaly.
                ++CachePtr[l];

                // This counts the actual number of memory accesses.
                // Rounding errors mean that for different SetSize
                // values the count is not completely consistent.
                ++loopcount;
            }
        }

        // Work out how long the loops took in microseconds,
        // then scale to seconds.
        auto delta =
            std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - start).count() / 1e6;

        // Calculate how many bytes per second, and print.
        std::cout << "Size " << SetSize << " gave "
                  << (double)loopcount * (double)sizeof(uint64_t) / delta
                  << " bytes/second." << std::endl;
    }

    return 0;
}
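
Something like the following should build and run it (the flags are a
plausible choice, not necessarily what was used for the numbers above):

$ g++ -O3 -o cache cache.cpp
$ ./cache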

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Vir Campestris on Thu Dec 21 14:56:44 2023
    XPost: uk.comp.homebuilt

    Vir Campestris <vir.campestris@invalid.invalid> writes:
On 20/12/2023 18:08, Theo wrote:
<snip>
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?

    Yes. My system has a 64 bit CPU and 64 bit main memory.

    Surely your main memory is just a sequence of 8-bit bytes (+ECC).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Scott Lurndal on Thu Dec 21 15:32:10 2023
    XPost: uk.comp.homebuilt

    On 21/12/2023 14:56, Scott Lurndal wrote:
Vir Campestris <vir.campestris@invalid.invalid> writes:
<snip>
Yes. My system has a 64 bit CPU and 64 bit main memory.

    Surely your main memory is just a sequence of 8-bit bytes (+ECC).


    AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
    hardware these days, but it used to be if you wanted to write a byte
    into your 16-bit memory you had to read both bytes, then write one back.

    And

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Vir Campestris on Thu Dec 21 18:09:33 2023
    XPost: uk.comp.homebuilt

    On Thu, 21 Dec 2023 15:32:10 +0000
    Vir Campestris <vir.campestris@invalid.invalid> wrote:

On 21/12/2023 14:56, Scott Lurndal wrote:
<snip>
Surely your main memory is just a sequence of 8-bit bytes (+ECC).

AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one
back.

DIMMs have 64-bit data buses, both these days and in the previous
millennium. These days a DDR5 DIMM splits its 64-bit data bus into a
pair of independent 32-bit channels, but the total is still 64 bits.

That's the physical perspective. From a logical perspective, DDR3 and
DDR4 DIMMs exchange data with the controller in 512-bit chunks. On a
DDR5 DIMM each channel exchanges data with the controller in 512-bit
chunks.

From a signaling perspective, it is still possible (at least on non-ECC
gear) to tell the memory to write just 8 bits out of 512 and to ignore
the rest. In PC-class hardware this ability is used very rarely, if at
all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Burns@21:1/5 to Vir Campestris on Thu Dec 21 18:05:20 2023
    XPost: uk.comp.homebuilt

    Vir Campestris wrote:

    Scott Lurndal wrote:

    Surely your main memory is just a sequence of 8-bit bytes (+ECC).

    AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
    hardware these days, but it used to be if you wanted to write a byte
    into your 16-bit memory you had to read both bytes, then write one back.

I thought Intel x64 machines work in cache lines of 64 bytes per memory transaction ... maybe AMD processors are different?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Vir Campestris on Thu Dec 21 18:30:47 2023
    XPost: uk.comp.homebuilt

    Vir Campestris wrote:

    On 20/12/2023 18:58, MitchAlsup wrote:

    Can we see the code ??

    Can you present a table of the timing results ??

    I've run this with more detailed increments on the line size, but here
    are my results for powers of 2.

    { A
    Size 1 gave 3.82242e+09 bytes/second.
    Size 2 gave 3.80533e+09 bytes/second.
    Size 4 gave 2.68017e+09 bytes/second.
    Size 8 gave 2.33751e+09 bytes/second.
    Size 16 gave 2.18424e+09 bytes/second.
    Size 32 gave 2.10243e+09 bytes/second.
    Size 64 gave 1.99371e+09 bytes/second.
    Size 128 gave 1.98475e+09 bytes/second.
    Size 256 gave 2.01653e+09 bytes/second.
    Size 512 gave 2.00884e+09 bytes/second.
    Size 1024 gave 2.02713e+09 bytes/second.
    Size 2048 gave 2.01803e+09 bytes/second.
    } A
    { B
    Size 4096 gave 3.26472e+09 bytes/second.
    Size 8192 gave 3.85126e+09 bytes/second.
    Size 16384 gave 3.85377e+09 bytes/second.
    Size 32768 gave 3.85293e+09 bytes/second.
    Size 65536 gave 2.06793e+09 bytes/second.
    Size 131072 gave 2.06845e+09 bytes/second.
    } B

A) Here we have a classical sequence pipelines often encounter, where
a simple loop starts off fast (4 bytes/cycle) and progressively slows
down by a factor of 2 (2 bytes/cycle) over some interval of
complexity (size). What is important is that factor of 2: something
that took 1 cycle early on starts taking 2 cycles later.

B) Here we have a second classical sequence where the performance at
some boundary (4096) reverts back to the 1-cycle pipeline performance,
only to degrade again (in basically the same sequence as A). {{Side
note: size=4096 has "flutter" in the stairstep. size={8192..32768}
has peak performance--probably something to do with sets in the cache,
and 4096 is the size of a page (TLB effects)}}.
I suspect the flutter has something to do with your buffer crossing
a page boundary.
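
One way to test that suspicion is to force the buffer onto a page
boundary. A minimal sketch, assuming C++17 aligned new and 4 KiB pages
(the names mirror the benchmark code; this is illustrative, not the
OP's code):

#include <cstddef>
#include <cstdint>
#include <new>

int main()
{
    constexpr std::size_t NextCount = 1 << 28;

    // Page-align the buffer so a small set never straddles a 4 KiB page.
    // The trailing () value-initialises, matching the vector's zero fill.
    std::uint64_t *CachePtr =
        new (std::align_val_t{4096}) std::uint64_t[NextCount]();

    // ... run the same three timing loops as in the original program ...

    // An aligned allocation needs the matching aligned deallocation.
    ::operator delete[](CachePtr, std::align_val_t{4096});
    return 0;
}

(Compile with -std=c++17 or later.)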



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Michael S on Thu Dec 21 18:36:42 2023
    XPost: uk.comp.homebuilt

    Michael S wrote:

<snip>

DIMMs have 64-bit data buses, both these days and in the previous
millennium. These days a DDR5 DIMM splits its 64-bit data bus into a
pair of independent 32-bit channels, but the total is still 64 bits.

    All DDR DIMMs have 64-bit busses (200 pins on the DIMM).
    Some SDR (pre 2000) DIMMs had 32-bit busses, some ancient laptop memory carriers had 32-bit busses.

That's the physical perspective. From a logical perspective, DDR3 and
DDR4 DIMMs exchange data with the controller in 512-bit chunks. On a
DDR5 DIMM each channel exchanges data with the controller in 512-bit
chunks.

Note:: 512 bits is 64 bytes.

From a signaling perspective, it is still possible (at least on non-ECC
gear) to tell the memory to write just 8 bits out of 512 and to ignore
the rest. In PC-class hardware this ability is used very rarely, if at
all.

    Never justified when CPU uses a cache. So, only unCacheable (sized)
    requests can use this. AMD memory controllers hid this from the DRAMs
    so we at least had the opportunity to recompute ECC on ECC carrying
    DIMMs. MC would ask DRC for the line of data, check (and repair) ECC
    then write the unCacheable data, and place the written data in the
    outbound memory queue with its new ECC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to MitchAlsup on Fri Dec 22 00:55:07 2023
    XPost: uk.comp.homebuilt

    On Thu, 21 Dec 2023 18:36:42 +0000
    mitchalsup@aol.com (MitchAlsup) wrote:

    Michael S wrote:

<snip>

    All DDR DIMMs have 64-bit busses (200 pins on the DIMM).
    Some SDR (pre 2000) DIMMs had 32-bit busses,

    Are you sure? My impression always was that the word DIMM was invented
    to clearly separate 64-bit DIMMs from 32-bit SIMMs.

    some ancient laptop
    memory carriers had 32-bit busses.

But were they called just DIMM or SO-DIMM?
The latter indeed had a 72-pin variant with a 32-bit bus.



<snip>


Note:: 512 bits is 64 bytes.


Which *not* coincidentally matches the cache line size of the majority
of x86 CPUs.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Michael S on Thu Dec 21 23:49:20 2023
    XPost: uk.comp.homebuilt

    Michael S wrote:

    On Thu, 21 Dec 2023 18:36:42 +0000
    mitchalsup@aol.com (MitchAlsup) wrote:

<snip>

    All DDR DIMMs have 64-bit busses (200 pins on the DIMM).
    Some SDR (pre 2000) DIMMs had 32-bit busses,

    Are you sure? My impression always was that the word DIMM was invented
    to clearly separate 64-bit DIMMs from 32-bit SIMMs.

Dual In-Line Memory Module means they have pins on both sides of the
board where the pins make contact with the socket. They put pins on both
sides so they could get ~200 pins {Vdd, Gnd, clock, reset, control,
and data}; this was in the early 1990s.

    some ancient laptop
    memory carriers had 32-bit busses.

    But were they called just DIMM or SO-DIMM?
    The later indeed had 72-pin variant with 32-bit bus.

    That sounds right.


<snip>

Note:: 512 bits is 64 bytes.


Which *not* coincidentally matches the cache line size of the majority
of x86 CPUs.

Given that x86, at the time the Std committee was doing the first one,
represented 90% of all computers being sold, and that the people on the
committee wanted to keep it that way, this is unsurprising.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to MitchAlsup on Fri Dec 22 13:42:39 2023
    XPost: uk.comp.homebuilt

    On 21/12/2023 18:30, MitchAlsup wrote:
<snip>
    I suspect the flutter has something to do with your buffer crossing a
    page boundary.

    I ran it with the memory access commented out. When I do that the
    overall time is consistent regardless of how many times it goes around
    the inner loop, and how many times the outer one.

    But pipelines are funny things.

    64k x 64 bit words is I think my L3 cache size. It's not very big on my
    CPU, they had to fit a GPU on the die.
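
On Linux the cache sizes can be checked directly, e.g. with either of:

$ lscpu | grep -i cache
$ getconf -a | grep -i CACHE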

    I'd be interested to see the results from anyone else's computer.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Theo@21:1/5 to Vir Campestris on Fri Dec 22 18:12:36 2023
    XPost: uk.comp.homebuilt

    In comp.arch Vir Campestris <vir.campestris@invalid.invalid> wrote:
    I'd be interested to see the results from anyone else's computer.

    $ neofetch --off
    OS: Kubuntu 23.10 x86_64
    Host: 20QBS03N00 ThinkPad X1 Titanium Gen 1
    Kernel: 6.5.0-14-generic
    CPU: 11th Gen Intel i5-1130G7 (8) @ 4.000GHz
    GPU: Intel Tiger Lake-UP4 GT2 [Iris Xe Graphics]
    Memory: 4436MiB / 15704MiB
    $ gcc --version
    gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)

    $ g++ -o cache cache.c
    $ ./cache
    Size 1 gave 4.13643e+09 bytes/second.
    Size 2 gave 4.79971e+09 bytes/second.
    Size 4 gave 4.87535e+09 bytes/second.
    Size 8 gave 4.8321e+09 bytes/second.
    Size 16 gave 4.71703e+09 bytes/second.
    Size 32 gave 3.89488e+09 bytes/second.
    Size 64 gave 4.02976e+09 bytes/second.
    Size 128 gave 4.15832e+09 bytes/second.
    Size 256 gave 4.19562e+09 bytes/second.
    Size 512 gave 4.08511e+09 bytes/second.
    Size 1024 gave 4.0796e+09 bytes/second.
    Size 2048 gave 4.11983e+09 bytes/second.
    Size 4096 gave 4.06869e+09 bytes/second.
    Size 8192 gave 4.06807e+09 bytes/second.
    Size 16384 gave 4.06217e+09 bytes/second.
    Size 32768 gave 4.06067e+09 bytes/second.
    Size 65536 gave 4.04791e+09 bytes/second.
    Size 131072 gave 4.06143e+09 bytes/second.
    Size 262144 gave 4.04301e+09 bytes/second.
    Size 524288 gave 4.03872e+09 bytes/second.
    Size 1048576 gave 3.97715e+09 bytes/second.
    Size 2097152 gave 3.97609e+09 bytes/second.
    Size 4194304 gave 3.98361e+09 bytes/second.
    Size 8388608 gave 3.98617e+09 bytes/second.
    Size 16777216 gave 3.98376e+09 bytes/second.
    Size 33554432 gave 3.98504e+09 bytes/second.
    Size 67108864 gave 3.98726e+09 bytes/second.
    Size 134217728 gave 3.99495e+09 bytes/second.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Theo on Fri Dec 22 19:35:37 2023
    XPost: uk.comp.homebuilt

    Theo wrote:
    $ g++ -o cache cache.c
    $ ./cache
    { A
    Size 1 gave 4.13643e+09 bytes/second.
    Size 2 gave 4.79971e+09 bytes/second.
    Size 4 gave 4.87535e+09 bytes/second.
    Size 8 gave 4.8321e+09 bytes/second.
    Size 16 gave 4.71703e+09 bytes/second.
    } A
    { B
    Size 32 gave 3.89488e+09 bytes/second.
    Size 64 gave 4.02976e+09 bytes/second.
    Size 128 gave 4.15832e+09 bytes/second.
    Size 256 gave 4.19562e+09 bytes/second.
    Size 512 gave 4.08511e+09 bytes/second.
    Size 1024 gave 4.0796e+09 bytes/second.
    Size 2048 gave 4.11983e+09 bytes/second.
    Size 4096 gave 4.06869e+09 bytes/second.
    Size 8192 gave 4.06807e+09 bytes/second.
    Size 16384 gave 4.06217e+09 bytes/second.
    Size 32768 gave 4.06067e+09 bytes/second.
    Size 65536 gave 4.04791e+09 bytes/second.
    Size 131072 gave 4.06143e+09 bytes/second.
    Size 262144 gave 4.04301e+09 bytes/second.
    Size 524288 gave 4.03872e+09 bytes/second.
    Size 1048576 gave 3.97715e+09 bytes/second.
    Size 2097152 gave 3.97609e+09 bytes/second.
    Size 4194304 gave 3.98361e+09 bytes/second.
    Size 8388608 gave 3.98617e+09 bytes/second.
    Size 16777216 gave 3.98376e+09 bytes/second.
    Size 33554432 gave 3.98504e+09 bytes/second.
    Size 67108864 gave 3.98726e+09 bytes/second.
    Size 134217728 gave 3.99495e+09 bytes/second.
    } B

    The B group are essentially 4.0 with noise.
    The A group are essentially 4.8 with a startup flutter.
    A 20% down or 25% up change at 32::64.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Theo@21:1/5 to MitchAlsup on Fri Dec 22 21:58:13 2023
    XPost: uk.comp.homebuilt

    In comp.arch MitchAlsup <mitchalsup@aol.com> wrote:
    Theo wrote:
    $ g++ -o cache cache.c
    $ ./cache
    { A
    Size 1 gave 4.13643e+09 bytes/second.
    Size 2 gave 4.79971e+09 bytes/second.
    Size 4 gave 4.87535e+09 bytes/second.
    Size 8 gave 4.8321e+09 bytes/second.
    Size 16 gave 4.71703e+09 bytes/second.
    } A
    { B
    Size 32 gave 3.89488e+09 bytes/second.
    ...
    Size 134217728 gave 3.99495e+09 bytes/second.
    } B

    The B group are essentially 4.0 with noise.
    The A group are essentially 4.8 with a startup flutter.
    A 20% down or 25% up change at 32::64.

    The nearest machine I have to the OP is this:

    OS: Ubuntu 20.04.6 LTS x86_64
    Host: 1U4LW-X570/2L2T
    Kernel: 5.4.0-167-generic
    CPU: AMD Ryzen 7 5800X (16) @ 3.800GHz
    GPU: 29:00.0 ASPEED Technology, Inc. ASPEED Graphics Family
    Memory: 3332MiB / 128805MiB

    gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

    Size 1 gave 6.22043e+09 bytes/second.
    Size 2 gave 6.35674e+09 bytes/second.
    Size 4 gave 4.14766e+09 bytes/second.
    Size 8 gave 5.4239e+09 bytes/second.
    Size 16 gave 6.01113e+09 bytes/second.
    Size 32 gave 7.75976e+09 bytes/second.
    Size 64 gave 8.7972e+09 bytes/second.
    Size 128 gave 9.71523e+09 bytes/second.
    Size 256 gave 9.91644e+09 bytes/second.
    Size 512 gave 1.00179e+10 bytes/second.
    Size 1024 gave 1.0065e+10 bytes/second.
    Size 2048 gave 9.78508e+09 bytes/second.
    Size 4096 gave 9.76764e+09 bytes/second.
    Size 8192 gave 9.86537e+09 bytes/second.
    Size 16384 gave 9.90053e+09 bytes/second.
    Size 32768 gave 9.91552e+09 bytes/second.
    Size 65536 gave 9.84556e+09 bytes/second.
    Size 131072 gave 9.78442e+09 bytes/second.
    Size 262144 gave 9.80282e+09 bytes/second.
    Size 524288 gave 9.81447e+09 bytes/second.
    Size 1048576 gave 9.81981e+09 bytes/second.
    Size 2097152 gave 9.81456e+09 bytes/second.
    Size 4194304 gave 9.70057e+09 bytes/second.
    Size 8388608 gave 9.55507e+09 bytes/second.
    Size 16777216 gave 9.44032e+09 bytes/second.
    Size 33554432 gave 9.33896e+09 bytes/second.
    Size 67108864 gave 9.28529e+09 bytes/second.
    Size 134217728 gave 9.25213e+09 bytes/second.

which seems to have more flutter in group A. I get similar results in a VM running on an AMD Ryzen 9 5950X (32) @ 3.393GHz.

    Theo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Theo on Fri Dec 22 23:24:29 2023
    XPost: uk.comp.homebuilt

    Theo wrote:

    In comp.arch MitchAlsup <mitchalsup@aol.com> wrote:


    gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

    { A
    Size 1 gave 6.22043e+09 bytes/second.
    Size 2 gave 6.35674e+09 bytes/second.
    Size 4 gave 4.14766e+09 bytes/second.
    Size 8 gave 5.4239e+09 bytes/second.
    Size 16 gave 6.01113e+09 bytes/second.
    Size 32 gave 7.75976e+09 bytes/second.
    Size 64 gave 8.7972e+09 bytes/second.
    } A
    { B
    Size 128 gave 9.71523e+09 bytes/second.
    Size 256 gave 9.91644e+09 bytes/second.
    Size 512 gave 1.00179e+10 bytes/second.
    Size 1024 gave 1.0065e+10 bytes/second.
    Size 2048 gave 9.78508e+09 bytes/second.
    Size 4096 gave 9.76764e+09 bytes/second.
    Size 8192 gave 9.86537e+09 bytes/second.
    Size 16384 gave 9.90053e+09 bytes/second.
    Size 32768 gave 9.91552e+09 bytes/second.
    Size 65536 gave 9.84556e+09 bytes/second.
    Size 131072 gave 9.78442e+09 bytes/second.
    Size 262144 gave 9.80282e+09 bytes/second.
    Size 524288 gave 9.81447e+09 bytes/second.
    Size 1048576 gave 9.81981e+09 bytes/second.
    Size 2097152 gave 9.81456e+09 bytes/second.
    } B
    { C
    Size 4194304 gave 9.70057e+09 bytes/second.
    Size 8388608 gave 9.55507e+09 bytes/second.
    Size 16777216 gave 9.44032e+09 bytes/second.
    Size 33554432 gave 9.33896e+09 bytes/second.
    Size 67108864 gave 9.28529e+09 bytes/second.
    Size 134217728 gave 9.25213e+09 bytes/second.
    } C

which seems to have more flutter in group A. I get similar results in a VM running on an AMD Ryzen 9 5950X (32) @ 3.393GHz.

A has some kind of startup irregularity: down then up.
B is essentially constant.
C is slowly degrading (maybe from TLB effects).


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Theo on Sat Dec 23 21:34:14 2023
    XPost: uk.comp.homebuilt

    On 22/12/2023 21:58, Theo wrote:
    <snip>

    which seems to have more flutter in group A. I get similar results
    in a VM
    running on a AMD Ryzen 9 5950X (32) @ 3.393GHz.

    Theo
    Thank you for those. Neither of them show the odd thing I was seeing -
    but I compiled with -ofast. Does that make a difference?

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Vir Campestris on Sat Dec 23 22:50:10 2023
    XPost: uk.comp.homebuilt

    Vir Campestris <vir.campestris@invalid.invalid> schrieb:
    On 22/12/2023 21:58, Theo wrote:
    <snip>

    which seems to have more flutter in group A. I get similar results
    in a VM
    running on a AMD Ryzen 9 5950X (32) @ 3.393GHz.

    Theo
    Thank you for those. Neither of them show the odd thing I was seeing -
    but I compiled with -ofast. Does that make a difference?

    That would put it in an executable named "fast" :-)

    -march=native -mtune=native might make more of a difference, depending
    on how good the compiler's model of your hardware is.
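
For example (with cache.cpp standing in for whatever the source file is
called):

$ g++ -O3 -march=native -mtune=native -o cache cache.cpp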

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Mon Dec 25 15:56:19 2023
    XPost: uk.comp.homebuilt

    On 23/12/2023 23:50, Thomas Koenig wrote:
    Vir Campestris <vir.campestris@invalid.invalid> schrieb:
    On 22/12/2023 21:58, Theo wrote:
    <snip>

    which seems to have more flutter in group A. I get similar results
    in a VM
    running on a AMD Ryzen 9 5950X (32) @ 3.393GHz.

    Theo
    Thank you for those. Neither of them show the odd thing I was seeing -
    but I compiled with -ofast. Does that make a difference?

    That would put it in an executable named "fast" :-)

    Some programs are really fussy about capitalisation!


    -march=native -mtune=native might make more of a difference, depending
    on how good the compiler's model of your hardware is.

    For gcc on modern x86-64 processors (IME at least), the difference
    between -O0 and -O1 is often large, but the difference between -O1 and
    higher levels usually makes far less difference. The processors
    themselves do a good job of things like instruction scheduling and
    register renaming at runtime, and are designed to be good for running
    weakly optimised code. I find it makes a bigger difference on
    processors that don't do as much at run-time, such as microcontroller cores.

    But the "-march=native" can make a very big difference, especially if it
    means the compiler can use SIMD or other advanced instructions. (You
    don't need "-mtune" if you have "march", as it is implied by "-march" -
    you only need both if you want to make a build that will run on many
    x86-64 variants but is optimised for one particular one.) And the "-march=native" benefits go well with some of the higher optimisations -
    thus it is good to combine "-march=native" with "-Ofast" or "-O2".

    In practice, things also vary dramatically according to the type of program.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to David Brown on Wed Dec 27 14:20:08 2023
    XPost: uk.comp.homebuilt

    On 25/12/2023 14:56, David Brown wrote:
On 23/12/2023 23:50, Thomas Koenig wrote:
Vir Campestris <vir.campestris@invalid.invalid> schrieb:
<snip>
but I compiled with -ofast. Does that make a difference?

That would put it in an executable named "fast" :-)

    YKWIM :P

    Some programs are really fussy about capitalisation!


    -march=native -mtune=native might make more of a difference, depending
    on how good the compiler's model of your hardware is.

<snip>


    mtune=native _does_ make a difference. Somewhat to my surprise it makes
    it _slower_ - the biggest difference being about 5% with a size of 128k
    UINT64.

I checked O0 (really slow), O1 (a lot faster), O3 (quite a bit faster
too), Ofast (mostly slightly faster than O3, when I use a capital O) and
O3 with march=native.

    The pipeline thing made me try something different - instead of
    incrementing a value I used std::copy to copy each word to the next one.
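
A sketch of the kind of inner loop meant here (a reconstruction from the
description, not the actual code; note that std::copy on overlapping
ranges with the destination ahead of the source is undefined behaviour,
so std::copy_backward is the safe spelling):

    // Drop-in replacement for the benchmark's inner loop; add
    // #include <algorithm> at the top of the file.
    for (size_t k = 0; k < j; ++k)
    {
        // Copy each word onto its neighbour instead of incrementing.
        std::copy_backward(CachePtr, CachePtr + SetSize - 1,
                           CachePtr + SetSize);
        loopcount += SetSize - 1;
    }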

That rose to a peak rate of about 64MB/s with a size of 32k, dropped
sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k. It
then dropped sharply to 10MB/s for all larger sizes.

    A much more sensible result!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Vir Campestris on Wed Dec 27 16:02:11 2023
    XPost: uk.comp.homebuilt

    On 27/12/2023 15:20, Vir Campestris wrote:
On 25/12/2023 14:56, David Brown wrote:
<snip>


    mtune=native _does_ make a difference. Somewhat to my surprise it makes
    it _slower_ - the biggest difference being about 5% with a size of 128k UINT64.

I checked O0 (really slow), O1 (a lot faster), O3 (quite a bit faster too), Ofast (mostly slightly faster than O3, when I use a capital O) and O3 with march=native.


    "-Ofast" is like -O3, but additionally allows the compiler to "bend" the
    rules somewhat. For example, it allows optimisations of stores that
    might cause data races if another thread can see the same data, and it
    enables "-ffast-math" which treats floating point as somewhat
    approximate and always finite, rather than following IEEE rules
    strictly. If you are okay with this, then "-Ofast" is fine, but you
    should read the documentation to check that you are happy with it.

    While you are there, you can see a number of other optimisation flags
    that are not enabled by any -O sets. These may or may not help,
    depending on your source code.
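
To see exactly which optimisation flags "-Ofast" (or any other -O level)
enables, gcc can report it, e.g.:

$ g++ -Q -Ofast --help=optimizers | grep enabled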


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Vir Campestris on Wed Dec 27 18:38:42 2023
    XPost: uk.comp.homebuilt

    On Wed, 27 Dec 2023 14:20:08 +0000
    Vir Campestris <vir.campestris@invalid.invalid> wrote:

<snip>

    The pipeline thing made me try something different - instead of
    incrementing a value I used std::copy to copy each word to the next
    one.

That rose to a peak rate of about 64MB/s with a size of 32k, dropped
sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k.
It then dropped sharply to 10MB/s for all larger sizes.

    A much more sensible result!

    GB rather than MB, hopefully


    Andy

Can you tell us what you are trying to achieve?
Is it a microbenchmark intended to improve your understanding of Zen1
internals? Or part of the code that you want to run as fast as possible?
If the latter, what exactly do you want the code to do?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Michael S on Fri Dec 29 17:54:53 2023
    XPost: uk.comp.homebuilt

    On 27/12/2023 16:38, Michael S wrote:
<snip>

Can you tell us what you are trying to achieve?
Is it a microbenchmark intended to improve your understanding of Zen1
internals? Or part of the code that you want to run as fast as possible?
If the latter, what exactly do you want the code to do?


    It is a microbenchmark.

    I was trying to understand cache performance on my system.

Over in comp.lang.c++ you'll find a thread about the Sieve of Eratosthenes.

    I and another poster are trying to optimise it. Not for any good reason
    of course... but I was just curious about some of the results I was getting.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Burns@21:1/5 to Vir Campestris on Fri Dec 29 18:19:09 2023
    XPost: uk.comp.homebuilt

    Vir Campestris wrote:

    It is a microbenchmark.
    I was trying to understand cache performance on my system.
Over in comp.lang.c++ you'll find a thread about the Sieve of Eratosthenes.

    I and another poster are trying to optimise it. Not for any good reason
    of course... but I was just curious about some of the results I was
    getting.

A few years ago this guy popped up on youtube, saying "I was the author
of Task Manager on Windows NT". I ignored the recommendation for quite a
while, thinking he just wanted people to blow smoke up his arse;
eventually I watched, and he seems a decent guy ... anyway, he did a
drag racing series using the sieve prime algorithm

    <https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>

    Others have contributed in all manner of languages

    <https://github.com/PlummersSoftwareLLC/Primes>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Burns@21:1/5 to Vir Campestris on Fri Dec 29 21:30:45 2023
    XPost: uk.comp.homebuilt

    Vir Campestris wrote:

    The code I have put in a signature block; there's no point in risking
    someone posting it again. I've commented it, but no doubt not in all the right places! I'd be interested to know what results other people get.

    Built and run under Win11 with g++ 13.2.0
    on a laptop with
    i7-11370H CPU (3.3GHz base, 4.28GHz turbo)
    L1 d-cache 48kB per core
    L2 cache 1.2MB
    L3 cache 12MB
    16GB memory

    Size 1 gave 5.12047e+09 bytes/second.
    Size 2 gave 5.76467e+09 bytes/second.
    Size 4 gave 5.78979e+09 bytes/second.
    Size 8 gave 5.6642e+09 bytes/second.
    Size 16 gave 5.55303e+09 bytes/second.
    Size 32 gave 4.4151e+09 bytes/second.
    Size 64 gave 4.67783e+09 bytes/second.
    Size 128 gave 4.85115e+09 bytes/second.
    Size 256 gave 4.88147e+09 bytes/second.
    Size 512 gave 4.5214e+09 bytes/second.
    Size 1024 gave 4.64794e+09 bytes/second.
    Size 2048 gave 4.69149e+09 bytes/second.
    Size 4096 gave 4.69416e+09 bytes/second.
    Size 8192 gave 4.73425e+09 bytes/second.
    Size 16384 gave 4.76353e+09 bytes/second.
    Size 32768 gave 4.70864e+09 bytes/second.
    Size 65536 gave 4.75244e+09 bytes/second.
    Size 131072 gave 4.76452e+09 bytes/second.
    Size 262144 gave 4.69839e+09 bytes/second.
    Size 524288 gave 4.71094e+09 bytes/second.
    Size 1048576 gave 4.62922e+09 bytes/second.
    Size 2097152 gave 4.55435e+09 bytes/second.
    Size 4194304 gave 4.58315e+09 bytes/second.
    Size 8388608 gave 4.55917e+09 bytes/second.
    Size 16777216 gave 4.60661e+09 bytes/second.
    Size 33554432 gave 4.63509e+09 bytes/second.
    Size 67108864 gave 4.62349e+09 bytes/second.
    Size 134217728 gave 4.62406e+09 bytes/second.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Burns@21:1/5 to Andy Burns on Fri Dec 29 22:04:09 2023
    XPost: uk.comp.homebuilt

    Andy Burns wrote:

<snip>

    and (perversely?) with -Ofast

    Size 1 gave 4.64966e+09 bytes/second.
    Size 2 gave 9.39481e+09 bytes/second.
    Size 4 gave 1.59656e+10 bytes/second.
    Size 8 gave 2.6789e+10 bytes/second.
    Size 16 gave 4.42999e+10 bytes/second.
    Size 32 gave 4.51076e+10 bytes/second.
    Size 64 gave 4.90103e+10 bytes/second.
    Size 128 gave 4.08414e+10 bytes/second.
    Size 256 gave 4.36978e+10 bytes/second.
    Size 512 gave 4.54119e+10 bytes/second.
    Size 1024 gave 4.50594e+10 bytes/second.
    Size 2048 gave 4.71622e+10 bytes/second.
    Size 4096 gave 4.60261e+10 bytes/second.
    Size 8192 gave 3.84764e+10 bytes/second.
    Size 16384 gave 3.90878e+10 bytes/second.
    Size 32768 gave 3.76718e+10 bytes/second.
    Size 65536 gave 3.87339e+10 bytes/second.
    Size 131072 gave 3.89058e+10 bytes/second.
    Size 262144 gave 3.85662e+10 bytes/second.
    Size 524288 gave 4.19209e+10 bytes/second.
    Size 1048576 gave 3.00962e+10 bytes/second.
    Size 2097152 gave 1.37616e+10 bytes/second.
    Size 4194304 gave 1.4136e+10 bytes/second.
    Size 8388608 gave 1.39413e+10 bytes/second.
    Size 16777216 gave 1.35747e+10 bytes/second.
    Size 33554432 gave 1.37286e+10 bytes/second.
    Size 67108864 gave 1.38163e+10 bytes/second.
    Size 134217728 gave 1.29987e+10 bytes/second.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Andy Burns on Fri Dec 29 22:02:36 2023
    XPost: uk.comp.homebuilt

    Andy Burns wrote:

<snip>

    Built and run under Win11 with g++ 13.2.0
    on a laptop with
    i7-11370H CPU (3.3GHz base, 4.28GHz turbo)
    L1 d-cache 48kB per core
    L2 cache 1.2MB
    L3 cache 12MB
    16GB memory

<snip>


This data set has a 5:4 ratio from best to worst, with noise.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Burns@21:1/5 to Andy Burns on Fri Dec 29 22:07:25 2023
    XPost: uk.comp.homebuilt

    Andy Burns wrote:

    and (perversely?) with -Ofast

    I must look at the exponent, not just the mantissa
    I must look at the exponent, not just the mantissa
    I must look at the exponent, not just the mantissa

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Andy Burns on Fri Dec 29 23:59:44 2023
    XPost: uk.comp.homebuilt

    Andy Burns wrote:

    Andy Burns wrote:

    Size 1 gave 5.12047e+09 bytes/second.
    Size 2 gave 5.76467e+09 bytes/second.
    Size 4 gave 5.78979e+09 bytes/second.
    Size 8 gave 5.6642e+09 bytes/second.
    Size 16 gave 5.55303e+09 bytes/second.
    Size 32 gave 4.4151e+09 bytes/second.
    Size 64 gave 4.67783e+09 bytes/second.
    Size 128 gave 4.85115e+09 bytes/second.
    Size 256 gave 4.88147e+09 bytes/second.
    Size 512 gave 4.5214e+09 bytes/second.
    Size 1024 gave 4.64794e+09 bytes/second.
    Size 2048 gave 4.69149e+09 bytes/second.
    Size 4096 gave 4.69416e+09 bytes/second.
    Size 8192 gave 4.73425e+09 bytes/second.
    Size 16384 gave 4.76353e+09 bytes/second.
    Size 32768 gave 4.70864e+09 bytes/second.
    Size 65536 gave 4.75244e+09 bytes/second.
    Size 131072 gave 4.76452e+09 bytes/second.
    Size 262144 gave 4.69839e+09 bytes/second.
    Size 524288 gave 4.71094e+09 bytes/second.
    Size 1048576 gave 4.62922e+09 bytes/second.
    Size 2097152 gave 4.55435e+09 bytes/second.
    Size 4194304 gave 4.58315e+09 bytes/second.
    Size 8388608 gave 4.55917e+09 bytes/second.
    Size 16777216 gave 4.60661e+09 bytes/second.
    Size 33554432 gave 4.63509e+09 bytes/second.
    Size 67108864 gave 4.62349e+09 bytes/second.
    Size 134217728 gave 4.62406e+09 bytes/second.

    and (perversely?) with -Ofast
    { A
    Size 1 gave 4.64966e+09 bytes/second.
    Size 2 gave 9.39481e+09 bytes/second.
    Size 4 gave 1.59656e+10 bytes/second.
    Size 8 gave 2.6789e+10 bytes/second.
    Size 16 gave 4.42999e+10 bytes/second.
    } A
    { B
    Size 32 gave 4.51076e+10 bytes/second.
    Size 64 gave 4.90103e+10 bytes/second.
    Size 128 gave 4.08414e+10 bytes/second.
    Size 256 gave 4.36978e+10 bytes/second.
    Size 512 gave 4.54119e+10 bytes/second.
    Size 1024 gave 4.50594e+10 bytes/second.
    Size 2048 gave 4.71622e+10 bytes/second.
    Size 4096 gave 4.60261e+10 bytes/second.
    } B
    { C
    Size 8192 gave 3.84764e+10 bytes/second.
    Size 16384 gave 3.90878e+10 bytes/second.
    Size 32768 gave 3.76718e+10 bytes/second.
    Size 65536 gave 3.87339e+10 bytes/second.
    Size 131072 gave 3.89058e+10 bytes/second.
    Size 262144 gave 3.85662e+10 bytes/second.
    Size 524288 gave 4.19209e+10 bytes/second.
    } C
    { D
    Size 1048576 gave 3.00962e+10 bytes/second.
    Size 2097152 gave 1.37616e+10 bytes/second.
    Size 4194304 gave 1.4136e+10 bytes/second.
    Size 8388608 gave 1.39413e+10 bytes/second.
    Size 16777216 gave 1.35747e+10 bytes/second.
    Size 33554432 gave 1.37286e+10 bytes/second.
    Size 67108864 gave 1.38163e+10 bytes/second.
    Size 134217728 gave 1.29987e+10 bytes/second.
    } D



A displays bandwidth following the access size

B displays essentially the same performance throughout

C displays a minor step down in performance due to L2 or TLB effects, a 5:4 ratio

D displays a major step down in performance compared to C, about a 3:1 ratio

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Andy Burns on Sat Dec 30 14:57:55 2023
    XPost: uk.comp.homebuilt

    Andy Burns wrote:
    Vir Campestris wrote:

    It is a microbenchmark.
    I was trying to understand cache performance on my system.
    Over in comp.lang.c++ you'll find a thread about Sieve or Eratosthenes.

    I and another poster are trying to optimise it. Not for any good
    reason of course... but I was just curious about some of the results I
    was getting.

A few years ago this guy popped up on youtube, saying "I was the author
of Task Manager on Windows NT". I ignored the recommendation for quite a while, thinking he just wanted people to blow smoke up his arse;
eventually I watched, and he seems a decent guy ... anyway, he did a drag racing series using the sieve prime algorithm

    <https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>

    Dave is a Good Guy!

    I have contributed to his "smallest windows program ever" with a small
    tweak that saved 4 (or 6?) bytes.

The sieve algorithm used above is more advanced than the fixed modulo 30 program I wrote a few decades ago, but similar in approach.

    I decided to use mod 30 because as soon as you get to 30+, there are
    exactly 8 possible prime locations in each block, so it fits perfectly
    in a byte map. :-)

    1,7,11,13,17,19,23,29
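
A minimal sketch of that byte-map layout (a reconstruction, not Terje's
original code): one byte of candidate flags covers each block of 30
integers, with bit i standing for the i-th residue below.

#include <stdint.h>

static const uint8_t wheel[8] = {1, 7, 11, 13, 17, 19, 23, 29};

/* Map n (coprime to 30) to its bit position within its block's byte. */
static int bit_index(uint32_t n)
{
    uint32_t r = n % 30;
    for (int i = 0; i < 8; i++)
        if (wheel[i] == r)
            return i;
    return -1;               /* n shares a factor with 30 */
}

/* Test whether n is still marked as a prime candidate in the byte map. */
static int is_marked(const uint8_t *map, uint32_t n)
{
    int b = bit_index(n);
    return b >= 0 && ((map[n / 30] >> b) & 1);
}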

    Others have contributed in all manner of languages

    <https://github.com/PlummersSoftwareLLC/Primes>


    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Dec 30 22:05:30 2023
    XPost: uk.comp.homebuilt

    BGB wrote:

    long rfib(long x)
    { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }

    At first I thought the following looked pretty good:

    rfib:
    ENTER R29,R0,#0 // just 3 preserved registers.

    CMP R2,R1,#2
    PGE R2,TT
    MOV R1,#1 // return 1
    EXIT R29,R0,#0

    SUB R30,R1,#2
    SUB R1,R1,#1
    CALL rfib // rfib(n-1)
    MOV R29,R1
    MOV R1,R30
    CALL rfib // rfib(n-2)

    ADD R1,R1,R29
    EXIT R29,R0,#0 // return rfib(n-1)+rfib(n-2)

But then I moved save and restore to the rfib() calls, making the last rung
on the call tree less expensive by 6 memory references each, and by saving/restoring only 2 registers rather than 3 at the cost of a MOV, so each non-leaf level in the call tree saves/restores only 4 registers instead of 6::

    rfib:
    SUB R2,R1,#2
    PNLT R2,TT // (n-2) < 0
    MOV R1,#1 // return 1
    RET

    ENTER R30,R0,#0 // just 2 preserved registers.
    MOV R30,R2
    SUB R1,R1,#1
    CALL rfib // rfib(n-1)
    MOV R2,R1
    MOV R1,R30
    MOV R30,R2
    CALL rfib // rfib(n-2)

    ADD R1,R1,R30
    EXIT R30,R0,#0 // return rfib(n-1)+rfib(n-2)

    But then I realized that rfib(n-2) has already been computed by rfib(n-1). Since the size of the redundant element on the stack is constant, rfib(n-1) makes a location available to rfib(n-2) so only 1 call is required instead
    of 2::

    rfib:
    ENTER R0,R0,#8
    STD #0,[SP+8] // setup rfib[n-2]
    CALL rfib1 // recurse through rfib1
    EXIT R0,R0,#8

    rfib1:
    SUB R2,R1,#2
    PNLT R2,TT // (n-2) < 0
    MOV R1,#1 // return 1
    RET

    ENTER R0,R0,#8 // no preserved registers just return address

    SUB R1,R1,#1
    CALL rfib1 // rfib(n-1)
    LDD R2,[SP] // save rfib(n-2)

    ADD R1,R1,R30
    STD R1,[SP+16] // restore rfib(n-2)
    EXIT R0,R0,#8 // return rfib(n-1)+rfib(n-2)

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Sat Dec 30 23:55:52 2023
    XPost: uk.comp.homebuilt

    BGB wrote:

    On 12/30/2023 4:05 PM, MitchAlsup wrote:
    BGB wrote:

       long rfib(long x)
         { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }

    At first I thought the following looked pretty good:

    rfib:
         ENTER   R29,R0,#0          // just 3 preserved registers.

         CMP     R2,R1,#2
         PGE     R2,TT
         MOV     R1,#1              // return 1
         EXIT    R29,R0,#0

         SUB     R30,R1,#2
         SUB     R1,R1,#1
         CALL    rfib               // rfib(n-1)
         MOV     R29,R1
         MOV     R1,R30
         CALL    rfib               // rfib(n-2)

         ADD     R1,R1,R29
         EXIT    R29,R0,#0          // return rfib(n-1)+rfib(n-2)


    Theoretically, could be done in my ISA something like:
    rfib:
    ADD -24, SP | MOV LR, R1
    MOV 1, R2 | CMPGE 2, R4
    BF .L0
    MOV.X R12, (SP, 0)
    MOV R4, R12 | MOV.Q R1, (SP, 16)
    ADD R12, -1, R4
    BSR rfib
    MOV R2, R13 | ADD R12, -2, R4
    BSR rfib
    ADD R2, R13, R2
    MOV.Q (SP, 16), R1
    MOV.X (SP, 0), R12
    .L0:
    ADD 24, SP
    JMP R1

    But, my compiler isn't going to pull off anything like this.


    Checking, actual BGBCC output:
    rfib:
    MOV LR, R1
    BSR __prolog_0000_0000020000FC
    ADD -208, SP
    MOV RQ4, RQ13
    // tk_shell_chksvar.c:234 { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
    CMPQGE 2, RQ13
    BT .L00802A62
    MOV 1, RQ10
    BRA .L00802A63

    ..L00802A62:
    SUB RQ13, 1, RQ14
    MOV RQ14, RQ4
    BSR rfib
    MOV RQ2, RQ12
    SUB RQ13, 2, RQ14
    MOV RQ14, RQ4
    BSR rfib
    MOV RQ2, RQ11
    ADD RQ12, RQ11, RQ14
    MOV RQ14, RQ10

    ..L00802A63:
    MOV RQ10, RQ2

    ..L00C108CE:
    ADD 208, SP
    BRA __epilog_0000_0000020000FC
    .balign 4

    ( Yes, this is more MOV's than "technically necessary", but eliminating
    them from the compiler output is "easier said than done" given how some
    parts of the compiler work. RQn/RDn are technically equivalent to Rn,
but the compiler distinguishes them, as early on the idea was that the assembler might distinguish instructions based on type, like
    EAX/RAX/AX/... on x86-64, but it didn't quite happen this way. )

    And, fetching the prolog and epilog:
    __prolog_0000_0000020000FC:
    ADD -48, SP
    MOV.Q R1, (SP, 40)
    MOV.X R12, (SP, 16)
    MOV.Q R14, (SP, 32)
    MOV.X R10, (SP, 0)
    RTS

    __epilog_0000_0000020000FC:
    MOV.Q (SP, 40), R1
    MOV.X (SP, 0), R10
    MOV.X (SP, 16), R12
    MOV.Q (SP, 32), R14
    ADD 48, SP
    JMP R1


    Or, a little closer to the final machine-code (BJX2 Baseline):

    rfib: //@03E27C
    4911 STC R1, R1
    ..reloc __prolog_0000_0000020000FC 0F/RELW20_BJX
    F000_D000 BSR (PC, 0)
    F2F1_D330 ADD -208, SP
    18D4 MOV R4, R13
    F2DA_C802 CMPQGE 2, R13
    ..reloc .L00802A62 0F/RELW20_BJX
    E000_C000 BT (PC, 0)
    DA01 MOV 1, R10
    ..reloc .L00802A63 0F/RELW20_BJX
    F000_C000 BRA (PC, 0)
    3000 NOP

    ..L00802A62: //@03E298
    F2ED_11FF ADD R13, -1, R14
    F04E_1089 MOV R14, R4
    ..reloc rfib 0F/RELW20_BJX
    F000_D000 BSR (PC, 0)
    F4C2_1089 MOV R2, R12 |
    F2ED_11FE ADD R13, -2, R14
    F04E_1089 MOV R14, R4
    ..reloc rfib 0F/RELW20_BJX
    F000_D000 BSR (PC, 0)
    F0B2_1089 MOV R2, R11
    F0EC_10B0 ADD R12, R11, R14
    F0AE_1089 MOV R14, R10

    ..L00802A63: //@03E2C0
    182A MOV R10, R2

    ..L00C108CE: //@03E2C2
    F2F0_D0D0 ADD 208, SP
    ..reloc __epilog_0000_0000020000FC 0F/RELW20_BJX
    F000_C000 BRA (PC, 0)
    3030 BRK


But then I moved save and restore to the rfib() calls, making the last rung
on the call tree less expensive by 6 memory references each, and by saving/restoring only 2 registers rather than 3 at the cost of a MOV, so each non-leaf level in the call tree saves/restores only 4 registers instead of 6::
    rfib:
         SUB     R2,R1,#2
         PNLT    R2,TT              // (n-2) < 0
         MOV     R1,#1              // return 1
         RET

         ENTER   R30,R0,#0          // just 2 preserved registers.
         MOV     R30,R2
         SUB     R1,R1,#1
         CALL    rfib               // rfib(n-1)
         MOV     R2,R1
         MOV     R1,R30
         MOV     R30,R2
         CALL    rfib               // rfib(n-2)

         ADD     R1,R1,R30
         EXIT    R30,R0,#0          // return rfib(n-1)+rfib(n-2)

But then I realized that rfib(n-2) has already been computed by rfib(n-1). Since the size of the redundant element on the stack is constant, rfib(n-1) makes a location available to rfib(n-2) so only 1 call is required instead of 2::

    rfib:
         ENTER   R0,R0,#8
         STD     #0,[SP+8]        // setup rfib[n-2]
     CALL    rfib1              // recurse through rfib1
     EXIT    R0,R0,#8

    rfib1:
         SUB     R2,R1,#2
         PNLT    R2,TT              // (n-2) < 0
         MOV     R1,#1              // return 1
         RET

     ENTER   R0,R0,#8           // no preserved registers, just return address

         SUB     R1,R1,#1
         CALL    rfib1              // rfib(n-1)
         LDD     R2,[SP]            // save rfib(n-2)

         ADD     R1,R1,R30
         STD     R1,[SP+16]         // restore rfib(n-2)
         EXIT    R0,R0,#8           // return rfib(n-1)+rfib(n-2)

    ....

    If you cache the intermediate results of the calculations, this entirely defeats its use as a benchmark...

    Exactly:: but there is talk of compilers figuring out that rfib() is a
    pure function and then the compiler itself performs that transformation.

    {{Oh and BTW: Fibonacci should have never been a benchmark (post 1985)}}

    Say, because it drops from ~ O(2^n) to ~ O(n)...

    It also reduces the constant term in BigO ! since the code path is shorter.

    Back in 1982±, S.E.L. got their compiler to recognize the transcendental
    loop as constant, and this one transformation improved Whetstone by 3×.

    Granted, There are plenty of other much more efficient ways of
    calculating Fibonacci numbers...

    Make a file as big as you like, and simply index.

    BigO( const )

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Pancho@21:1/5 to BGB on Sun Dec 31 09:12:12 2023
    XPost: uk.comp.homebuilt

    On 30/12/2023 23:19, BGB wrote:

    Granted, There are plenty of other much more efficient ways of
    calculating Fibonacci numbers...


    Yes. In order of effort.

    1) Memoize the recursive function.
    2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
    recursive. Which a smart compiler could theoretically convert to an
    iterative loop.
    3) Write an iterative loop yourself.
    4) Remember college maths, and solve the recurrence relation to give a
    very simple closed form solution.

In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the naive double recursive one.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Daniel James@21:1/5 to Pancho on Sun Dec 31 15:45:32 2023
    XPost: uk.comp.homebuilt

    On 31/12/2023 09:12, Pancho wrote:
In fact, given that fib grows exponentially, so we only ever need to calculate it for small values of n, the only "bad" solution is the naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you need
    to compute it often enough that efficiency might become relevant just
    put the results in a table and use that. Given that fib goes over 10^12
    after about 60 terms it needn't be a very big table.

    The naive doubly recursive solution has the huge merit of being
    human-readable.

    Unlike, say:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    ... which is great for fibs of big values, but not fast for small ones.
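
For reference, that is Binet's closed form with the index shifted by
one, since this code treats fib(0) = fib(1) = 1:

    fib(n) = (phi^(n+1) - psi^(n+1)) / sqrt(5),
    where phi = (1+sqrt(5))/2 and psi = (1-sqrt(5))/2;

in the usual 0-based series (fib(0) = 0) the same formula appears
without the +1 shift. Since |psi| < 1, the psi term shrinks toward
zero as n grows, which is why rounding the phi term alone also works
for large n.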

    --
    Cheers,
    Daniel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Sun Dec 31 16:22:39 2023
    XPost: uk.comp.homebuilt

    MitchAlsup wrote:
    BGB wrote:

       long rfib(long x)
         { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }

    At first I thought the following looked pretty good:

    rfib:
         ENTER   R29,R0,#0          // just 3 preserved registers.

         CMP     R2,R1,#2
         PGE     R2,TT
         MOV     R1,#1              // return 1
         EXIT    R29,R0,#0

         SUB     R30,R1,#2
         SUB     R1,R1,#1
         CALL    rfib               // rfib(n-1)
         MOV     R29,R1
         MOV     R1,R30
         CALL    rfib               // rfib(n-2)

         ADD     R1,R1,R29
         EXIT    R29,R0,#0          // return rfib(n-1)+rfib(n-2)

But then I moved save and restore to the rfib() calls, making the last rung
on the call tree less expensive by 6 memory references each, and by saving/restoring only 2 registers rather than 3 at the cost of a MOV, so each non-leaf level in the call tree saves/restores only 4 registers instead of 6::

    rfib:
         SUB     R2,R1,#2
         PNLT    R2,TT              // (n-2) < 0
         MOV     R1,#1              // return 1
         RET

         ENTER   R30,R0,#0          // just 2 preserved registers.
         MOV     R30,R2
         SUB     R1,R1,#1
         CALL    rfib               // rfib(n-1)
         MOV     R2,R1
         MOV     R1,R30
         MOV     R30,R2
         CALL    rfib               // rfib(n-2)

         ADD     R1,R1,R30
         EXIT    R30,R0,#0          // return rfib(n-1)+rfib(n-2)

    But then I realized that rfib(n-2) has already been computed by rfib(n-1). Since the size of the redundant element on the stack is constant, rfib(n-1) makes a location available to rfib(n-2) so only 1 call is required instead
    of 2::

    That is the huge/obvious/sneaky (take your pick!) Fib optimization,
    since you are effectively getting rid of the O(2^n) exponential
    recursion fanout, making it O(n) instead.

    The step from that to tail recursion elimination is the only thing
    remaining, right?

    What you've done is similar to converting

    f_n = fib(n)

    into

    (f_n,f_n_minus_1) = fib2(n)
    {
    if (n < 2) return (1,1);
    (f2,f1) = fib2(n-1);
    return (f1+f2,f1);
    }
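
In C, which has no tuples, the same shape can be written with a small
struct (a sketch; the names are additions, using this subthread's
fib(0)=fib(1)=1 convention):

typedef unsigned long long u64;
typedef struct { u64 fn, fn1; } fibpair;   /* (fib(n), fib(n-1)) */

static fibpair fib2(unsigned n)
{
    if (n < 2)
        return (fibpair){1, 1};
    fibpair p = fib2(n - 1);               /* single recursive call */
    return (fibpair){p.fn + p.fn1, p.fn};
}

u64 fib(unsigned n) { return fib2(n).fn; }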

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Pancho@21:1/5 to Daniel James on Sun Dec 31 16:40:09 2023
    XPost: uk.comp.homebuilt

    On 31/12/2023 15:45, Daniel James wrote:
    On 31/12/2023 09:12, Pancho wrote:
    In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the
    naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you need
    to compute it often enough that efficiency might become relevant just
    put the results in a table and use that. Given that fib goes over 10^12
    after about 60 terms it needn't be a very big table.


Memoization is effectively a lookup table, but without the thinking. In
my case, avoiding thinking massively improves code reliability.

The naive doubly recursive solution has the huge merit of being human-readable.

Yes, but you'd be surprised how quickly double recursion goes bad. I
    haven't tested recently, but off the top of my head, the maximum
    function call depth would be something like 300, so fib(10) would be a
    stack overflow. fib(60) would need a stack function call depth ~ 1e18.

    Unlike, say:

      double fib( int n )
      {
          return (pow( (1+sqrt(5.0))/2, n+1 )
                  - pow( (1-sqrt(5.0))/2, n+1 ))
              / sqrt(5.0);
      }

    ... which is great for fibs of big values, but not fast for small ones.


    Why isn't it fast for small fibs?

    I seem to recall, if statements are more costly than floating point
    function calls, like exponential.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Andy Burns on Sun Dec 31 16:52:58 2023
    XPost: uk.comp.homebuilt

    On 29/12/2023 22:04, Andy Burns wrote:
    and (perversely?) with -Ofast

    Thank you for those. It looks as though there are some other effects
    going on apart from the cache size - but I can clearly see the drop when
    you run out of cache. Within the caches? Too much noise for me to
    understand!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Pancho on Sun Dec 31 18:03:42 2023
    XPost: uk.comp.homebuilt

    Pancho wrote:
    On 31/12/2023 15:45, Daniel James wrote:
    On 31/12/2023 09:12, Pancho wrote:
    In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the
    naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you
    need to compute it often enough that efficiency might become relevant
    just put the results in a table and use that. Given that fib goes over
    10^12 after about 60 terms it needn't be a very big table.


    Memoization, is effectively a lookup table, but without the thinking. In
    my case, avoiding thinking massively improves code reliability.

    The naive doubly recursive solution has the huge merit of being
    human-readable.
    Yes, but you'd be surprised how quickly double recursion goes bad. I
    haven't tested recently, but off the top of my head, the maximum
    function call depth would be something like 300, so fib(10) would be a
    stack overflow. fib(60) would need a stack function call depth ~ 1e18.

    Unlike, say:

       double fib( int n )
       {
           return (pow( (1+sqrt(5.0))/2, n+1 )
                   - pow( (1-sqrt(5.0))/2, n+1 ))
               / sqrt(5.0);
       }

    ... which is great for fibs of big values, but not fast for small ones.


    Why isn't it fast for small fibs?

    I seem to recall, if statements are more costly than floating point
    function calls, like exponential.

The recursive fib() function has just a single if () statement, so worst
    case you'll get one or two branch mispredicts, while the 2^n recursive
    calls (for fib(10 or 11)) will probably each take about as long.

An iterative O(n) version will take 2-4 clock cycles per iteration, so
    for n less than about 40, you'd only have ~100 clock cycles to evaluate
    the two pow() calls. (I'm assuming the compiler or programmer have pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as
    1/sqrt(5), so that only the two pow() calls remain!)

    You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Pancho@21:1/5 to Pancho on Sun Dec 31 21:04:03 2023
    XPost: uk.comp.homebuilt

    On 31/12/2023 16:40, Pancho wrote:
    I
    haven't tested recently, but off the top of my head, the maximum
    function call depth would be something like 300, so fib(10) would be a
    stack overflow. fib(60) would need a stack function call depth ~ 1e18.


    Sorry this was a mistake, the function call depth is only ~n not 2^n.
    For fib(n), there are ~2^n function calls, but not 2^n deep on the stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Pancho@21:1/5 to Terje Mathisen on Sun Dec 31 21:19:01 2023
    XPost: uk.comp.homebuilt

    On 31/12/2023 17:03, Terje Mathisen wrote:


The recursive fib() function has just a single if () statement, so worst
    case you'll get one or two branch mispredicts, while the 2^n recursive
    calls (for fib(10 or 11)) will probably each take about as long.


How many cycles for a branch mispredict? From real life, I have seen if statements be more expensive than pow, but it was a long time ago, and
    possibly code specific. It surprised me at the time, but ever since I
    tend to ignore the cost of functions like pow().


An iterative O(n) version will take 2-4 clock cycles per iteration, so
    for n less than about 40, you'd only have ~100 clock cycles to evaluate
    the two pow() calls. (I'm assuming the compiler or programmer have pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as
    1/sqrt(5), so that only the two pow() calls remain!)

    You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.


    I generally don't worry about optimizing stuff that is very fast.
    Premature optimisation is the root of all evil and what not. If you are
    really calling it a huge number of times, a lookup is quickest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Daniel James@21:1/5 to Pancho on Mon Jan 1 15:00:28 2024
    XPost: uk.comp.homebuilt

    On 31/12/2023 16:40, Pancho wrote:
    The naive doubly recursive solution has the huge merit of being
    human-readable.

    Yes, but you'd be surprised how quickly double recursion goes bad. I
    haven't tested recently, but off the top of my head, the maximum
    function call depth would be something like 300, so fib(10) would be
    a stack overflow. fib(60) would need a stack function call depth ~
    1e18.

    On 64-bit gcc here on Debian (so, 64-bit ints) I can call fib(50)
    without blowing the stack, but it takes minutes (on a Ryzen 7).

However, the (signed) 64-bit integer result overflows for fib(n) where
n > 45, which is more of a problem.

    Unlike, say:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    ... which is great for fibs of big values, but not fast for small
    ones.


    Why isn't it fast for small fibs?

    Gut feeling. I haven't timed it, and it may be.

    I mean ... obviously it will be as fast for small ones as for large, but
    for small ones it won't beat an iterative calculation and for very small
    ones I doubt it will beat the naive recursive method.

My point, though, was that clear code is often more important than fast execution, and while the naive recursive method for fibs gets horribly
slow for large values, it is acceptable for small ones.

    I seem to recall, if statements are more costly than floating point
    function calls, like exponential.

    if statements are costly if they invalidate instruction prefetch and/or
    jump to code that isn't in the cache ... but optimizers and CPUs are
    pretty good at branch prediction and cacheing.

    --
    Cheers,
    Daniel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Daniel James on Mon Jan 1 20:16:43 2024
    XPost: uk.comp.homebuilt

    Daniel James wrote:

    Unlike, say:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    Looks fast to me::

    fib:
    ADD R2,R1,#1
    POWN R3,#(1+sqrt(5.0))/2,R2
    POWN R4,#(1-sqrt(5.0))/2,R2
    FADD R2,R3,R4
    FDIV R1,R2,#sqrt(5)
    RET

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Mon Jan 1 20:22:23 2024
    XPost: uk.comp.homebuilt

    Terje Mathisen wrote:

    Pancho wrote:
    Unlike, say:

       double fib( int n )
       {
           return (pow( (1+sqrt(5.0))/2, n+1 )
                   - pow( (1-sqrt(5.0))/2, n+1 ))
               / sqrt(5.0);
       }

    ... which is great for fibs of big values, but not fast for small ones.


    Why isn't it fast for small fibs?

    I seem to recall, if statements are more costly than floating point
    function calls, like exponential.

The recursive fib() function has just a single if () statement, so worst
    case you'll get one or two branch mispredicts, while the 2^n recursive
    calls (for fib(10 or 11)) will probably each take about as long.

An iterative O(n) version will take 2-4 clock cycles per iteration, so
    for n less than about 40, you'd only have ~100 clock cycles to evaluate
    the two pow() calls.

    POW is 39 cycles (unpipelined) in My 66000 ISA.

    (I'm assuming the compiler or programmer have pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as
    1/sqrt(5), so that only the two pow() calls remain!)

    Plus a FSUB and FDIV.

    You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.

The Ln2() is 14, the exp2() is another 14, then there is the 90-bit×64-bit multiply at 5 cycles and a pipeline latency of 5 cycles (and I seem to be missing
a cycle)

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Pancho on Mon Jan 1 20:26:14 2024
    XPost: uk.comp.homebuilt

    Pancho wrote:

    On 31/12/2023 17:03, Terje Mathisen wrote:


The recursive fib() function has just a single if () statement, so worst
    case you'll get one or two branch mispredicts, while the 2^n recursive
    calls (for fib(10 or 11)) will probably each take about as long.


How many cycles for a branch mispredict? From real life, I have seen if statements be more expensive than pow, but it was a long time ago, and possibly code specific. It surprised me at the time, but ever since I
    tend to ignore the cost of functions like pow().

Long latency branch mispredicts are often in the 13-cycle range. So if you
are issuing 2 instructions per cycle into the pipeline, you may be throwing
away ~40 instructions per mispredict. For example LD-CMP-BR where the LD misses
    L1 but hits in L2.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Pancho on Mon Jan 1 14:03:59 2024
    Pancho <Pancho.Jones@proton.me> writes:

    On 30/12/2023 23:19, BGB wrote:

    Granted, There are plenty of other much more efficient ways of
    calculating Fibonacci numbers...

    Yes. In order of effort.

    1) Memoize the recursive function.
    2) Return values for (n-1,n-2) as a tuple, and make it single (tail) recursive. Which a smart compiler could theoretically convert to an iterative loop.
    3) Write an iterative loop yourself.
    4) Remember college maths, and solve the recurrence relation to give a
    very simple closed form solution.

    All of these approaches have drawbacks of one sort or another.

    For (1), we end up writing more code and using more memory.
    Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.

    For (2), compilers are not so good at dealing with tail
    recursion when the return value is a pair of results
    rather than a single result.

    For (3), iterative code takes more mental effort to see
    that it works than does a simple recursive version.

For (4), limitations of floating point cause wrong values to
be produced, even just for 64-bit inputs.

    Also, simply storing values in a table takes a fair amount
    of memory: almost 3k bytes for 128-bit inputs, or about
    3/4 k bytes for 64-bit inputs.

    Here is a recursive formulation that is better than the
    suggestion given in (2):

    typedef unsigned long long NN;

    static NN fib2( NN a, NN b, NN k );

    NN
    fibonacci( NN n ){
    return fib2( 1, 0, n );
    }

    static NN
    fib2( NN a, NN b, NN k ){
    return k == 0 ? b : fib2( b, a+b, k-1 );
    }

    At optimization -Os or better, the resulting code looks
    very much the same as if a single iterative loop had
    been written instead.

    Actually we can do much better, but we should do at least
    as well as the code shown above.
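
Under that assumption, the tail call collapses into a jump and fib2
behaves like this hand-written loop (a sketch for comparison; it keeps
the same fib(0) = 0 convention):

typedef unsigned long long NN;

NN fibonacci_loop(NN n)
{
    NN a = 1, b = 0;            /* mirrors fib2's two accumulators */
    while (n--) {
        NN t = a + b;
        a = b;                  /* (a, b) <- (b, a+b), as in fib2 */
        b = t;
    }
    return b;
}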

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Mon Jan 1 22:46:25 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Pancho <Pancho.Jones@proton.me> writes:

    On 30/12/2023 23:19, BGB wrote:

    Granted, There are plenty of other much more efficient ways of
    calculating Fibonacci numbers...

    Yes. In order of effort.

    1) Memoize the recursive function.
    2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
    recursive. Which a smart compiler could theoretically convert to an
    iterative loop.
    3) Write an iterative loop yourself.
    4) Remember college maths, and solve the recurrence relation to give a
    very simple closed form solution.

    All of these approaches have drawbacks of one sort or another.

    For (1), we end up writing more code and using more memory.
    Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.

    For (2), compilers are not so good at dealing with tail
    recursion when the return value is a pair of results
    rather than a single result.

    For (3), iterative code takes more mental effort to see
    that it works than does a simple recursive version.

    I don't think that

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

is particularly intellectually challenging.

    However, my favorite method is https://oeis.org/A331164 .
    Not because it is particularly effective or easy to understand,
    but because the sequence is nice to listen to.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Daniel James on Mon Jan 1 14:29:04 2024
    Daniel James <daniel@me.invalid> writes:

However, the (signed) 64-bit integer result overflows for fib(n) where
n > 45, which is more of a problem.

    fib( 92 ) is still less than 1UL << 63 (or fib(93) for a 64-bit
    unsigned type).

    And fib( 186 ) still fits in a 128-bit unsigned type.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Mon Jan 1 23:51:03 2024
    Thomas Koenig wrote:

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    do { // we know n >=2 so at least 1 path through the loop is performed.
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    } while( n > 2 );
    return fn;
    }

Saves 1 cycle and at least 1 instruction of code footprint.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Tue Jan 2 08:16:44 2024
    XPost: uk.comp.homebuilt

    MitchAlsup wrote:
    Daniel James wrote:

    Unlike, say:

       double fib( int n )
       {
           return (pow( (1+sqrt(5.0))/2, n+1 )
                   - pow( (1-sqrt(5.0))/2, n+1 ))
               / sqrt(5.0);
       }

    Looks fast to me::

    fib:
        ADD    R2,R1,#1
        POWN    R3,#(1+sqrt(5.0))/2,R2
        POWN    R4,#(1-sqrt(5.0))/2,R2
        FADD    R2,R3,R4
        FDIV    R1,R2,#sqrt(5)
        RET

    The final FDIV should be a FMUL by (1.0/sqrt(5.0))

    Probably followed by a round to nearest int.
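
In C those two fixes might look like this (a sketch; the constant
values are additions, pre-rounded to double precision):

#include <math.h>

double fib_closed(int n)
{
    static const double PHI       =  1.6180339887498949;  /* (1+sqrt(5))/2 */
    static const double PSI       = -0.6180339887498949;  /* (1-sqrt(5))/2 */
    static const double INV_SQRT5 =  0.4472135954999579;  /* 1/sqrt(5)     */

    /* multiply by the reciprocal, then round to the nearest integer */
    return round((pow(PHI, n + 1) - pow(PSI, n + 1)) * INV_SQRT5);
}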

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Jan 2 08:22:19 2024
    Tim Rentsch wrote:
    Daniel James <daniel@me.invalid> writes:

However, the (signed) 64-bit integer result overflows for fib(n) where
n > 45, which is more of a problem.

    fib( 92 ) is still less than 1UL << 63 (or fib(93) for a 64-bit
    unsigned type).

    And fib( 186 ) still fits in a 128-bit unsigned type.

    If calculating fib(n) for n in the fixed 0..186 range is actually
    important to you, then you would obviously allocate 16*187 (about 3KB)
    bytes of table space and do a simple lookup instead.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Daniel James on Tue Jan 2 00:08:45 2024
    Daniel James <daniel@me.invalid> writes:

    On 31/12/2023 09:12, Pancho wrote:

    In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the
    naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you
    need to compute it often enough that efficiency might become relevant
    just put the results in a table and use that. Given that fib goes over
    10^12 after about 60 terms it needn't be a very big table.

    When the size of the table is more than ten times the size of
    the code, and the code is reasonably fast, using a table is a
    poor choice. Fibonacci functions don't have to be slow.
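
One way to make that concrete is the O(log n) fast-doubling method,
using fib(2k) = fib(k)*(2*fib(k+1) - fib(k)) and
fib(2k+1) = fib(k)^2 + fib(k+1)^2. A sketch (an addition, with the
fib(0) = 0 convention; exact up to n = 93, the last Fibonacci number
that fits in 64 bits, since unsigned wraparound cancels in the
identities):

#include <stdint.h>

uint64_t fib_doubling(uint64_t n)
{
    uint64_t a = 0, b = 1;                 /* (fib(k), fib(k+1)), k = 0 */
    for (int bit = 63; bit >= 0; bit--) {
        uint64_t c = a * (2 * b - a);      /* fib(2k)   */
        uint64_t d = a * a + b * b;        /* fib(2k+1) */
        a = c;
        b = d;                             /* index is now 2k */
        if ((n >> bit) & 1) {              /* consume the next bit of n */
            uint64_t t = a + b;
            a = b;
            b = t;                         /* step from 2k to 2k+1 */
        }
    }
    return a;
}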

    The naive doubly recursive solution has the huge merit of being human-readable.

    But unacceptably slow.

    Unlike, say:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    ... which is great for fibs of big values, but not fast for small ones.

    Unacceptable for fibs of big values, because the values computed
    are wrong.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Tue Jan 2 09:05:00 2024
    MitchAlsup wrote:
    Thomas Koenig wrote:

    #include <stdio.h>

    unsigned long int fib (int n)
    {
      unsigned long f1, f2, fn;

      if (n < 2)
        return 1;

      f1 = 1;
      f2 = 1;
  do { // we know n >=2 so at least 1 path through the loop is performed.
    fn = f2 + f1;
        f1 = f2;
        f2 = fn;
        n--;
      } while( n > 2 );
      return fn;
    }

Saves 1 cycle and at least 1 instruction of code footprint.

    I have written asm code for this, and noted that on x86 where reg MOVes
    used to cost a cycle, it was better to unroll and get rid of the reg-reg
    moves.

    I think the following should work:

typedef unsigned long long u64;

u64 fib(unsigned n)
{
if (n-- < 2) return 1;
u64 a = 1, b = 1;

    a += (n & 1); // Odd remaining n
    n >>= 1;

    while (n--) {
    b += a;
    a += b;
    }
    return a;
    }
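
A quick sanity check of that unrolled loop (an addition; the expected
values use this subthread's fib(0)=fib(1)=1 convention):

#include <assert.h>

int main(void)
{
    static const u64 expect[8] = {1, 1, 2, 3, 5, 8, 13, 21};
    for (unsigned n = 0; n < 8; n++)
        assert(fib(n) == expect[n]);
    return 0;
}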

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Tue Jan 2 00:24:21 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Pancho <Pancho.Jones@proton.me> writes:

    On 30/12/2023 23:19, BGB wrote:

    Granted, There are plenty of other much more efficient ways of
    calculating Fibonacci numbers...

    Yes. In order of effort.

    1) Memoize the recursive function.
    2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
    recursive. Which a smart compiler could theoretically convert to an
    iterative loop.
    3) Write an iterative loop yourself.
    4) Remember college maths, and solve the recurrence relation to give a
    very simple closed form solution.

    All of these approaches have drawbacks of one sort or another.

    For (1), we end up writing more code and using more memory.
    Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.

    For (2), compilers are not so good at dealing with tail
    recursion when the return value is a pair of results
    rather than a single result.

    For (3), iterative code takes more mental effort to see
    that it works than does a simple recursive version.

    I don't think that

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

is particularly intellectually challenging.

    I'm sure there are other people who feel the same way.

    Even so, when a short, functional, recursive version
    is available, it is often easier to write, easier to
    read, and simpler to show that it produces correct
    results, in which case doing that is a better choice,
    even if an imperative version is not particularly
intellectually challenging.

    Incidentally your code produces wrong values, as
    fib(0) is 0 and fib(1) is 1.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Jan 2 09:27:04 2024
    Tim Rentsch wrote:
    Daniel James <daniel@me.invalid> writes:

    On 31/12/2023 09:12, Pancho wrote:

    In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only "bad" solution is the
    naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you
    need to compute it often enough that efficiency might become relevant
    just put the results in a table and use that. Given that fib goes over
    10^12 after about 60 terms it needn't be a very big table.

    When the size of the table is more than ten times the size of
    the code, and the code is reasonably fast, using a table is a
    poor choice. Fibonacci functions don't have to be slow.

    The naive doubly recursive solution has the huge merit of being
    human-readable.

    But unacceptably slow.

    Unlike, say:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    ... which is great for fibs of big values, but not fast for small ones.

    Unacceptable for fibs of big values, because the values computed
    are wrong.


So either you move to quad or arbitrary precision (which will of course
be somewhat slow), or you use a hybrid technique, with lookup tables for smallish n, an iterative loop for somewhat larger n, and your O(log(n))
approach only for very large n?

    Working with 256-bit uints, each addition will take 4-8 clock cycles,
    and it overflows well before n reaches 400, right? In that case a worst
    case iterative solution is less than 3000 clock cycles or about a
    microsecond away.


    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Jan 2 09:36:17 2024
    Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Pancho <Pancho.Jones@proton.me> writes:

    On 30/12/2023 23:19, BGB wrote:

    Granted, There are plenty of other much more efficient ways of
    calculating Fibonacci numbers...

    Yes. In order of effort.

    1) Memoize the recursive function.
    2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
    recursive. Which a smart compiler could theoretically convert to an
    iterative loop.
    3) Write an iterative loop yourself.
4) Remember college maths, and solve the recurrence relation to give a
very simple closed form solution.

    All of these approaches have drawbacks of one sort or another.

    For (1), we end up writing more code and using more memory.
    Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.

    For (2), compilers are not so good at dealing with tail
    recursion when the return value is a pair of results
    rather than a single result.

    For (3), iterative code takes more mental effort to see
    that it works than does a simple recursive version.

    I don't think that

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

is particularly intellectually challenging.

    I'm sure there are other people who feel the same way.

    Even so, when a short, functional, recursive version
    is available, it is often easier to write, easier to
    read, and simpler to show that it produces correct
    results, in which case doing that is a better choice,
    even if an imperative version is not particularly
intellectually challenging.

    Incidentally your code produces wrong values, as
    fib(0) is 0 and fib(1) is 1.


    So return n for (n<2)?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Tue Jan 2 08:54:11 2024
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    do { // we know n >=2 so at least 1 path through the loop is performed.
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    } while( n > 2 );

    (This should be n >= 2)
    return fn;
    }

Saves 1 cycle and at least 1 instruction of code footprint.

    Fortunately, this is one of the transformations which compilers
    do automatically these days. For both versions, the code for
    fib generated by a recent gcc on x86_64 with -O3 is

    fib:
    .LFB11:
    .cfi_startproc
    movl $1, %edx
    cmpl $1, %edi
    jle .L1
    movl $1, %eax
    movl $1, %ecx
    jmp .L3
    .p2align 4,,10
    .p2align 3
    .L5:
    movq %rdx, %rax
    .L3:
    subl $1, %edi
    leaq (%rcx,%rax), %rdx
    movq %rax, %rcx
    cmpl $1, %edi
    jne .L5
    .L1:
    movq %rdx, %rax
    ret

    which looks pretty good.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Pancho@21:1/5 to Daniel James on Tue Jan 2 09:23:39 2024
    XPost: uk.comp.homebuilt

    On 01/01/2024 15:00, Daniel James wrote:
    On 31/12/2023 16:40, Pancho wrote:
    The naive doubly recursive solution has the huge merit of being
    human-readable.

    Yes, but you'd be surprised how quickly double recursion goes bad. I
    haven't tested recently, but off the top of my head, the maximum
    function call depth would be something like 300, so fib(10) would be
    a stack overflow. fib(60) would need a stack function call depth ~
    1e18.

    On 64-bit gcc here on Debian (so, 64-bit ints) I can call fib(50)
    without blowing the stack, but it takes minutes (on a Ryzen 7).

However, the (signed) 64-bit integer result overflows for fib(n) where
n > 45, which is more of a problem.


    Yes, I explained that was a brain fart on my part. I remembered the
    recursive stack limit was low and confused the number of recursive calls
    with stack depth.

The real place I've had recursive stuff break is tree/graph handling,
    where a tree may adopt linear characteristics. There are much better
    algorithms than a naive recursive one.


    Unlike, say:

       double fib( int n )
       {
           return (pow( (1+sqrt(5.0))/2, n+1 )
                   - pow( (1-sqrt(5.0))/2, n+1 ))
               / sqrt(5.0);
       }

    ... which is great for fibs of big values, but not fast for small
    ones.


    Why isn't it fast for small fibs?

    Gut feeling. I haven't timed it, and it may be.


    Yes, what I'm saying is it normally isn't valid. Floating point
    functions are fast, and it is rarely something you need to optimise.

    I mean ... obviously it will be as fast for small ones as for large, but
    for small ones it won't beat an iterative calculation and for very small
    ones I doubt it will beat the naive recursive method.


    It generally doesn't matter if something very, very fast is slightly
    faster than something just very fast. Especially when you also have
    very, very slow behaviour elsewhere.

My point, though, was that clear code is often more important than fast execution, and while the naive recursive method for fibs gets horribly
slow for large values, it is acceptable for small ones.


From a mathematical viewpoint, the closed form solution gives clearer behaviour. It really depends on what you want; giving it a standard name
is probably the most important thing.

    I seem to recall, if statements are more costly than floating point
    function calls, like exponential.

    if statements are costly if they invalidate instruction prefetch and/or
    jump to code that isn't in the cache ... but optimizers and CPUs are
    pretty good at branch prediction and cacheing.


I was remembering software doing real life financial calcs, but I'm old,
so it was a long time ago. Pipelines are longer now, but maybe branch prediction is better? So I don't know if it is more of a problem or
less of a problem nowadays.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Tue Jan 2 11:24:38 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Pancho <Pancho.Jones@proton.me> writes:

    On 30/12/2023 23:19, BGB wrote:

    Granted, There are plenty of other much more efficient ways of
    calculating Fibonacci numbers...

    Yes. In order of effort.

    1) Memoize the recursive function.
    2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
    recursive. Which a smart compiler could theoretically convert to an
    iterative loop.
    3) Write an iterative loop yourself.
4) Remember college maths, and solve the recurrence relation to give a
very simple closed form solution.

    All of these approaches have drawbacks of one sort or another.

    For (1), we end up writing more code and using more memory.
    Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.

    For (2), compilers are not so good at dealing with tail
    recursion when the return value is a pair of results
    rather than a single result.

    For (3), iterative code takes more mental effort to see
    that it works than does a simple recursive version.

    I don't think that

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

is particularly intellectually challenging.

    I'm sure there are other people who feel the same way.

    Even so, when a short, functional, recursive version
    is available, it is often easier to write, easier to
    read, and simpler to show that it produces correct
    results, in which case doing that is a better choice,
    even if an imperative version is not particularly
intellectually challenging.

    Unless, of course, it runs in exponential time, as the
    primitive version with the recurrence does.

    Compilers often do not do a good job of replacing recursion
    with iteration (is there a verb for that? Maybe it can be called iterationizing?), so it is something to consider if an algorithm
    is at all time critical or stack space is problematic.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Tue Jan 2 10:56:56 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

    [...]

    Incidentally your code produces wrong values, as
    fib(0) is 0 and fib(1) is 1.

    So return n for (n<2)?

    In situations like this one, usually I would favor something
    more like this:

    if( n < 8 ) return 0xd8532110u >> n*4 & 15;
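
That constant packs fib(0)..fib(7) = 0,1,1,2,3,5,8,13 as hex nibbles,
least significant nibble first (0xd = 13). A quick check of the
unpacking (an addition):

#include <assert.h>

int main(void)
{
    static const unsigned expect[8] = {0, 1, 1, 2, 3, 5, 8, 13};
    for (unsigned n = 0; n < 8; n++)
        assert((0xd8532110u >> n * 4 & 15) == expect[n]);
    return 0;
}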

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Tue Jan 2 10:49:13 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Pancho <Pancho.Jones@proton.me> writes:

    On 30/12/2023 23:19, BGB wrote:

    Granted, There are plenty of other much more efficient ways of
    calculating Fibonacci numbers...

    Yes. In order of effort.

    1) Memoize the recursive function.
2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
recursive. Which a smart compiler could theoretically convert to an
iterative loop.
    3) Write an iterative loop yourself.
4) Remember college maths, and solve the recurrence relation to give a
very simple closed form solution.

    All of these approaches have drawbacks of one sort or another.

    For (1), we end up writing more code and using more memory.
    Some problems are a good fit to memoization, but computing
Fibonacci numbers isn't one of them.

    For (2), compilers are not so good at dealing with tail
    recursion when the return value is a pair of results
    rather than a single result.

    For (3), iterative code takes more mental effort to see
    that it works than does a simple recursive version.

    I don't think that

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

is particularly intellectually challenging.

    I'm sure there are other people who feel the same way.

    Even so, when a short, functional, recursive version
    is available, it is often easier to write, easier to
    read, and simpler to show that it produces correct
    results, in which case doing that is a better choice,
    even if an imperative version is not particularly
intellectually challenging.

    Unless, of course, it runs in exponential time, as the
    primitive version with the recurrence does.

    Sure. Implicit in my comment is that the alternatives being
    considered are algorithmically comparable. Otherwise it's
    comparing apples and oranges.

    Compilers often do not do a good job of replacing recursion
    with iteration (is there a verb for that? Maybe it can be called iterationizing?), so it is something to consider if an algorithm
    is at all time critical or stack space is problematic.

    Of course. I take it as given that recursive functions that use
    an actual recursive call are unacceptable, except perhaps in
    special cases, as for example quicksort. I tend to use functional
    recursion and mutual recursion a fair amount, but not in cases
    where an actual call is done in the recursive cycle.

    By the way a term that may satisfy your question is "tail call
    optimization".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Jan 2 19:10:08 2024
    Thomas Koenig wrote:

    Compilers often do not do a good job of replacing recursion
    with iteration (is there a verb for that? Maybe it can be called
    iterationizing?), so it is something to consider if an algorithm
    is at all time critical or stack space is problematic.


    Tail recursion elimination.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Jan 2 19:08:49 2024
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    do { // we know n >=2 so at least 1 path through the loop is performed.
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    } while( n > 2 );

    (This should be n >= 2)

    OK
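
    With that correction folded in, the do-while variant reads in full
    (keeping the fib(0) == fib(1) == 1 convention of the original):

    unsigned long int fib (int n)
    {
        unsigned long f1, f2, fn;

        if (n < 2)
            return 1;

        f1 = 1;
        f2 = 1;
        do { // we know n >= 2, so the loop body runs at least once
            fn = f2 + f1;
            f1 = f2;
            f2 = fn;
            n--;
        } while (n >= 2);
        return fn;
    }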

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Tue Jan 2 19:01:00 2024
    Terje Mathisen wrote:

    Tim Rentsch wrote:
    Daniel James <daniel@me.invalid> writes:

    On 31/12/2023 09:12, Pancho wrote:

    In fact, given that fib grows exponentially, so we only ever need to
    calculate it for small values of n, the only "bad" solution is the
    naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you
    need to compute it often enough that efficiency might become relevant
    just put the results in a table and use that. Given that fib goes over
    10^12 after about 60 terms it needn't be a very big table.

    When the size of the table is more than ten times the size of
    the code, and the code is reasonably fast, using a table is a
    poor choice. Fibonacci functions don't have to be slow.

    The naive doubly recursive solution has the huge merit of being
    human-readable.

    But unacceptably slow.

    Unlike, say:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    ... which is great for fibs of big values, but not fast for small ones.

    Unacceptable for fibs of big values, because the values computed
    are wrong.


    So either you move to quad or arbitrary precision (which will of course
    be somewhat slow), or you use a hybrid technique, with lookup tables for
    smallish n, an iterative loop for somewhat larger, and your O(log(n))
    approach only for very large n?

    Working with 256-bit uints, each addition will take 4-8 clock cycles,
    and it overflows well before n reaches 400, right? In that case a worst
    case iterative solution is less than 3000 clock cycles or about a
    microsecond away.

    Whereas table lookup is never more than 100ns away (in DRAM).

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Tue Jan 2 11:19:06 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    MitchAlsup wrote:

    Thomas Koenig wrote:

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    do { // we know n >=2 so at least 1 path through the loop is performed.
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    } while( n > 2 );
    return fn;
    }

    Saves 1 cycle and and at least 1 instruction of code footprint.

    I have written asm code for this, and noted that on x86 where reg
    MOVes used to cost a cycle, it was better to unroll and get rid of the reg-reg moves.

    I think the following should work:

    u64 fib(unsigned n)
    {
    if (n-- < 2) return 1;
    u64 a = 1, b = 1;

    a += (n & 1); // Odd remaining n
    n >>= 1;

    while (n--) {
    b += a;
    a += b;
    }
    return a;
    }

    This code still has the off-by-one error w.r.t. the
    value of n.

    We can fix that and simultaneously simplify the code:

    u64
    fib( unsigned n ){
    u64 b = n&1, a = b^1;

    n >>= 1;
    while( n-- ) a += b, b += a;

    return b;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Pancho on Tue Jan 2 11:27:52 2024
    Pancho <Pancho.Jones@proton.me> writes:

    On 01/01/2024 15:00, Daniel James wrote:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    ... which is great for fibs of big values, but not fast for small
    ones.

    From a mathematical viewpoint, the closed form solution gives clearer behaviour.

    The problem with using this mathematically exact formula is
    that the code gives wrong answers. Using floating point is
    a poor choice for computing Fibonacci values.
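
    A small experiment makes the point concrete (my sketch; exactly
    where the crossover happens depends on the pow/sqrt implementation,
    but with ordinary IEEE-754 doubles the rounded result typically goes
    wrong around n = 70, long before 64-bit integers run out).  Note it
    uses the fib(0) == 0 convention, not the n+1 offset above.

    #include <math.h>
    #include <stdio.h>

    /* Find the first n where Binet's formula in double precision
       disagrees with exact 64-bit integer iteration. */
    int main(void)
    {
        unsigned long long a = 0, b = 1;        /* fib(0), fib(1) */
        for (int n = 0; n <= 92; n++) {
            double approx = (pow((1 + sqrt(5.0)) / 2, n)
                           - pow((1 - sqrt(5.0)) / 2, n)) / sqrt(5.0);
            if ((unsigned long long)(approx + 0.5) != a) {
                printf("first mismatch at n = %d\n", n);
                return 0;
            }
            unsigned long long t = a + b; a = b; b = t;
        }
        printf("no mismatch up to n = 92\n");
        return 0;
    }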

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Tue Jan 2 12:19:52 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Daniel James <daniel@me.invalid> writes:

    However, the (signed) 64-bit integer result overflows for fib(n) where
    n > 45, which is more of a problem.

    fib( 92 ) is still less than 1UL << 63 (or fib(93) for a 64-bit
    unsigned type).

    And fib( 186 ) still fits in a 128-bit unsigned type.

    If calculating fib(n) for n in the fixed 0..186 range is actually
    important to you, then you would obviously allocate 16*187 (about 3KB)
    bytes of table space and do a simple lookup instead.

    I have no doubt that some people would, but I wouldn't. It isn't
    hard to write code that has very good performance, and that doesn't
    have the associated costs in terms of memory use, cache behavior,
    and potential page faults and TLB misses.

    (To be clear, I am talking specifically about computing fibonacci
    values.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Tue Jan 2 13:58:05 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Daniel James <daniel@me.invalid> writes:

    On 31/12/2023 09:12, Pancho wrote:

    In fact, given that fib grows exponentially, so we only ever need to
    calculate it for small values of n, the only "bad" solution is the
    naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you
    need to compute it often enough that efficiency might become relevant
    just put the results in a table and use that. Given that fib goes over
    10^12 after about 60 terms it needn't be a very big table.

    When the size of the table is more than ten times the size of
    the code, and the code is reasonably fast, using a table is a
    poor choice. Fibonacci functions don't have to be slow.

    The naive doubly recursive solution has the huge merit of being
    human-readable.

    But unacceptably slow.

    Unlike, say:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    ... which is great for fibs of big values, but not fast for small ones.

    Unacceptable for fibs of big values, because the values computed
    are wrong.

    So either you move to quad or arbitrary precision (which will of
    course be somewhat slow),

    To me, what makes floating point unattractive in this case is
    great care must be taken to ensure correct values are produced.
    Add to that the increased time cost, and using floating point
    just looks like a lot more downside than upside.

    or you use a hybrid technique, with lookup
    tables for smallish n, iterative loop for somewhat larger and your
    O(log(n)) approach only for very large n?

    Oh, you remembered I wrote an O(log(n)) fibonacci function. That
    is nice. :)

    Based on what I know now, my inclination is to use the constant-shift
    technique (as seen in another post) for n < 8, the O(log(n)) method
    for results that will be more than 64 bits, and a small-constant-linear
    method in between those two regions. (Incidentally, the intermediate
    method is something I had already written, and might be seen as a
    generalization of your two-fold speedup code. You might want to take
    a look at enhancing the two-fold speedup idea so it does at most one of
    those, and then after that gets a four-fold speedup. The best
    performing code I have for 64 bits goes two steps farther, so all the
    iterations after the "startup transient" are 16-fold speedup.)

    Working with 256-bit uints, each addition will take 4-8 clock cycles,
    and it overflows well before n reaches 400, right? In that case a
    worst case iterative solution is less than 3000 clock cycles or about
    a microsecond away.

    By test, my O(log(n)) code runs in 25-30 nanoseconds for 128-bit
    results. At a guess, 256-bit results would run about 5 to 10
    times slower than that.

    By the way, your two-fold speedup code runs about 1.6 times as
    fast as using simple iteration.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Tim Rentsch on Tue Jan 2 22:41:37 2024
    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:



    Incidentally your code produces wrong values, as
    fib(0) is 0 and fib(1) is 1.

    So return n for (n<2)?

    In situations like this one, usually I would favor something
    more like this:

    if( n < 8 ) return 0xd8532110u >> n*4 & 15;

    Which is::

    CMP R2,R1,#8
    PLT R2,TTT
    LEA R2,[IP,R1<<2,0xF00000000-.]
    SLL R1,0xd853110,R2
    RET

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Wed Jan 3 14:12:27 2024
    Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Pancho <Pancho.Jones@proton.me> writes:

    On 30/12/2023 23:19, BGB wrote:

    Granted, there are plenty of other much more efficient ways of
    calculating Fibonacci numbers...

    Yes. In order of effort.

    1) Memoize the recursive function.
    2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
    recursive. Which a smart compiler could theoretically convert to an
    iterative loop.
    3) Write an iterative loop yourself.
    4) Remember college maths, and solve the recurrence relation to give a
    very simple closed form solution.

    All of these approaches have drawbacks of one sort or another.

    For (1), we end up writing more code and using more memory.
    Some problems are a good fit to memoization, but computing
    Fibonacci numbers isn't one of them.

    For (2), compilers are not so good at dealing with tail
    recursion when the return value is a pair of results
    rather than a single result.

    For (3), iterative code takes more mental effort to see
    that it works than does a simple recursive version.

    I don't think that

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

    is particularly intellectually challenging.

    I'm sure there are other people who feel the same way.

    Even so, when a short, functional, recursive version
    is available, it is often easier to write, easier to
    read, and simpler to show that it produces correct
    results, in which case doing that is a better choice,
    even if an imperative version is not particularly
    intellectually challenging.

    Unless, of course, it runs in exponential time, as the
    primitive version with the recurrence does.

    Sure. Implicit in my comment is that the alternatives being
    considered are algorithmically comparable. Otherwise it's
    comparing apples and oranges.

    Compilers often do not do a good job of replacing recursion
    with iteration (is there a verb for that? Maybe it can be called
    iterationizing?), so it is something to consider if an algorithm
    is at all time critical or stack space is problematic.

    Of course. I take it as given that recursive functions that use
    an actual recursive call are unacceptable, except perhaps in
    special cases, as for example quicksort. I tend to use functional

    In my own quicksort implementation (from 3-4 decades ago) I used
    recursion for the smaller partition and looping for the other/larger
    half. That way stack usage/recursion depth was guaranteed to never
    exceed log2(n/5). The /5 divisor was because I used a different sort
    algorithm on partition sizes <= 5.

    recursion and mutual recursion a fair amount, but not in cases
    where an actual call is done in the recursive cycle.

    By the way a term that may satisfy your question is "tail call
    optimization".

    As noted, that is always a good idea even when having two (or more!)
    recursive calls in the source code.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Wed Jan 3 14:03:29 2024
    Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    #include <stdio.h>

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    do { // we know n >=2 so at least 1 path through the loop is performed.
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    } while( n > 2 );

    (This should be n >= 2)
    return fn;
    }

    Saves 1 cycle and and at least 1 instruction of code footprint.

    Fortunately, this is one of the transformations which compilers
    do automatically these days. For both versions, the code for
    fib generated by a recent gcc on x86_64 with -O3 is

    fib:
    .LFB11:
    .cfi_startproc
    movl $1, %edx
    cmpl $1, %edi
    jle .L1
    movl $1, %eax
    movl $1, %ecx
    jmp .L3
    .p2align 4,,10
    .p2align 3
    .L5:
    movq %rdx, %rax
    .L3:
    subl $1, %edi
    leaq (%rcx,%rax), %rdx
    movq %rax, %rcx
    cmpl $1, %edi
    jne .L5
    .L1:
    movq %rdx, %rax
    ret

    which looks pretty good.

    OK at best:

    The decrement of EDI should be paired with the JNE, saving a CMP, and
    the two reg-reg moves could be avoided with a single level of unrolling.

    OTOH, with 3-way superscalar processing and zero-cycle MOV, it should
    still run at about two cycles/iteration, and a single cycle is a hard
    limit due to the latency of the ADD (or in this case, LEA).

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Wed Jan 3 14:23:11 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

    [...]

    Incidentally your code produces wrong values, as
    fib(0) is 0 and fib(1) is 1.

    So return n for (n<2)?

    In situations like this one, usually I would favor something
    more like this:

    if( n < 8 ) return 0xd8532110u >> n*4 & 15;

    Nice!

    Could be extended a little bit more with a lookup table, or to 16 values
    with a SIMD in-register lookup operation. Using a 64-bit constant allows
    11 values to fit inside 6-bit sub-containers, as long as they are stored
    in reverse order so that the bottom value (0) can survive without the
    top two bits.

    Something like

    if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
    (60-n*6) & 63;

    Anyway, this is code golfing and not a realistic scenario where you
    would definitely use a direct lookup table for all the possible values.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Wed Jan 3 17:53:53 2024
    Terje Mathisen wrote:


    Something like

    if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>> (60-n*6) & 63;


    unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144, 233 };
    unsigned short half[11] = { 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368 };

    fib( unsigned n )
    {
    if( n <= 14 ) return bytes[n];
    if( n <= 25 ) return half[n-14];
    return fib( n-1 ) + fib( n-2 );
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Wed Jan 3 20:17:26 2024
    MitchAlsup wrote:
    Terje Mathisen wrote:


    Something like

       if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
    (60-n*6) & 63;


    unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144,
    233 };
    unsigned short half[11] = {377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368 };

    fib( unsigned n )
    {
        if( n <= 14 ) return bytes[n];
        if( n <= 25 ) return half[n-14];
        return fib( n-1 ) + fib( n-2 );
    }
    Very nice except for not using the unroll-by-two, started with the two
    last table entries? (Or second-to-last to fix any parity issue)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Wed Jan 3 21:05:32 2024
    unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144, 233 };
    unsigned short half[11] = { 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368 };

    fib( unsigned n )
    {
    if( n <= 14 ) return bytes[n];
    if( n <= 25 ) return half[n-14];
    uint64_t a = half[9], b = half[10];
    for( n -= 25; n; n-- )
    {
    b += a;
    a += b;
    }
    return a;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to MitchAlsup on Wed Jan 3 19:43:23 2024
    mitchalsup@aol.com (MitchAlsup) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:



    Incidentally your code produces wrong values, as
    fib(0) is 0 and fib(1) is 1.

    So return n for (n<2)?

    In situations like this one, usually I would favor something
    more like this:

    if( n < 8 ) return 0xd8532110u >> n*4 & 15;

    Which is::

    CMP R2,R1,#8
    PLT R2,TTT
    LEA R2,[IP,R1<<2,0xF00000000-.]
    SLL R1,0xd853110,R2
    RET

    ....

    Probably not exactly that, but thank you.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to MitchAlsup on Wed Jan 3 20:41:07 2024
    mitchalsup@aol.com (MitchAlsup) writes:

    Terje Mathisen wrote:

    Something like

    if (n < 11) return
    ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>> (60-n*6) & 63;

    unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144, 233 };
    unsigned short half[11] = { 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368 };

    fib( unsigned n )
    {
    if( n <= 14 ) return bytes[n];
    if( n <= 25 ) return half[n-14];
    return fib( n-1 ) + fib( n-2 );
    }

    Several bugs in this code (and exponential blowup when n > 25).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Wed Jan 3 20:37:19 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    [...]

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

    [...]

    Incidentally your code produces wrong values, as
    fib(0) is 0 and fib(1) is 1.

    So return n for (n<2)?

    In situations like this one, usually I would favor something
    more like this:

    if( n < 8 ) return 0xd8532110u >> n*4 & 15;

    Nice!

    Could be extended a little bit more with a lookup table, or to 16
    values with a SIMD in-register lookup operation.

    No reason to provide more values. It turns out these initial 8
    values are just what the doctor ordered as a lead in to my latest
    iterative version. In earlier versions putting in this special
    case hurt more than it helped.

    Using a 64-bit
    constant allows 11 values to fit inside 6-bit sub-containers, as long
    as they are stored in reverse order so that the bottom value (0) can
    survive without the top two bits.

    Something like

    if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>> (60-n*6) & 63;

    Probably more clever than useful. If I were doing this most likely
    I would write the constant in octal:

    0001010203051015254267u >> (60-6*n) & 63

    Note by the way that you left off the value 55, which puts the
    values 21 and 34 in the wrong places.

    Anyway, this is code golfing and not a realistic scenario where you
    would definitely use a direct lookup table for all the possible
    values.

    A full lookup table is too doggone big. Also part of the point
    of using the constant shift method is not to access data memory,
    as that has other costs (and which are unlikely to show up in
    speed measurements). The computational form is plenty fast
    enough (worst case time between 7 and 8 nanoseconds in my tests).

    This technique of packing several small values into a constant
    and shifting to get a specific result is useful in lots of
    situations. For example, if we are writing a function to test
    for primality, it can start with

    if( n < 64 ) return 0240404242024042424254 >>n &1;

    for a fast test on small integers.
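
    The constant is easy to regenerate if anyone prefers to check it
    rather than trust it (a throwaway sketch):

    #include <stdio.h>

    /* Rebuild the 64-entry small-prime bitmap by trial division and
       print it in octal for comparison with the constant above. */
    int main(void)
    {
        unsigned long long map = 0;
        for (int n = 2; n < 64; n++) {
            int prime = 1;
            for (int d = 2; d * d <= n; d++)
                if (n % d == 0) { prime = 0; break; }
            if (prime) map |= 1ULL << n;
        }
        printf("0%llo\n", map);   /* expect 0240404242024042424254 */
        return 0;
    }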

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to MitchAlsup on Wed Jan 3 20:49:56 2024
    mitchalsup@aol.com (MitchAlsup) writes:

    unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144, 233 };
    unsigned short half[11] = { 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368 };

    fib( unsigned n )
    {
    if( n <= 14 ) return bytes[n];
    if( n <= 25 ) return half[n-14];
    uint64_t a = half[9], b = half[10];
    for( n -= 25; n; n-- )
    {
    b += a;
    a += b;
    }
    return a;
    }

    You need to divide n-25 by 2, and adjust for an even or odd
    value before the divide.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Thu Jan 4 08:07:40 2024
    MitchAlsup wrote:
    unsigned char bytes[14] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89, 144,
    233 };
    unsigned short half[11] = {377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368 };

    fib( unsigned n )
    {
        if( n <= 14 ) return bytes[n];
        if( n <= 25 ) return half[n-14];
        uint64_t a = half[9], b = half[10];
        for( n -= 25; n; n-- )
        {
             b += a;
             a += b;
        }
        return a;
    }

    As Tim wrote, you need something like

    uint64_t a = half[8+(n&1)], b = half[9+(n&1)];
    for (n = (n-24)>>1; n; n--) {}

    to start the unrolled loop with the correct parity, although I might
    have an off-by-one error in the (a,b) initializer here...

    Since Tim is using a 4-way unroll, I believe he would be loading 4
    values from a table of at least 7 entries to handle loop alignment.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Thu Jan 4 07:59:21 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    [...]

    unsigned long int fib (int n)
    {
    unsigned long f1, f2, fn;

    if (n < 2)
    return 1;

    f1 = 1;
    f2 = 1;
    while (n >= 2) {
    fn = f2 + f1;
    f1 = f2;
    f2 = fn;
    n--;
    }
    return fn;
    }

    [...]

    Incidentally your code produces wrong values, as
    fib(0) is 0 and fib(1) is 1.

    So return n for (n<2)?

    In situations like this one, usually I would favor something
    more like this:

    if( n < 8 ) return 0xd8532110u >> n*4 & 15;

    Nice!

    Could be extended a little bit more with a lookup table, or to 16
    values with a SIMD in-register lookup operation.

    No reason to provide more values. It turns out these initial 8
    values are just what the doctor ordered as a lead in to my latest
    iterative version. In earlier versions putting in this special
    case hurt more than it helped.

    Using a 64-bit
    constant allows 11 values to fit inside 6-bit sub-containers, as long
    as they are stored in reverse order so that the bottom value (0) can
    survive without the top two bits.

    Something like

    if (n < 11) return ((0<<60)+(1<<54)+(1<<48)+(2<<42)...(21<<6)+34)>>
    (60-n*6) & 63;

    Probably more clever than useful. If I were doing this most likely
    I would write the constant in octal:

    0001010203051015254267u >> (60-6*n) & 63

    Note by the way that you left off the value 55, which puts the
    values 21 and 34 in the wrong places.

    Anyway, this is code golfing and not a realistic scenario where you
    would definitely use a direct lookup table for all the possible
    values.

    A full lookup table is too doggone big. Also part of the point
    of using the constant shift method is not to access data memory,
    as that has other costs (and which are unlikely to show up in
    speed measurements). The computational form is plenty fast
    enough (worst case time between 7 and 8 nanoseconds in my tests).

    I agree, this is indeed "fast enough".

    This technique of packing several small values into a constant
    and shifting to get a specific result is useful in lots of
    situations. For example, if we are writing a function to test
    for primality, it can start with

    if( n < 64 ) return 0240404242024042424254 >>n &1;

    for a fast test on small integers.


    I have been using this particular trick since at least the 386, so
    around 1990. :-)

    It did not make the same kind of sense when registers only had 16 bits
    and a table lookup was about the same cost as a shift & mask.

    Your 64-bit bitmap here is (inversely) related to the modulo 30 packing
    I'm using in my sieve, where the 8 possible prime locations in each
    block of 30 numbers will fit in a byte.
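
    For readers who have not met the mod-30 trick: of the 30 residues in
    each block, only the 8 that are coprime to 2, 3 and 5 can contain a
    prime above 5, so one byte per block of 30 suffices.  A minimal
    sketch of the indexing (my illustration, not Terje's actual sieve):

    /* The 8 residues mod 30 that are coprime to 2, 3 and 5. */
    static const unsigned wheel[8] = { 1, 7, 11, 13, 17, 19, 23, 29 };

    /* Map n > 5 to (byte index, bit mask) in a mod-30 packed bitmap;
       returns 0 when the wheel already rules n out. */
    static int bit_for(unsigned n, unsigned *byte, unsigned char *mask)
    {
        unsigned r = n % 30;
        for (int i = 0; i < 8; i++)
            if (wheel[i] == r) {
                *byte = n / 30;
                *mask = (unsigned char)(1 << i);
                return 1;   /* e.g. 97 -> byte 3, mask 0x02 */
            }
        return 0;           /* n is divisible by 2, 3 or 5 */
    }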

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Terje Mathisen on Thu Jan 4 10:09:53 2024
    Terje Mathisen wrote:

    Since Tim is using a 4-way unroll, I believe he would be loading 4
    values from a table of at least 7 entries to handle loop alignment.


    I took another look at this, and it looks pretty nice to use 4 starting
    entries (a..d in increasing order), then a two-cycle recurrence:

    a = c+d
    b = c+2*d
    c = a+b
    d = a+2*b

    which every LEA-capable compiler around could convert into

    lea rax,[rcx+rdx]
    lea rbx,[rcx+2*rdx]

    lea rcx,[rax+rbx]
    lea rdx,[rax+2*rbx]

    The pairs of instructions can run at the same time, so this can run in
    two cycles as long as back-to-back LEA can run in subsequent cycles.

    To run even faster than this I seem to need actual multiplications, with
    small fibonacci numbers as the multipliers:

    f(n+1) = f(n)+f(n-1)
    f(n+2) = 2*f(n)+f(n-1)
    f(n+3) = 3*f(n)+2*f(n-1)
    f(n+4) = 5*f(n)+3*f(n-1)
    ...
    f(n+8) = 34*f(n)+21*f(n-1)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Thu Jan 4 11:00:23 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Daniel James <daniel@me.invalid> writes:

    On 31/12/2023 09:12, Pancho wrote:

    In fact, given that fib grows exponentially, so we only ever need to
    calculate it for small values of n, the only "bad" solution is the
    naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you
    need to compute it often enough that efficiency might become relevant
    just put the results in a table and use that. Given that fib goes over
    10^12 after about 60 terms it needn't be a very big table.

    When the size of the table is more than ten times the size of
    the code, and the code is reasonably fast, using a table is a
    poor choice. Fibonacci functions don't have to be slow.

    The naive doubly recursive solution has the huge merit of being
    human-readable.

    But unacceptably slow.

    Unlike, say:

    double fib( int n )
    {
    return (pow( (1+sqrt(5.0))/2, n+1 )
    - pow( (1-sqrt(5.0))/2, n+1 ))
    / sqrt(5.0);
    }

    ... which is great for fibs of big values, but not fast for small ones.
    Unacceptable for fibs of big values, because the values computed
    are wrong.

    So either you move to quad or arbitrary precision (which will of
    course be somewhat slow,

    To me, what makes floating point unattractive in this case is
    great care must be taken to ensure correct values are produced.
    Add to that the increased time cost, and using floating point
    just looks like a lot more downside than upside.

    or you use a hybrid technique, with lookup
    tables for smallish n, iterative loop for somewhat larger and your
    O(log(n)) approach only for very large n?

    Oh, you remembered I wrote an O(log(n)) fibonacci function. That
    is nice. :)

    Based on what I know now, my inclination is to use the constant-shift
    technique (as seen in another post) for n < 8, the O(log(n)) method
    for results that will be more than 64 bits, and a small-constant-linear
    method in between those two regions. (Incidentally, the intermediate
    method is something I had already written, and might be seen as a
    generalization of your two-fold speedup code. You might want to take
    a look at enhancing the two-fold speedup idea so it does at most one of
    those, and then after that gets a four-fold speedup. The best
    performing code I have for 64 bits goes two steps farther, so all the
    iterations after the "startup transient" are 16-fold speedup.)

    Do you actually get 16 results per clock cycle?

    I don't see _any_ way to do that...

    Working with 256-bit uints, each addition will take 4-8 clock cycles,
    and it overflows well before n reaches 400, right? In that case a
    worst case iterative solution is less than 3000 clock cycles or about
    a microsecond away.

    By test, my O(log(n)) code runs in 25-30 nanoseconds for 128-bit

    That is impressive!

    results. At a guess, 256-bit results would run about 5 to 10
    times slower than that.

    By the way, your two-fold speedup code runs about 1.6 times as
    fast as using simple iteration.

    That's in the expected ballpark.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Thu Jan 4 04:36:57 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:
    [...]
    This technique of packing several small values into a constant
    and shifting to get a specific result is useful in lots of
    situations. For example, if we are writing a function to test
    for primality, it can start with

    if( n < 64 ) return 0240404242024042424254 >>n &1;

    for a fast test on small integers.

    [...]

    Your 64-bit bitmap here is (inversely) related to the modulo 30
    packing I'm using in my sieve, where the 8 possible prime
    locations in each block of 30 numbers will fit in a byte.

    I wrote one of these about 20 years ago, and recently was motivated
    to write one again. I'm curious - have you tried running your
    sieve for all primes less than a trillion (1e12)?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Thu Jan 4 05:07:09 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Terje Mathisen wrote:

    Since Tim is using a 4-way unroll, I believe he would be loading 4
    values from a table of at least 7 entries to handle loop alignement.

    I took another look at this, and it looks pretty nice to use 4
    starting entries (a..d in increasing order), then a two-cycle
    recurrence:

    a = c+d
    b = c+2*d
    c = a+b
    d = a+2*b

    which every LEA-capable compiler around could convert into

    lea rax,[rcx+rdx]
    lea rbx,[rcx+2*rdx]

    lea rcx,[rax+rbx]
    lea rdx,[rax+2*rbx]

    The pairs of instructions can run at the same time, so this can run in
    two cycles as long as back-to-back LEA can run in subsequent cycles.

    To run even faster than this I seem to need actual multiplications,
    with small fibonacci numbers as the multipliers:

    f(n+1) = f(n)+f(n-1)
    f(n+2) = 2*f(n)+f(n-1)
    f(n+3) = 3*f(n)+2*f(n-1)
    f(n+4) = 5*f(n)+3*f(n-1)
    ...
    f(n+8) = 34*f(n)+21*f(n-1)

    As much as I am tempted to post the actual code, let me start
    with something more like a hint, which I fully expect you can
    carry forward to do the whole deal.

    (forward by 0) a b
    (forward by 1) b a+b
    (forward by 2) a+b a+2b
    (forward by 3) a+2b 2a+3b
    (forward by 4) 2a+3b 3a+5b
    (forward by 5) 3a+5b 5a+8b
    (forward by 6) 5a+8b 8a+13b

    and so forth.
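
    The pattern in that table is the standard Fibonacci matrix identity:
    stepping the pair (a, b) forward by k multiplies it by the k-th
    power of the 2x2 Fibonacci matrix, whose entries are themselves
    Fibonacci numbers.  In LaTeX notation:

    \begin{pmatrix} a' \\ b' \end{pmatrix}
      = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^{k}
        \begin{pmatrix} a \\ b \end{pmatrix}
      = \begin{pmatrix} F_{k-1} & F_k \\ F_k & F_{k+1} \end{pmatrix}
        \begin{pmatrix} a \\ b \end{pmatrix}

    which is why a forward-by-16 step uses the multipliers 610, 987 and
    1597, i.e. F(15), F(16) and F(17).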

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Thu Jan 4 05:13:57 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Just one short followup here for now...

    By test, my O(log(n)) code runs in 25-30 nanoseconds for 128-bit

    That is impressive!

    I took the O(log(n)) code and put it in python (with arbitrary
    precision integers). Running fib( 1000000 ) took well under
    a second.

    Exponentials are insanely slow, and O(log(n)) is insanely fast...
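
    Tim's O(log(n)) code itself never appears in the thread; for
    reference, the standard "fast doubling" identities
    F(2k) = F(k)*(2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2
    give the same complexity.  A sketch (not Tim's implementation):

    #include <stdint.h>

    /* O(log n) Fibonacci via fast doubling; exact for n <= 93, the
       largest n whose Fibonacci number fits in unsigned 64 bits.
       On return *fn = F(n) and *fn1 = F(n+1). */
    static void fib_pair(unsigned n, uint64_t *fn, uint64_t *fn1)
    {
        if (n == 0) { *fn = 0; *fn1 = 1; return; }
        uint64_t a, b;                  /* a = F(k), b = F(k+1), k = n/2 */
        fib_pair(n / 2, &a, &b);
        uint64_t c = a * (2*b - a);     /* F(2k)   */
        uint64_t d = a*a + b*b;         /* F(2k+1) */
        if (n & 1) { *fn = d; *fn1 = c + d; }
        else       { *fn = c; *fn1 = d;     }
    }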

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Thu Jan 4 18:00:45 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:
    [...]
    This technique of packing several small values into a constant
    and shifting to get a specific result is useful in lots of
    situations. For example, if we are writing a function to test
    for primality, it can start with

    if( n < 64 ) return 0240404242024042424254 >>n &1;

    for a fast test on small integers.

    [...]

    Your 64-bit bitmap here is (inversely) related to the modulo 30
    packing I'm using in my sieve, where the 8 possible prime
    locations in each block of 30 numbers will fit in a byte.

    I wrote one of these about 20 years ago, and recently was motivated
    to write one again. I'm curious - have you tried running your
    sieve for all primes less than a trillion (1e12)?

    No, I wrote this back in the 32-bit days and even though it is blocked
    (with a 128 kB block size corresponding to nearly 4M numbers sieved per
    block afair), I never ran it _that_ far!

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Thu Jan 4 23:08:26 2024
    Terje Mathisen wrote:


    To run even faster than this I seem to need actual multiplications,
    with small fibonacci numbers as the multipliers:

    f(n+1) = f(n)+f(n-1)
    f(n+2) = 2*f(n)+f(n-1)
    f(n+3) = 3*f(n)+2*f(n-1)
    f(n+4) = 5*f(n)+3*f(n-1)
    ...
    f(n+8) = 34*f(n)+21*f(n-1)

    Combining::

    unsigned short table[18] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89,
    144, 233, 377, 610, 987, 1597, 2584 };

    fib( unsigned n )
    {
    i = n & 15;
    a = table[i ];
    b = table[i+1];
    for( n -= i; n ; n -= 16 )
    {
    i = n & 15;
    a = a*table[i ] + b*table[i+1];
    b = a*table[i+1] + b*table[i+2];
    }
    return a;
    }

    Knocking off 16 fibs per loop.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to MitchAlsup on Thu Jan 4 23:19:28 2024
    MitchAlsup wrote:

    Terje Mathisen wrote:


    To run even faster than this I seem to need actual multiplications, with
    small fibonacci numbers as the multipliers:

    f(n+1) = f(n)+f(n-1)
    f(n+2) = 2*f(n)+f(n-1)
    f(n+3) = 3*f(n)+2*f(n-1)
    f(n+4) = 5*f(n)+3*f(n-1)
    ...
    f(n+8) = 34*f(n)+21*f(n-1)

    Combining::

    unsigned short table[18] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89,
    144, 233, 377, 610, 987, 1597, 2584 };

    fib( unsigned n )
    {
    i = n & 15;
    a = table[i ];
    b = table[i+1];
    for( n -= i; n ; n -= 16 )
    {
    i = n & 15;
    t = a*table[i ] + b*table[i+1];
    b = a*table[i+1] + b*table[i+2];
    a = t;
    }
    return a;
    }

    Knocking off 16 fibs per loop.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Fri Jan 5 10:04:31 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Terje Mathisen wrote:

    Since Tim is using a 4-way unroll, I believe he would be loading
    4 values from a table of at least 7 entries to handle loop
    alignement.

    I took another look at this, and it looks pretty nice to use 4
    starting entries (a..d in increasing order), then a two-cycle
    recurrence:

    a = c+d
    b = c+2*d
    c = a+b
    d = a+2*b

    which every LEA-capable compiler around could convert into

    lea rax,[rcx+rdx]
    lea rbx,[rcx+2*rdx]

    lea rcx,[rax+rbx]
    lea rdx,[rax+2*rbx]

    The pairs of instructions can run at the same time, so this can run
    in two cycles as long as back-to-back LEA can run in subsequent
    cycles.

    To run even faster than this I seem to need actual
    multiplications, with small fibonacci numbers as the multipliers:

    f(n+1) = f(n)+f(n-1)
    f(n+2) = 2*f(n)+f(n-1)
    f(n+3) = 3*f(n)+2*f(n-1)
    f(n+4) = 5*f(n)+3*f(n-1)
    ...
    f(n+8) = 34*f(n)+21*f(n-1)

    As much as I am tempted to post the actual code, let me start with
    something more like a hint, which I fully expect you can carry
    forward to do the whole deal.

    (forward by 0)   a       b
    (forward by 1)   b       a+b
    (forward by 2)   a+b     a+2b
    (forward by 3)   a+2b    2a+3b
    (forward by 4)   2a+3b   3a+5b
    (forward by 5)   3a+5b   5a+8b
    (forward by 6)   5a+8b   8a+13b

    and so forth.

    Isn't that exactly the same as what I posted just above, i.e. that
    subsequent levels require pairs of multiplications by smaller fib numbers?

    All these muls are of course independent, but since we only have a very
    small number of multipliers in current CPUS, we cannot expect to do more
    than a pair of them in each cycle (all with latency 3-5).

    Doing it with adds instead means that we have to generate a,2*a,3*a,5*a,8*a,13*a etc.

    We can use LEA and shifts to handle a*(2,3,5,8) and b ditto, so that
    gets us to the next 5 fib numbers all done in parallel and at latency 1
    or 2:

    All these can be done in parallel on a 5-wide CPU:
    a5 = LEA(a,a*4);
    b3 = LEA(b,b*2); b5 = LEA(b,b*4);
    f(n) = a+b;
    f(n+1) = LEA(b+a*2);

    These next operations can run in the second cycle, so we've generated 5
    new fib() numbers in two cycles:

    f(n+2) = f(n)+f(n+1);
    f(n+3) = a5+b3;
    f(n+4) = LEA(b5+a*8);

    In the same cycle we can generate a couple more aggregate multiples:

    a13 = LEA(a5,a*8); b13=LEA(b5,b*8);

    and then we get to spew out the 6th fib value in the third cycle

    f(n+5) = LEA(a13,b*8);

    but I get the feel that it is better to stay at 4 or 5 new numbers in
    each step?

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Fri Jan 5 08:55:10 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Terje Mathisen wrote:

    Since Tim is using a 4-way unroll, I believe he would be loading
    4 values from a table of at least 7 entries to handle loop
    alignement.

    I took another look at this, and it looks pretty nice to use 4
    starting entries (a..d in increasing order), then a two-cycle
    recurrence:

    a = c+d
    b = c+2*d
    c = a+b
    d = a+2*b

    which every LEA-capable compiler around could convert into

    lea rax,[rcx+rdx]
    lea rbx,[rcx+2*rdx]

    lea rcx,[rax+rbx]
    lea rdx,[rax+2*rbx]

    The pairs of instructions can run at the same time, so this can run
    in two cycles as long as back-to-back LEA can run in subsequent
    cycles.

    To run even faster than this I seem to need actual
    multiplications, with small fibonacci numbers as the multipliers:

    f(n+1) = f(n)+f(n-1)
    f(n+2) = 2*f(n)+f(n-1)
    f(n+3) = 3*f(n)+2*f(n-1)
    f(n+4) = 5*f(n)+3*f(n-1)
    ...
    f(n+8) = 34*f(n)+21*f(n-1)

    As much as I am tempted to post the actual code, let me start with
    something more like a hint, which I fully expect you can carry
    forward to do the whole deal.

    (forward by 0)   a       b
    (forward by 1)   b       a+b
    (forward by 2)   a+b     a+2b
    (forward by 3)   a+2b    2a+3b
    (forward by 4)   2a+3b   3a+5b
    (forward by 5)   3a+5b   5a+8b
    (forward by 6)   5a+8b   8a+13b

    and so forth.

    Isn't that exactly the same as what I posted just above, i.e. that
    subsequent levels require pairs of multiplications by smaller fib
    numbers?

    I didn't really follow what you were saying, but I'm pretty
    sure the two aren't the same, because I keep only two values,
    a and b, no matter how many levels of x-fold-speedup is done.

    All these muls are of course independent, but since we only have a
    very small number of multipliers in current CPUS, we cannot expect to
    do more than a pair of them in each cycle (all with latency 3-5).

    Doing it with adds instead means that we have to generate a,2*a,3*a,5*a,8*a,13*a etc.

    We can use LEA and shifts to handle a*(2,3,5,8) and b ditto, so that
    gets us to the next 5 fib numbers all done in parallel and at latency
    1 or 2:

    All these can be done in parallel on a 5-wide CPU:
    a5 = LEA(a,a*4);
    b3 = LEA(b,b*2); b5 = LEA(b,b*4);
    f(n) = a+b;
    f(n+1) = LEA(b+a*2);

    These next operations can run in the second cycle, so we've generated
    5 new fib() numbers in two cycles:

    f(n+2) = f(n)+f(n+1);
    f(n+3) = a5+b3;
    f(n+4) = LEA(b5+a*8);

    In the same cycle we can generate a couple more aggregate multiples:

    a13 = LEA(a5,a*8); b13=LEA(b5,b*8);

    and then we get to spew out the 6th fib value in the third cycle

    f(n+5) = LEA(a13,b*8);

    but I get the feel that it is better to stay at 4 or 5 new numbers in
    each step?

    In the code that follows, b is the current fibonacci number, and a
    is the previous fibonacci number.

    U64
    fib_example( unsigned n ){
    U64 b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    U64 na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    Note that fib(8) is 21, and fib(16) is 987.

    It should be easy to see how to modify this code to extend it one
    level and get a 32-fold speedup for the last stage, which does in
    fact give a significant performance improvement.

    Also I have no doubt that you can see ways to make the code run
    even faster. I'm interested to see what you come up with, with
    the stipulation that no lookup tables be used.
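
    For concreteness, the one-level extension might look like this (my
    sketch of what is being described; the new loop multipliers are
    F(31) = 1346269, F(32) = 2178309 and F(33) = 3524578):

    typedef unsigned long long U64;

    U64
    fib_example32( unsigned n ){
        U64 b = n&1, a = b^1;

        if( n & 2 ) a += b, b += a;
        if( n & 4 ) a += b, b += a, a += b, b += a;
        if( n & 8 ){
            U64 na = 13*a+21*b, nb = 21*a+34*b;
            a = na, b = nb;
        }
        if( n & 16 ){                /* new level: forward by 16 */
            U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
            a = na, b = nb;
        }

        n >>= 5;                     /* each iteration now advances by 32 */
        while( n-- ){
            U64 na = 1346269*a + 2178309*b, nb = 2178309*a + 3524578*b;
            a = na, b = nb;
        }

        return b;
    }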

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Fri Jan 5 19:58:34 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Terje Mathisen wrote:

    Since Tim is using a 4-way unroll, I believe he would be loading
    4 values from a table of at least 7 entries to handle loop
    alignement.

    I took another look at this, and it looks pretty nice to use 4
    starting entries (a..d in increasing order), then a two-cycle
    recurrence:

    a = c+d
    b = c+2*d
    c = a+b
    d = a+2*b

    which every LEA-capable compiler around could convert into

    lea rax,[rcx+rdx]
    lea rbx,[rcx+2*rdx]

    lea rcx,[rax+rbx]
    lea rdx,[rax+2*rbx]

    The pairs of instructions can run at the same time, so this can run
    in two cycles as long as back-to-back LEA can run in subsequent
    cycles.

    To run even faster than this I seem to need actual
    multiplications, with small fibonacci numbers as the multipliers:

    f(n+1) = f(n)+f(n-1)
    f(n+2) = 2*f(n)+f(n-1)
    f(n+3) = 3*f(n)+2*f(n-1)
    f(n+4) = 5*f(n)+3*f(n-1)
    ...
    f(n+8) = 34*f(n)+21*f(n-1)

    As much as I am tempted to post the actual code, let me start with
    something more like a hint, which I fully expect you can carry
    forward to do the whole deal.

    (forward by 0)   a       b
    (forward by 1)   b       a+b
    (forward by 2)   a+b     a+2b
    (forward by 3)   a+2b    2a+3b
    (forward by 4)   2a+3b   3a+5b
    (forward by 5)   3a+5b   5a+8b
    (forward by 6)   5a+8b   8a+13b

    and so forth.

    Isn't that exactly the same as what I posted just above, i.e. that
    subsequent levels require pairs of multiplications by smaller fib
    numbers?

    I didn't really follow what you were saying, but I'm pretty
    sure the two aren't the same, because I keep only two values,
    a and b, no matter how many levels of x-fold-speedup is done.

    All these muls are of course independent, but since we only have a
    very small number of multipliers in current CPUS, we cannot expect to
    do more than a pair of them in each cycle (all with latency 3-5).

    Doing it with adds instead means that we have to generate
    a,2*a,3*a,5*a,8*a,13*a etc.

    We can use LEA and shifts to handle a*(2,3,5,8) and b ditto, so that
    gets us to the next 5 fib numbers all done in parallel and at latency
    1 or 2:

    All these can be done in parallel on a 5-wide CPU:
    a5 = LEA(a,a*4);
    b3 = LEA(b,b*2); b5 = LEA(b,b*4);
    f(n) = a+b;
    f(n+1) = LEA(b+a*2);

    These next operations can run in the second cycle, so we've generated
    5 new fib() numbers in two cycles:

    f(n+2) = f(n)+f(n+1);
    f(n+3) = a5+b3;
    f(n+4) = LEA(b5+a*8);

    In the same cycle we can generate a couple more aggregate multiples:

    a13 = LEA(a5,a*8); b13=LEA(b5,b*8);

    and then we get to spew out the 6th fib value in the third cycle

    f(n+5) = LEA(a13,b*8);

    but I get the feel that it is better to stay at 4 or 5 new numbers in
    each step?

    In the code that follows, b is the current fibonacci number, and a
    is the previous fibonacci number.

    U64
    fib_example( unsigned n ){
    U64 b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    U64 na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    Note that fib(8) is 21, and fib(16) is 987.

    You are skipping intermediate numbers!


    It should be easy to see how to modify this code to extend it one
    level and get a 32-fold speedup for the last stage, which does in
    fact give a significant performance improvement.

    I should have realized that you were in fact doing so since you showed
    your log2(n) algorithm, but I was in fact assuming that you had to
    generate all intermediate values as well.

    Also I have no doubt that you can see ways to make the code run
    even faster. I'm interested to see what you come up with, with
    the stipulation that no lookup tables be used.

    The four MULs will be the limiting factor here, but the key idea is that
    there will only ever be four of them, so this could go however far you
    want to take it.

    With four MULs you should get a new (a,b) pair every 6-7 clock cycles
    (assuming two fully pipelined MUL units); this isn't that much faster
    than code which generates 5 new fib values every two cycles, and then
    you actually get to see all intermediate values as well.

    Still very nice idea. :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to MitchAlsup on Sat Jan 6 07:35:15 2024
    mitchalsup@aol.com (MitchAlsup) writes:

    MitchAlsup wrote:

    Terje Mathisen wrote:


    To run even faster than this I seem to need actual multiplications,
    with small fibonacci numbers as the multipliers:

    f(n+1) = f(n)+f(n-1)
    f(n+2) = 2*f(n)+f(n-1)
    f(n+3) = 3*f(n)+2*f(n-1)
    f(n+4) = 5*f(n)+3*f(n-1)
    ...
    f(n+8) = 34*f(n)+21*f(n-1)

    Combining::

    unsigned short table[18] = { 0, 1, 1, 2, 3, 5, 8, 13, 34, 55, 89,
    144, 233, 377, 610, 987, 1597, 2584 };

    fib( unsigned n )
    {
    i = n & 15;
    a = table[i ];
    b = table[i+1];
    for( n -= i; n ; n -= 16 )
    {
    i = n & 15;

    t = a*table[i ] + b*table[i+1];

    b = a*table[i+1] + b*table[i+2];

    a = t;

    After n -= i, the lower order four bits of n are zero, which means
    the coefficients being used are 0, 1, and 1. Also the table is
    missing the value 21.

    U64
    fib( unsigned n ){
    static unsigned short table[17] = {
    1, 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610,
    };
    U64 a = table[n&15], b = table[(n&15)+1];

    for( n >>= 4; n--; ){
    U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to BGB on Sat Jan 6 09:23:52 2024
    BGB <cr88192@gmail.com> writes:

    On 1/2/2024 2:08 AM, Tim Rentsch wrote:

    Daniel James <daniel@me.invalid> writes:

    On 31/12/2023 09:12, Pancho wrote:

    In fact, given that fib grows exponentially, so we only ever need to
    calculate it for small values of n, the only "bad" solution is the
    naive double recursive one.

    Well, I might disagree.

    Given that we only ever need to calculate it for small values of n (as
    you say, and if that's true) its efficiency doesn't matter. If you
    need to compute it often enough that efficiency might become relevant
    just put the results in a table and use that. Given that fib goes over
    10^12 after about 60 terms it needn't be a very big table.

    When the size of the table is more than ten times the size of
    the code, and the code is reasonably fast, using a table is a
    poor choice. Fibonacci functions don't have to be slow.

    The naive doubly recursive solution has the huge merit of being
    human-readable.

    But unacceptably slow.

    [..]

    But, say, if one wanted something less slow, say (untested):
    long fasterfib(int n)
    {
    long c0, c1, c2;
    if(n<2)
    return(1);
    c0=1;
    c1=1;
    c2=c0+c1;
    while(n>2)
    {
    c0=c1;
    c1=c2;
    c2=c0+c1;
    n--;
    }
    return(c2);
    }

    The Fibonacci numbers start at 0: fib(0) == 0, fib(1) == 1, ...
    The code above produces something else.

    What you're showing is a standard iterative version, much like
    the one posted by Thomas Koenig early on.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Sat Jan 6 10:20:16 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    [...]

    In the code that follows, b is the current fibonacci number, and a
    is the previous fibonacci number.

    U64
    fib_example( unsigned n ){
    U64 b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    U64 na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    U64 na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    Note that fib(8) is 21, and fib(16) is 987.

    You are skipping intermediate numbers!

    Oh yes. It never occurred to me to do anything but.

    [...]

    Still very nice idea. :-)

    Thank you. It seems obvious in retrospect, but that property
    holds for lots of ideas that are not obvious ahead of time.

    I have an interesting footnote. My first implementation of this
    idea was written using a functional recursive style. I was doing
    performance measurements using a relatively low optimization setting
    (-Os). Just this morning I decided to turn up the level to -O3, and
    was surprised to see that the functional version ran faster than
    everything else, 25% faster than the closest imperative version.
    I've seen this result before in other cases - code written in a
    functional style actually runs faster than code written in an
    imperative style.

    Here is the code:

    typedef unsigned long long U64;

    static U64 ff1( U64, U64, unsigned );
    static U64 ff2( U64, U64, unsigned );
    static U64 ff3( U64, U64, unsigned );
    static U64 ff4( U64, U64, unsigned );

    U64
    functional_fib( unsigned n ){
    return
    n < 8 ? 0xd8532110u >>(n*4) &15 :
    n & 1 ? ff1( 0, 1, n/2 ) :
    /*******/ ff1( 1, 0, n/2 );
    }

    U64
    ff1( U64 a, U64 b, unsigned k ){
    return k & 1 ? ff2( a+b, a+2*b, k/2 ) : ff2( a, b, k/2 );
    }

    U64
    ff2( U64 a, U64 b, unsigned k ){
    return k & 1 ? ff3( 2*a+3*b, 3*a+5*b, k/2 ) : ff3( a, b, k/2 );
    }

    U64
    ff3( U64 a, U64 b, unsigned k ){
    return k & 1 ? ff4( 13*a+21*b, 21*a+34*b, k/2 ) : ff4( a, b, k/2 );
    }

    U64
    ff4( U64 a, U64 b, unsigned k ){
    return k ? ff4( 610*a+987*b, 987*a+1597*b, k-1 ) : b;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Paul A. Clayton on Tue Jan 9 07:41:51 2024
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I also vaguely recall reading at some point that even
    recursion that was "easy" to convert to tail recursion was
    not handled by gcc (also many years ago).

    What looks like a tail-call to a Lisp or Scheme programmer actually is
    not when it comes to C. That's because in C calling conventions, the
    caller is responsible for cleaning up the stack, and this aspect of
    the calling conventions is necessitated by the combination of varargs
    and the optionality of function declarations.

    So when in a C program you have

    ...
    return foo(...);
    }

    in assembly language this is:

    call foo
    add $96, %rsp   ;or something like that
    ret

    and tail-call elimination of that is not possible in general. gcc has something called sib-call elimination ("sib" for sibling), and AFAIK
    this includes tail-recursion elimination, but I expect that even that
    does not work at all times, e.g., because a pointer is passed to the
    called function, and the C compiler cannot prove that it does not
    point to something that is on the stack frame of the caller.
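
    A sketch of that pointer-escape case (hypothetical example; the
    names are mine):

    int consume( int * );           /* defined elsewhere */

    int caller( void ){
        int local = 42;
        /* Syntactically a tail call, but &local points into caller's
           stack frame, so the frame must stay live across the call and
           the call cannot be turned into a jump. */
        return consume( &local );
    }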

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Jan 10 03:16:13 2024
    Anton Ertl wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I also vaguely recall reading at some point that even
    recursion that was "easy" to convert to tail recursion was
    not handled by gcc (also many years ago).

    What looks like a tail-call to a Lisp or Scheme programmer actually is
    not when it comes to C. That's because in C calling conventions, the
    caller is responsible for cleaning up the stack, and this aspect of
    the calling conventions is necessitated by the combination of varargs
    and the optionality of function declarations.

    Both the caller and the callee are responsible for cleaning up their own
    parts of the stack, including argument passing and result returning.

    My 66000 ABI is varargs compliant, and yet the caller does not need to
    know whether the function it is calling is varargs. The caller just lobs up
    to 8 arguments into R1..R8, with argument[9]+ on the stack at [SP]+; when
    the callee is varargs, the callee pushes R8..R1 onto the stack, creating a
    vector of arguments which can be indexed as the subroutine consumes one vararg after another.

    So when in a C program you have

    ...
    return foo(...);
    }

    in assembly language this is:

    call foo
    add $96, %rsp ;or something like that
    return

    The add $96 is the "subroutine calling foo()" cleaning itself up.

    and tail-call elimination of that is not possible in general.

    Prior to the control transfer, the stack must be as expected at the entry
    point and the local data must already be in its expected places {stack,
    registers}. This makes it harder, but not impossible. So, tail-call
    elimination requires running through the epilogue so you can branch
    back to the prologue. Once you understand this, you can instead tail-call
    to the post-prologue code and leave the stack alone.

    So, instead of:

    foo:
        prologue
    foo__body:
        body
        bc   foo__epilogue
        call foo()
    foo__epilogue:
        epilogue
        RET

    one can write:

    foo:
        prologue
    foo__body:
        body
        bc   foo__body
        epilogue
        RET
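
    In C terms this is the familiar transformation of a self call in
    tail position into a loop; a hypothetical sketch:

    /* before: self call in tail position */
    unsigned gcd_rec( unsigned a, unsigned b ){
        if( b == 0 ) return a;
        return gcd_rec( b, a % b );     /* tail call */
    }

    /* after: branch back to the body, run the epilogue only once */
    unsigned gcd_loop( unsigned a, unsigned b ){
        while( b != 0 ){                /* bc foo__body */
            unsigned t = a % b;
            a = b;
            b = t;
        }
        return a;                       /* epilogue + RET */
    }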

    The reason it works well and smoothly in LISP and Scheme is that both
    languages would be DOA without it.

    gcc has something called sib-call elimination ("sib" for sibling), and AFAIK
    this includes tail-recursion elimination, but I expect that even that
    does not work at all times, e.g., because a pointer is passed to the
    called function, and the C compiler cannot prove that it does not
    point to something that is on the stack frame of the caller.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Anton Ertl on Wed Jan 10 18:27:34 2024
    On Tue, 09 Jan 2024 07:41:51 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I also vaguely recall reading at some point that even
    recursion that was "easy" to convert to tail recursion was
    not handled by gcc (also many years ago).

    What looks like a tail-call to a Lisp or Scheme programmer actually is
    not when it comes to C. That's because in C calling conventions, the
    caller is responsible for cleaning up the stack, and this aspect of
    the calling conventions is necessitated by the combination of varargs
    and the optionality of function declarations.

    So when in a C program you have

    ...
    return foo(...);
    }

    in assembly language this is:

    call foo
    add $96, %rsp   ;or something like that
    ret

    and tail-call elimination of that is not possible in general. gcc has something called sib-call elimination ("sib" for sibling), and AFAIK
    this includes tail-recursion elimination, but I expect that even that
    does not work at all times, e.g., because a pointer is passed to the
    called function, and the C compiler cannot prove that it does not
    point to something that is on the stack frame of the caller.

    - anton

    You can tail-call and continuation-pass in C with the
    Cheney-on-the-MTA technique.

    https://plover.com/~mjd/misc/hbaker-archive/CheneyMTA.html

    TL;DR: the idea is to bookmark the stack with setjmp(), allow it to
    grow until some high-water mark is passed, then longjmp() to release
    it all at once.
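
    A minimal sketch of that control trick (hypothetical example; the
    real technique also copies live closures to the heap before the
    longjmp, which is omitted here, and this assumes a downward-growing
    stack):

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf trampoline;
    static char *stack_base;            /* stack position at the bookmark */
    static unsigned long live_n;        /* live state carried across longjmp */
    static unsigned long live_acc;

    #define HIGH_WATER (64 * 1024)      /* stack growth limit in bytes */

    static void count( unsigned long n, unsigned long acc ){
        char here;                      /* rough current stack position */
        if( n == 0 ){
            printf( "sum = %lu\n", acc );
            return;
        }
        if( (unsigned long)(stack_base - &here) > HIGH_WATER ){
            live_n = n;                 /* stash the live state ... */
            live_acc = acc;
            longjmp( trampoline, 1 );   /* ... and release the stack */
        }
        count( n - 1, acc + n );        /* "tail call" as a plain C call */
    }

    int main( void ){
        char base;
        stack_base = &base;
        live_n = 1000000;               /* sum 1..1000000 */
        live_acc = 0;
        setjmp( trampoline );           /* bookmark; longjmp lands here */
        count( live_n, live_acc );
        return 0;
    }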

    YMMV,
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to George Neuner on Thu Jan 11 07:59:16 2024
    George Neuner <gneuner2@comcast.net> writes:
    On Tue, 09 Jan 2024 07:41:51 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:
    So when in a C program you have

    ...
    return foo(...);
    }

    in assembly language this is:

    call foo
    add $96, %rsp   ;or something like that
    ret

    and tail-call elimination of that is not possible in general. gcc has something called sib-call elimination ("sib" for sibling), and AFAIK
    this includes tail-recursion elimination, but I expect that even that
    does not work at all times, e.g., because a pointer is passed to the
    called function, and the C compiler cannot prove that it does not
    point to something that is on the stack frame of the caller.

    - anton

    You can tail-call and continuation-pass in C with the
    Cheney-on-the-MTA technique.

    https://plover.com/~mjd/misc/hbaker-archive/CheneyMTA.html

    TLDR; the idea is to bookmark the stack with setjmp(), allow it to
    grow until some high-water mark is passed, then longjmp() to release
    it all at once.

    Yes, that is one way to implement the guarantee made by languages like
    Scheme that tail calls can nest without limit, but it's not
    particularly efficient.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)