• OT: A better sieve (was Re: Sieve of Erastosthenes optimized to the max

    From Vir Campestris@21:1/5 to All on Mon Jan 29 21:31:12 2024
    I retired last year, and I haven't really written any code since. This
    has turned out to be quite a fun little thing of a type I haven't had
    time for for YEARS. And oddly I still don't seem to have enough time for
    it... It's the garden, and the kid's gardens, and my mum's garden, and
    all those holidays :)

    But some optimisations. You'll remember in Bonita's first version the
    bitmap was initialised to 0xaaaa, because it's a waste of time doing the
    sieve for 2.

    I pointed out that we don't need to even store the even numbers.

    But there's more.

    If you look at the bitmap when you've sieved for 2 you see

    12 34 56 78
    11 10 10 10...
    which is a repeat of 2 after an initialisation word. That's the aaaa.

    You can do the same with 3

    123 456 789
    111 110 110 110

    except this time the repeat is 3. And annoyingly that doesn't map well
    down onto a byte-based architecture. You end up with an initial word,
    then a 3-word repeat. (If your word was 24 or 36 bytes it would only be
    1 word, but I haven't seen that since the 1970s)

    In hex, with lowest byte first, that is
    fd b6 6d db b6 6d db

    (That BTW is the same if you only store the odd numbers - a 3-word repeat.)

    So rather than start off with your bitmap all set to 1s you can set it
    to this repeating pattern. That replaces all the ANDs for all the values
    of three with a memory fill with far fewer accesses.

    You can do the same with 5:

    12345 6789a bcdef
    11111 11110 11110 11110

    You can then AND your pattern for 3 with the one for 5, and get one with
    a repeat length of 15, and set that into your bitmap. You've now
    replaced about a fifth of all your AND operations with a flood fill.

    This can carry on - for a while. Only a short while. You _can_ make a
    sequence for lots of primes. But it gets quite long, quite quickly. For
    up to 23, and not storing the evens, it's over 1E8 words long!

    I was implementing a version of that when something else occurred to me.
    You can sacrifice speed for store size if you're prepared to do an
    integer divide for every prime lookup.

    Group our numbers into 6s like this

    012345
    6789ab
    cdefgh (YKWIM)
    ijklmn

    Mask out the ones divisible by 2 or 3 and you see

    0123-5
    -7---b
    -d---h
    -i---n

    Except for the first "page" the pattern is identical. Your algorithm can be: Divide your candidate by 6.
    If the result is zero look up in the page zero table 0123-5
    If it is non-zero index into a translation table like this
    010002
    If the translation is zero the number is not prime.
    If it is not zero it is the index of the bit for this page which tells
    me if the number is prime. And there need only be 2 bits for 6 numbers.

    (6 is the product of the first primes 2 and 3)

    It's still worth masking out the even numbers - they don't need the
    divide, a simple AND detects them - and I suspect it's worth using a
    much larger number than 6. 3*5*11*13 is 15,015, which seems as though it
    might be a convenient size. And only about a third of the number aren't
    known to be primes (5760 to be exact)

    I might play with that idea some time.

    But a note for the group of course - optimising this to the max has
    nothing to do whatever with C++. The only C++ optimising I've found
    myself doing is using raw pointers, not vector's operator[]. (certainly
    not the at function). And also I found myself using emplace_back a lot.
    It's a PITA because you can only emplace back a single item, and it is slow.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Fri Feb 16 08:06:28 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    I retired last year, and I haven't really written any code since. This
    has turned out to be quite a fun little thing of a type I haven't had
    time for for YEARS. And oddly I still don't seem to have enough time
    for it... It's the garden, and the kid's gardens, and my mum's garden,
    and all those holidays :)

    But some optimisations. You'll remember in Bonita's first version the
    bitmap was initialised to 0xaaaa, because it's a waste of time doing
    the sieve for 2.

    I pointed out that we don't need to even store the even numbers.

    But there's more.

    If you look at the bitmap when you've sieved for 2 you see

    12 34 56 78
    11 10 10 10...
    which is a repeat of 2 after an initialisation word. That's the aaaa.

    You can do the same with 3

    123 456 789
    111 110 110 110

    except this time the repeat is 3. And annoyingly that doesn't map well
    down onto a byte-based architecture. You end up with an initial word,
    then a 3-word repeat. (If your word was 24 or 36 bytes it would only
    be 1 word, but I haven't seen that since the 1970s)

    In hex, with lowest byte first, that is
    fd b6 6d db b6 6d db

    (That BTW is the same if you only store the odd numbers - a 3-word repeat.)

    So rather than start off with your bitmap all set to 1s you can set it
    to this repeating pattern. That replaces all the ANDs for all the
    values of three with a memory fill with far fewer accesses.

    You can do the same with 5:

    12345 6789a bcdef
    11111 11110 11110 11110

    You can then AND your pattern for 3 with the one for 5, and get one
    with a repeat length of 15, and set that into your bitmap. You've now replaced about a fifth of all your AND operations with a flood fill.

    This can carry on - for a while. Only a short while. You _can_ make a sequence for lots of primes. But it gets quite long, quite
    quickly. For up to 23, and not storing the evens, it's over 1E8 words
    long!

    I was implementing a version of that when something else occurred to
    me. You can sacrifice speed for store size if you're prepared to do an integer divide for every prime lookup.

    I explained before how numbers can be considered mod 30, of which
    only the residues { 1, 7, 11, 13, 17, 19, 23, 29 } are candidates,
    because all other residues are divisible by 2, 3, or 5. And very
    conveniently, there are exactly 8 of these residues, so one byte
    can be used to represent each block of 30 numbers (only 8 of which
    might be prime). That cuts down on the space needed.

    I also explained how the divides and remainders can be avoided,
    by taking advantage of the structure of how the numbers being
    considered are represented, and arranging that the difficult parts
    be done at compile time.

    I have an implementation (written in C) based on this approach that
    determines all primes less than roughly 51.75 billion, taking just
    under 56 seconds to complete. (No threading is used - code is all
    single threaded.)

    [...]

    But a note for the group of course - optimising this to the max has
    nothing to do whatever with C++. The only C++ optimising I've found
    myself doing is using raw pointers, not vector's
    operator[]. (certainly not the at function). And also I found myself
    using emplace_back a lot. It's a PITA because you can only emplace
    back a single item, and it is slow.

    My program is written in C and uses ordinary C arrays. I expect it
    could be converted to C++ without too much difficulty.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Bonita Montero on Fri Feb 23 05:51:10 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 16.02.2024 um 17:06 schrieb Tim Rentsch:

    I have an implementation (written in C) based on this approach that
    determines all primes less than roughly 51.75 billion, taking just
    under 56 seconds to complete. (No threading is used - code is all
    single threaded.)

    On my 16 core PC this takes 1.73 seconds and 43 seconds
    overall CPU time without printing the numbers to a file.

    C:\Users\Boni\Documents\Source\bitmapSieve>timep "x64\Release-clang++\bitmapSieve.exe" 51750000000 ""
    real 1.729s
    user 43.094s
    sys 0.094s
    cylces 194.738.953.589

    If you run the program as a single thread, what is the
    elapsed time?

    Also how long does the program take to determine all
    primes up to 3 trillion? Here again I am interested
    in the single-thread version.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Bonita Montero on Sun Feb 25 00:48:59 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 23.02.2024 um 14:51 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 16.02.2024 um 17:06 schrieb Tim Rentsch:

    I have an implementation (written in C) based on this approach that
    determines all primes less than roughly 51.75 billion, taking just
    under 56 seconds to complete. (No threading is used - code is all
    single threaded.)

    On my 16 core PC this takes 1.73 seconds and 43 seconds
    overall CPU time without printing the numbers to a file.

    C:\Users\Boni\Documents\Source\bitmapSieve>timep
    "x64\Release-clang++\bitmapSieve.exe" 51750000000 ""
    real 1.729s
    user 43.094s
    sys 0.094s
    cylces 194.738.953.589

    If you run the program as a single thread, what is the
    elapsed time?

    C:\Users\Boni\Documents\Source\bitmapSieve>timep "x64\Release-clang++\bitmapSieve.exe" 51750000000 "" 1
    real 22.945s
    user 22.672s
    sys 0.234s
    cylces 102.917.207.295

    Hmmm. Interesting that the multi-threaded version uses almost
    most twice as much CPU as the single-threaded version. I might
    have guessed more, but not twice as much.

    Also how long does the program take to determine all
    primes up to 3 trillion? Here again I am interested
    in the single-thread version.

    C:\Users\Boni\Documents\Source\bitmapSieve>timep "x64\Release-clang++\bitmapSieve.exe" 3000000000 "" 1
    real 1.234s
    user 1.203s
    sys 0.031s
    cylces 5.523.677.002

    What I asked for was 3 trillion with a T, not 3 billion with
    a B. Not 3000000000 but 3000000000000.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Bonita Montero on Mon Mar 11 10:10:40 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 25.02.2024 um 09:48 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 23.02.2024 um 14:51 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 16.02.2024 um 17:06 schrieb Tim Rentsch:

    [...]

    Also how long does the program take to determine all
    primes up to 3 trillion? Here again I am interested
    in the single-thread version.

    C:\Users\Boni\Documents\Source\bitmapSieve>timep
    "x64\Release-clang++\bitmapSieve.exe" 3000000000 "" 1
    real 1.234s
    user 1.203s
    sys 0.031s
    cylces 5.523.677.002

    What I asked for was 3 trillion with a T, not 3 billion with
    a B. Not 3000000000 but 3000000000000.

    That would in a bitmap of 375GiB, which won't fit into my memory.

    Sounds like you're using 1 bit per number, most of which are
    wasted. If you use a different encoding the memory requirements
    can be much smaller. How much memory do you have on the box?
    If you have 64G you should be able to determine all primes
    less than 1.5 trillion, using a simple encoding.

    I've mentioned this encoding before but let me give it again.
    If numbers are considered mod 30, there are only 8 residues
    that are not divisible by 2, 3, or 5. The 8 residues are
    1, 7, 11, 13, 17, 19, 23, and 29. So a single byte can
    hold all the information needed for 30 numbers, which means

    1500000000000 / 30 = 50000000000

    which is to say 50 gigabytes should suffice.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Bonita Montero on Sat Apr 20 08:35:23 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 11.03.2024 um 18:10 schrieb Tim Rentsch:

    Sounds like you're using 1 bit per number, most of which are
    wasted. If you use a different encoding the memory requirements
    can be much smaller. How much memory do you have on the box?
    If you have 64G you should be able to determine all primes
    less than 1.5 trillion, using a simple encoding.

    I'm omitting even numbers and I handle the number two as a
    special case; that's the fastest solution.

    I've mentioned this encoding before but let me give it again.
    If numbers are considered mod 30, there are only 8 residues
    that are not divisible by 2, 3, or 5. The 8 residues are
    1, 7, 11, 13, 17, 19, 23, and 29. So a single byte can
    hold all the information needed for 30 numbers, which means

    1500000000000 / 30 = 50000000000

    which is to say 50 gigabytes should suffice.

    Show me the code.

    Apparently you have missed the point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Bonita Montero on Wed Apr 24 12:28:18 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 20.04.2024 um 17:35 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:
    [...]
    Show me the code.

    Apparently you have missed the point.

    I want to see the code for your idea.

    Yes I already understood what you want. That is what
    led me to conclude that you have missed the point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Bonita Montero on Thu Apr 25 14:14:52 2024
    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 24.04.2024 um 21:28 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 20.04.2024 um 17:35 schrieb Tim Rentsch:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    [...]

    Show me the code.

    Apparently you have missed the point.

    I want to see the code for your idea.

    Yes I already understood what you want. That is what
    led me to conclude that you have missed the point.

    I don't have "missed the point"; [...]

    There is more than one school of thought on that question.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)