• Re: Sieve of Eratosthenes optimized to the max

    From Vir Campestris@21:1/5 to Bonita Montero on Sun Dec 10 21:48:16 2023
    On 10/12/2023 09:46, Bonita Montero wrote:
    I've parallelized my sieve of Eratosthenes across all cores. The
    parallel threads do all the punching of the non-prime multiples beyond
    sqrt( max ). I found that the algorithm is limited by memory bandwidth,
    since the bit-range between sqrt( max ) and max is too large to fit
    inside the cache. So I partitioned each thread's work into a number of
    bit-ranges that fit inside the L2 cache. With that I got a speedup of
    a factor of 28 when calculating about the first 210 million primes,
    i.e. all primes <= 2 ^ 32.
    On my Zen4 16-core PC under Windows, with clang-cl 16.0.5, producing
    the primes without any file output takes only 0.28s. On my Linux PC,
    a Zen2 64-core CPU, producing the same number of primes takes about
    0.09s.
    The file output is done with my own buffering. Writing the mentioned
    prime-number range results in a 2.2GB file, which is written in about
    two seconds with a PCIe 4.0 SSD.
    The first parameter is the highest prime candidate to be generated;
    it can be prefixed with 0x for hex values. The second parameter is the
    name of the output file; it can be "" to suppress file output. The
    third parameter is the number of punching threads; if it is omitted,
    the number of threads defaults to the number of threads reported by
    the runtime.

    <snip code>

    Just so happens I've been thinking about this. I find your code
    impenetrable - there are no comments at all. But two things occurred to
    me quite quickly:
    - You don't need to store the odd numbers
    - You only need a bitmask to save the sieve.

    I think you're doing the latter, though it's not obvious. I think
    you're storing bits in an array of size_t, and that will be what
    0xAAAAAAAAAAAAAAACu and 0xAAAAAAAAAAAAAAAAu are about. However, if I
    am reading that right, you've assumed that unsigned is the same size
    as size_t, and that the size is 64 bits.
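    For what it's worth, the two ideas together look something like this
    minimal sketch (names are mine, not taken from the posted code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: a sieve bitmap that stores only odd numbers >= 3.
// Bit k represents the odd number 2*k + 3, so both suggestions
// (skip the evens, one bit per candidate) apply at once.
struct OddBitmap {
    std::vector<std::uint64_t> words;
    explicit OddBitmap(std::size_t limit)
        : words((limit / 2 + 63) / 64, ~std::uint64_t{0}) {}
    void clear(std::size_t n) {                 // mark odd n as composite
        std::size_t k = (n - 3) / 2;
        words[k / 64] &= ~(std::uint64_t{1} << (k % 64));
    }
    bool stillPrime(std::size_t n) const {      // odd n not yet punched out
        std::size_t k = (n - 3) / 2;
        return (words[k / 64] >> (k % 64)) & 1;
    }
};
```

    A full sieve would then punch out p*p, p*p+2p, p*p+4p, ... for each
    odd p, touching only half the bits of the naive version.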

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Bonita Montero on Mon Dec 11 17:12:00 2023
    On 11/12/2023 03:15, Bonita Montero wrote:
    On 10.12.2023 at 22:48, Vir Campestris wrote:

    - You don't need to store the odd numbers

    All primes except 2 are odd.

    I had a really bad night, and I was half asleep yesterday.

    Because all primes except 2 are odd you don't need to store the _even_
    numbers, which is what I meant to say. Or else you only need to store
    the odd ones.

    - You only need a bitmask to save the sieve.

    I'm using the "scan" lambda, which searches through the sieve
    as fast as possible with countr_zero; that maps to a single CPU
    instruction on modern processors.

    ... reading that right you've assumed that unsigned is the same
    size as size_t, and that the size is 64 bits.

    I've got a type word_t that is a size_t, but it can also
    be a uint8_t to test what the performance difference is.


    Your constants are 64 bit hexadecimal numbers (from counting the digits)
    and are unsigned (the u suffix) but there's no guarantee that size_t
    will be the same size as unsigned, nor that it will be 64 bit.
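    One way to sidestep that is to derive the mask from the chosen word
    type at compile time instead of hard-coding 64-bit literals; a sketch
    (assuming C++14, template name mine):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Build the "every odd bit" mask (the 0xAA... pattern) for whatever
// word type is in use, so the constant is always the right width.
template <typename Word>
constexpr Word alternating_mask() {
    Word m = 0;
    for (int b = 1; b < std::numeric_limits<Word>::digits; b += 2)
        m |= Word(Word{1} << b);
    return m;
}
```

    With size_t as the word type the mask then tracks whatever width
    size_t actually has on the platform.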

    Andy

  • From Vir Campestris@21:1/5 to Vir Campestris on Wed Dec 13 15:25:42 2023
    On 13/12/2023 15:16, Vir Campestris wrote:
    Well, I went away and tried it. I didn't write anything nearly as sophisticated as yours - I didn't worry about threads and caching. The
    code follows. You'll have to unwrap it in a few places.

    It occurred to me I could go further - not just optimise it out to store
    odd numbers only, but also miss out the ones divisible by 3, or other
    higher primes. The results were:
    - Storing all numbers: 9.494 seconds to run the sieve
    - Storing odd numbers only: 4.455. That's over twice as fast.
    - Excluding also the ones divisible by 3: 3.468. That's an improvement,
    but not as much as I expected. Perhaps it's got too complicated. I don't
    have a high-powered PC.

    And no sooner had I posted that than I realised I'd got the optimisation settings wrong.

    With -Ofast fed to g++ the version that doesn't store multiples of 3 is _slower_ than the odds only one!

    Andy

  • From Vir Campestris@21:1/5 to Bonita Montero on Wed Dec 13 15:16:17 2023
    On 11/12/2023 17:19, Bonita Montero wrote:
    On 11.12.2023 at 18:12, Vir Campestris wrote:

    Because all primes except 2 are odd you don't need to store the _even_
    numbers, which is what I meant to say. Or else you only need to store
    the odd ones.

    I'll try that in the next few days; but I don't expect a significant change.

    Well, I went away and tried it. I didn't write anything nearly as
    sophisticated as yours - I didn't worry about threads and caching. The
    code follows. You'll have to unwrap it in a few places.

    It occurred to me I could go further - not just optimise it out to store
    odd numbers only, but also miss out the ones divisible by 3, or other
    higher primes. The results were:
    - Storing all numbers: 9.494 seconds to run the sieve
    - Storing odd numbers only: 4.455. That's over twice as fast.
    - Excluding also the ones divisible by 3: 3.468. That's an improvement,
    but not as much as I expected. Perhaps it's got too complicated. I don't
    have a high-powered PC.

    Your constants are 64 bit hexadecimal numbers (from counting the
    digits) and are unsigned (the u suffix) but there's no guarantee that
    size_t will be the same size as unsigned, nor that it will be 64 bit.

    The constants are wide enough for 64-bit size_t's, but it won't make
    a functional difference if you define word_t as a uint8_t. With a
    uint8_t word_t the code runs about 1/6th slower; that's less than
    I expected.

    Most likely what you are seeing there is that you've gone from being
    memory-bound to being compute-bound. The CPU cache will make sure the
    main memory accesses are 128 bits (or whatever your bus width is).

    Andy
    --
    #include <cassert>
    #include <chrono>
    #include <cmath>
    #include <iostream>
    #include <thread>
    #include <vector>

    size_t PrimeCount = 1e8;

    // this is a convenience class for timing.
    class stopwatch {
    std::chrono::time_point<std::chrono::steady_clock> mBegin, mEnd;
    public:
    stopwatch() {}

    // record the start of an interval
    void start()
    {
    mBegin = std::chrono::steady_clock::now();
    }

    // record the end of an interval
    void stop()
    {
    mEnd = std::chrono::steady_clock::now();
    }

    // report the difference between the last start and stop
    double delta() const
    {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        mEnd - mBegin).count() / 1000.0;
    }
    };

    // the bitmask will be stored in a class that looks like this
    class Primes
    {
    public:
    Primes(size_t size) {};
    virtual ~Primes() {};
    virtual void unprime(size_t value) = 0;   // marks a number as not prime
    virtual bool get(size_t value) const = 0; // gets how a number is marked
    virtual size_t size() const = 0;          // size of the store
    };

    // Basic store. Stores all the primes in a vector<bool>
    class PrimesVB: public Primes
    {
    std::vector<bool> mStore;
    public:
    PrimesVB(size_t size) : Primes(size), mStore(size, true) {};
    virtual void unprime(size_t value) override
    {
    mStore[value] = false;
    }

    virtual bool get(size_t value) const override
    {
    return mStore[value];
    }

    virtual size_t size() const override
    {
    return mStore.size();
    }
    };

    class Primes2: public PrimesVB
    {
    size_t mSize;
    public:
    Primes2(size_t size) : PrimesVB((size + 1) / 2), mSize(size) {};
    virtual void unprime(size_t value) override
    {
    if (value & 1)
    PrimesVB::unprime(value / 2);
    }

    virtual bool get(size_t value) const override
    {
    if (value <= 2) return true;
    return (value & 1) && PrimesVB::get(value / 2);
    }

    virtual size_t size() const override
    {
    return mSize;
    }
    };

    class Primes23: public PrimesVB
    {
    // Size of the map. This is a lot bigger than the size of the bitmap.
    size_t mSize;

    // each "page" contains 2 "lines". A line holds the candidates we are
    // storing. this map says where we store each number, depending on its
    // value modulo 6.
    // -1 is known not to be prime, and isn't stored.
    // we store 7, 13, 19 in "line" 0; 11, 17, 23 in "line" 1.
    // the others are divisible by 2 or 3, and don't need to be stored.
    static constexpr int mPosMap[6] = {-1, 0, -1, -1, -1, 1};
    public:
    Primes23(size_t size) : PrimesVB((size + 2) / 3), mSize(size) {};
    virtual void unprime(size_t value) override
    {
    if (value <= 3) return;
    auto page = value / 6;
    auto line = mPosMap[value % 6];
    if (line >= 0)
    {
    PrimesVB::unprime(page * 2 + line);
    }
    }

    virtual bool get(size_t value) const override
    {
    if (value <= 3) return true;

    auto page = value / 6;          // 5->0 7->1 11->1 13->2
    auto line = mPosMap[value % 6]; // 5->1 7->0 11->1 13->0
    if (line >= 0)
    {
    return PrimesVB::get(page * 2 + line);
    }
    else
    {
    return false;
    }
    }

    virtual size_t size() const override
    {
    return mSize;
    }
    };


    // Simple sieve of Eratosthenes function
    void sieve(Primes& store)
    {
    const size_t storesize = store.size();
    const size_t maxfilter = std::ceil(std::sqrt(storesize));

    size_t prime = 2;
    while (prime < maxfilter)
    {
    // Unmark all the multiples
    for (size_t inval = prime * 2; inval < storesize; inval += prime)
        store.unprime(inval);

    // Find the next prime to filter with
    while (!store.get(++prime) && prime < maxfilter) {};
    }
    }

    // Benchmark function. Returns the constructed collection for validation.
    template <typename storetype> storetype test(const char* classname)
    {
    stopwatch s;
    s.start();
    storetype store(PrimeCount);
    s.stop();
    std::cout << classname << " construct " << s.delta() << " seconds." << std::endl;

    s.start();
    sieve(store);
    s.stop();
    std::cout << classname << " sieve " << s.delta() << " seconds." << std::endl;

    return store;
    }

    int main()
    {
    auto vb = test<PrimesVB>("vector<bool>");
    auto p2 = test<Primes2>("Storing odds only");
    auto p23 = test<Primes23>("Not storing 2s and 3s");

    // double check they all match.
    for (size_t p = 1; p < PrimeCount; ++p)
    {
    assert(vb.get(p) == p2.get(p));
    assert(vb.get(p) == p23.get(p));
    }

    return 0;
    }

  • From Vir Campestris@21:1/5 to All on Thu Dec 14 15:06:14 2023
    I tried a few more collections.

    The slowest on my system is vector<bool> 12.165s for 1e9 primes.

    Having a manual bitmask, and storing in uint64_t is a lot faster -
    almost twice as fast (6.659).

    A uint32_t is a little faster than that (6.306).

    Optimising it to store odds only more than doubles the speed of
    vector<bool> to 5.107 seconds.

    That has less effect with uint64_t, taking it to 4.946 - which is the
    best time I have.

    Oddly, storing odds only with uint32_t is not as fast as odds only
    with uint64_t, although it is still faster than storing all the
    values (5.302).

    I've optimised it to do less recursion, which has helped, but it still
    isn't worth not storing the multiples of 3. I'll guess that's because
    that code requires a divide and a mod by 6, whereas optimising out the
    evens only requires shifts and masks.

    Andy

  • From red floyd@21:1/5 to Vir Campestris on Thu Dec 14 08:20:19 2023
    On 12/13/2023 7:16 AM, Vir Campestris wrote:
    On 11/12/2023 17:19, Bonita Montero wrote:
    On 11.12.2023 at 18:12, Vir Campestris wrote:

    Because all primes except 2 are odd you don't need to store the
    _even_ numbers, which is what I meant to say. Or else you only need
    to store the odd ones.

    I'll try that in the next few days; but I don't expect a significant change.

    Well, I went away and tried it. I didn't write anything nearly as sophisticated as yours - I didn't worry about threads and caching. The
    code follows. You'll have to unwrap it in a few places.

    It occurred to me I could go further - not just optimise it out to store
    odd numbers only, but also miss out the ones divisible by 3, or other
    higher primes. The results were:
    - Storing all numbers: 9.494 seconds to run the sieve
    - Storing odd numbers only: 4.455. That's over twice as fast.
    - Excluding also the ones divisible by 3: 3.468. That's an improvement,
    but not as much as I expected. Perhaps it's got too complicated. I don't
    have a high-powered PC.

    Another way to attempt to optimize is that except for 2 and 3, all
    primes are of the form 6n+1 or 6n-1.
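    As a concrete illustration of the 6n+/-1 observation (a simple
    trial-division checker, not Bonita's sieve; names mine):

```cpp
#include <cassert>

// After handling 2 and 3, every prime is of the form 6k-1 or 6k+1,
// so candidate divisors can advance in steps of 6, testing d and d+2.
bool isPrime(unsigned long long n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    if (n % 3 == 0) return n == 3;
    for (unsigned long long d = 5; d * d <= n; d += 6)
        if (n % d == 0 || n % (d + 2) == 0)
            return false;
    return true;
}
```

    The same pattern is what a sieve exploits when it stores only the two
    residue classes 1 and 5 modulo 6.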

  • From Vir Campestris@21:1/5 to Bonita Montero on Thu Dec 21 15:30:12 2023
    On 20/12/2023 12:44, Bonita Montero wrote:
    Now the code runs in 0.15s on the 16-core Zen4 system; that's 87%
    faster. On the 64-core Zen2 system the time to produce the first 21
    million primes is about 0.05 seconds.

    I thought I'd try it.

    Can I make a suggestion? Where it says

    if( argc < 2 )
    return EXIT_FAILURE;

    make it instead print a user guide?

    On my system my code takes about 20 seconds to produce 1e9 primes. It's
    single threaded. Yours is taking a little over 6, but about 18 seconds
    of CPU time. (I have 4 cores, each with 2 threads. Mine is designed to
    be cool and quiet...)

    I've obviously got something wrong though - I'm using a _lot_ more
    memory than you.

    Out of interest - I thought you must be Spanish from your name. And yet
    you speak fluent German?

    Andy

  • From Tim Rentsch@21:1/5 to red floyd on Sat Dec 23 10:30:21 2023
    red floyd <no.spam.here@its.invalid> writes:

    On 12/13/2023 7:16 AM, Vir Campestris wrote:

    On 11/12/2023 17:19, Bonita Montero wrote:

    On 11.12.2023 at 18:12, Vir Campestris wrote:

    Because all primes except 2 are odd you don't need to store the
    _even_ numbers, which is what I meant to say. Or else you only
    need to store the odd ones.

    I'll try that in the next few days; but I don't expect a significant change.

    Well, I went away and tried it. I didn't write anything nearly as
    sophisticated as yours - I didn't worry about threads and
    caching. The code follows. You'll have to unwrap it in a few places.

    It occurred to me I could go further - not just optimise it out to
    store odd numbers only, but also miss out the ones divisible by 3,
    or other higher primes. The results were:
    - Storing all numbers: 9.494 seconds to run the sieve
    - Storing odd numbers only: 4.455. That's over twice as fast.
    - Excluding also the ones divisible by 3: 3.468. That's an
    improvement, but not as much as I expected. Perhaps it's got too
    complicated. I don't have a high-powered PC.

    Another way to attempt to optimize is that except for 2 and 3, all
    primes are of the form 6n+1 or 6n-1.

    A more compact form is to consider numbers mod 30, in which case
    numbers that are 0 mod {2, 3, or 5} can be ignored. Conveniently
    there are just 8 such values - 1, 7, 11, 13, 17, 19, 23, and 29 -
    so each block of 30 can be represented in one 8-bit byte. Doing
    that gives a 25% reduction in space compared to a 6n+/-1 scheme.
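    The packing Tim describes can be sketched like this (helper names are
    mine; a real sieve would build on these, assuming C++14):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mod-30 wheel: the 8 residues coprime to 30 map onto the 8 bits of a
// byte, so each block of 30 numbers costs exactly one byte of sieve.
constexpr int residues[8] = {1, 7, 11, 13, 17, 19, 23, 29};

// bit position for residue r (mod 30), or -1 if r shares a factor with 30
constexpr int bitFor(int r) {
    for (int i = 0; i < 8; ++i)
        if (residues[i] == r) return i;
    return -1;
}

// byte index and bit mask for n; only meaningful when bitFor(n % 30) >= 0
constexpr std::size_t byteIndex(std::size_t n) { return n / 30; }
constexpr std::uint8_t bitMask(std::size_t n) {
    return std::uint8_t(1u << bitFor(int(n % 30)));
}
```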

  • From Tim Rentsch@21:1/5 to Vir Campestris on Sat Dec 23 10:21:04 2023
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On my system my code takes about 20 seconds to produce 1e9
    primes. [...]

    Do you mean 1e9 primes or just the primes less than 1e9? To
    do the first 1e9 primes a sieve would need to go up to about
    23.9e9 (so half that many bits if only odd numbers were
    represented).

  • From Vir Campestris@21:1/5 to Tim Rentsch on Sat Dec 23 21:20:42 2023
    On 23/12/2023 18:30, Tim Rentsch wrote:
    A more compact form is to consider numbers mod 30, in which case
    numbers that are 0 mod {2, 3, or 5} can be ignored. Conveniently
    there are just 8 such values - 1, 7, 11, 13, 17, 19, 23, and 29 -
    so each block of 30 can be represented in one 8-bit byte. Doing
    that gives a 25% reduction in space compared to a 6n+/-1 scheme.

    I found that on my system the modulo operation was so slow this wasn't
    worth doing.

    Andy

  • From Vir Campestris@21:1/5 to Tim Rentsch on Sat Dec 23 21:21:21 2023
    On 23/12/2023 18:21, Tim Rentsch wrote:
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On my system my code takes about 20 seconds to produce 1e9
    primes. [...]

    Do you mean 1e9 primes or just the primes less than 1e9? To
    do the first 1e9 primes a sieve would need to go up to about
    23.9e9 (so half that many bits if only odd numbers were
    represented).

    Primes up to 1e9.

    I have another idea though, watch this space...

    Andy

  • From Tim Rentsch@21:1/5 to Vir Campestris on Sun Dec 24 00:36:42 2023
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 23/12/2023 18:30, Tim Rentsch wrote:

    A more compact form is to consider numbers mod 30, in which case
    numbers that are 0 mod {2, 3, or 5} can be ignored. Conveniently
    there are just 8 such values - 1, 7, 11, 13, 17, 19, 23, and 29 -
    so each block of 30 can be represented in one 8-bit byte. Doing
    that gives a 25% reduction in space compared to a 6n+/-1 scheme.

    I found that on my system the modulo operation was so slow this wasn't
    worth doing.

    Depending on how the code is written, no modulo operations need
    to be done, because they will be optimized away and done at
    compile time. If we look at multiplying two numbers represented
    by bits in bytes i and j, the two numbers are

    i*30 + a
    j*30 + b

    for some a and b in { 1, 7, 11, 13 17, 19, 23, 29 }.

    The values we're interested in are the index of the product and
    the residue of the product, namely

    (i*30+a) * (j*30+b) / 30 (for the index)
    (i*30+a) * (j*30+b) % 30 (for the residue)

    Any term with a *30 in the numerator doesn't contribute to
    the residue, and also can be combined with the by-30 divide
    for computing the index. Thus these expressions can be
    rewritten as

    i*30*j + i*b + j*a + (a*b/30) (for the index)
    a*b%30 (for the residue)

    When a and b have values that are known at compile time,
    neither the divide nor the remainder result in run-time
    operations being done; all of that heavy lifting is
    optimized away and done at compile time. Of course there
    are some multiplies, but they are cheaper than divides, and
    also can be done in parallel. (The multiplication a*b also
    can be done at compile time.)

    The residue needs to be turned into a bit mask to do the
    logical operation on the byte of bits, but here again that
    computation can be optimized away and done at compile time.

    Does that all make sense?
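    In code form, the tables being described might look like this
    (assuming C++14 constexpr; all names are mine):

```cpp
#include <cassert>

// With a and b drawn from the 8 residues coprime to 30, both a*b/30
// (the index carry) and a*b%30 (the residue of the product) become
// compile-time table lookups: no run-time divide or modulo needed.
constexpr int R[8] = {1, 7, 11, 13, 17, 19, 23, 29};

struct Tables {
    int carry[8][8];    // (R[i]*R[j]) / 30
    int residue[8][8];  // (R[i]*R[j]) % 30
    constexpr Tables() : carry{}, residue{} {
        for (int i = 0; i < 8; ++i)
            for (int j = 0; j < 8; ++j) {
                carry[i][j]   = R[i] * R[j] / 30;
                residue[i][j] = R[i] * R[j] % 30;
            }
    }
};
constexpr Tables tables;

// index of the product of numbers (i*30 + R[ai]) and (j*30 + R[bi]):
// i*30*j + i*R[bi] + j*R[ai] + (R[ai]*R[bi])/30, exactly as derived above
constexpr long productIndex(long i, int ai, long j, int bi) {
    return i * 30 * j + i * R[bi] + j * R[ai] + tables.carry[ai][bi];
}
```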

  • From Tim Rentsch@21:1/5 to Vir Campestris on Sun Dec 24 10:49:51 2023
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 23/12/2023 18:21, Tim Rentsch wrote:

    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On my system my code takes about 20 seconds to produce 1e9
    primes. [...]

    Do you mean 1e9 primes or just the primes less than 1e9? To
    do the first 1e9 primes a sieve would need to go up to about
    23.9e9 (so half that many bits if only odd numbers were
    represented).

    Primes up to 1e9.

    I have another idea though, watch this space...

    Does your machine have enough memory to compute all the primes up
    to 24e9? If it does, I suggest that for your next milestone.

  • From Kaz Kylheku@21:1/5 to Chris M. Thomasson on Tue Dec 26 23:35:11 2023
    On 2023-12-26, Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 12/26/2023 1:27 AM, Bonita Montero wrote:
    On 26.12.2023 at 06:39, Chris M. Thomasson wrote:

    Remember when Intel first started hyperthreading and god damn threads
    could false share with each other (on the stacks!) the low and high 64
    byte parts of the 128 byte cache lines? A workaround was to
    artificially offset the threads stacks via alloca.

    If you have one core and two threads there's no false sharing.

    So, are you familiar with Intel's early hyper-threading problem? There
    was false sharing between the hyperthreads. The workaround did improve
    performance by quite a bit. IIRC, my older appcore project had this
    workaround incorporated into its logic. I wrote that sucker back in
    the very early 2000s. Humm... I will try to find the exact line.

    Could you be both right? The performance problem was real, but maybe
    it wasn't "false sharing"? The hyper-threads are on the same core;
    they share the same caches.

    Might you be describing a cache collision rather than false sharing?

    A single processor can trigger degenerate cache uses (at any level
    of a cache hierarchy). For instance, if pages of virtual memory
    are accessed in certain patterns, they can thrash the same TLB entry.

    Was it perhaps the case that these thread stacks were allocated at such
    a stride, that their addresses clashed on the same cache line set?

    That could be a problem even on one processor, but it's obvious that
    hyperthreading could exacerbate it, because switches between
    hyperthreads happen at a fine granularity. They don't get to run a
    full scheduler-driven time quantum. A low-level processor event like
    a pipeline stall (or other resource issue) can drive a hyper-thread
    switch.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

  • From Kaz Kylheku@21:1/5 to Bonita Montero on Wed Dec 27 20:49:06 2023
    On 2023-12-27, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    On 27.12.2023 at 06:59, Chris M. Thomasson wrote:

    Something about false interference between threads that should not even
    be interacting with one another to begin with. It was a problem. The
    fix was based on alloca, so that is something to ponder on.

    Like with any SMT core there could be cache thrashing between the
    threads. The L1 data cache was only 8kB, and maybe only two-way
    associative; that could be thrashing between the threads. But I've
    no clue what this would have to do with alloca().

    Say you have thread stacks that are offset by some power-of-two
    amount, like a megabyte. Stack addresses at the same depth (like the
    local variables of threads executing the same function) are likely to
    collide on the same cache set (at different levels of the caching
    hierarchy: L1, L2, translation caches).

    With alloca, since it moves the stack pointer, we can carve variable
    amounts of stack space to randomize the stack offsets (before calling
    the work functions). In different threads, we use differently-sized
    alloca allocations.
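    A sketch of what that looks like in practice (illustrative only; the
    128-byte step and the function names are mine, not from any code
    discussed in the thread):

```cpp
#include <alloca.h>   // non-standard; glibc header, _alloca on MSVC
#include <atomic>
#include <cassert>
#include <cstring>

std::atomic<long> total{0};

void worker(int id) {
    long local = id;   // a same-depth local, now at a staggered address
    total += local;
}

// Thread entry: burn a thread-specific amount of stack before calling
// the real work function, so identical call depths in different threads
// no longer land on the same cache set.
void threadEntry(int id) {
    void* pad = alloca(static_cast<std::size_t>(id + 1) * 128);
    std::memset(pad, 0, 1);   // touch it so the allocation isn't elided
    worker(id);
}
```

    Each std::thread would run threadEntry with its own id; the differing
    alloca sizes are what the workaround relies on.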


  • From David Brown@21:1/5 to Bonita Montero on Fri Dec 29 13:56:54 2023
    On 29/12/2023 10:58, Bonita Montero wrote:
    On 29.12.2023 at 05:58, Chris M. Thomasson wrote:

    On 12/28/2023 7:17 PM, Bonita Montero wrote:

    On 29.12.2023 at 00:38, Chris M. Thomasson wrote:

    The use of alloca to try to get around the problem in their
    (Intel's) early hyperthreaded processors was real, and actually
    helped. It was in the Intel docs.

    I don't believe it.

    Why not?

    Because I don't understand what's different with the access pattern
    of alloca() and usual stack allocation. And if I google for "alloca
    Pentium 4 site:intel.com" I can't find anything that fits.


    There is nothing different from alloca() and ordinary stack allocations.
    But alloca() makes it quick and easy to make large allocations, and to
    do so with random sizes (or at least sizes that differ significantly
    between threads, even if they are running the same code). Without
    alloca(), you'd need to do something like arranging to call a recursive function a random number of times before it then calls the next bit of
    code in your thread. alloca() is simply far easier and faster.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Fri Dec 29 16:01:47 2023
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 29.12.2023 at 05:58, Chris M. Thomasson wrote:

    On 12/28/2023 7:17 PM, Bonita Montero wrote:

    On 29.12.2023 at 00:38, Chris M. Thomasson wrote:

    The use of alloca to try to get around the problem in their (Intel's)
    early hyperthreaded processors was real, and actually helped. It was
    in the Intel docs.

    I don't believe it.

    Why not?

    Because I don't understand what's different with the access pattern
    of alloca() and usual stack allocation. And if I google for "alloca
    Pentium 4 site:intel.com" I can't find anything that fits.


    See page 2-35 in the _Intel Pentium 4 Processor Optimization_
    manual.

  • From Vir Campestris@21:1/5 to Tim Rentsch on Fri Dec 29 18:03:54 2023
    On 24/12/2023 08:36, Tim Rentsch wrote:
    Depending on how the code is written, no modulo operations need
    to be done, because they will be optimized away and done at
    compile time. If we look at multiplying two numbers represented
    by bits in bytes i and j, the two numbers are

    i*30 + a
    j*30 + b

    for some a and b in { 1, 7, 11, 13 17, 19, 23, 29 }.

    The values we're interested in are the index of the product and
    the residue of the product, namely

    (i*30+a) * (j*30+b) / 30 (for the index)
    (i*30+a) * (j*30+b) % 30 (for the residue)

    Any term with a *30 in the numerator doesn't contribute to
    the residue, and also can be combined with the by-30 divide
    for computing the index. Thus these expressions can be
    rewritten as

    i*30*j + i*b + j*a + (a*b/30) (for the index)
    a*b%30 (for the residue)

    When a and b have values that are known at compile time,
    neither the divide nor the remainder result in run-time
    operations being done; all of that heavy lifting is
    optimized away and done at compile time. Of course there
    are some multiplies, but they are cheaper than divides, and
    also can be done in parallel. (The multiplication a*b also
    can be done at compile time.)

    The residue needs to be turned into a bit mask to do the
    logical operation on the byte of bits, but here again that
    computation can be optimized away and done at compile time.

    Does that all make sense?

    Right now, no. But that's me. I'll flag it to read again when I've had a
    better night's sleep.

    Andy

  • From Kaz Kylheku@21:1/5 to Bonita Montero on Fri Dec 29 17:29:00 2023
    On 2023-12-29, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    On 29.12.2023 at 05:58, Chris M. Thomasson wrote:

    On 12/28/2023 7:17 PM, Bonita Montero wrote:

    On 29.12.2023 at 00:38, Chris M. Thomasson wrote:

    The use of alloca to try to get around the problem in their (Intel's)
    early hyperthreaded processors was real, and actually helped. It was
    in the Intel docs.

    I don't believe it.

    Why not?

    Because I don't understand what's different with the access pattern
    of alloca() and usual stack allocation. And if I google for "alloca
    Pentium 4 site:intel.com" I can't find anything that fits.

    I explained it. The allocation is not used. When you call alloca(n),
    the stack pointer moves by n bytes. If you then call a function,
    its stack frame will be offset by that much (plus any alignment if
    n is not aligned).



  • From Kaz Kylheku@21:1/5 to Chris M. Thomasson on Sat Dec 30 04:51:47 2023
    On 2023-12-29, Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 12/29/2023 2:09 PM, Chris M. Thomasson wrote:
    On 12/29/2023 8:01 AM, Scott Lurndal wrote:
    page 2-35 in the _Intel Pentium 4 Processor Optimization_
    manual.

    I think it was in chapter 5 of Developing Multithreaded Applications: A
    Platform Consistent Approach cannot remember the damn section right now.

    Wait a minute! I might have found it, let's see:

    https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf

    Ahhh section 5.3! Nice! I read this a while back, before 2005.

    Wow, I guessed that one. Elsewhere in the thread, I made a remark
    similar to "imagine that thread stacks are aligned at an address like
    nnnnFFFF", i.e. the top of the stack starts at the top of a 64 kB
    aligned window. I was thinking of exactly a typical L1 cache size
    from around that era, in fact.


  • From Kaz Kylheku@21:1/5 to Bonita Montero on Sat Dec 30 04:56:19 2023
    On 2023-12-30, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    On 29.12.2023 at 23:12, Chris M. Thomasson wrote:

    Wait a minute! I might have found it, let's see:
    https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
    Ahhh section 5.3! Nice! I read this a while back, before 2005.

    What it says is nonsense if the cache is really four-way associative
    like in the other paper mentioned here. And of course that has
    nothing to do with false sharing.

    If it's four-way associative, you start to have a performance problem
    as soon as five things collide on it. For instance, suppose you have
    two thread stack tops mapped to the same cache lines, plus three more
    data structures heavily being accessed.

    Oh, and the above Intel paper does actually mention alloca:

    "The easiest way to adjust the initial stack address for each thread is
    to call the memory allocation function, _alloca, with varying byte
    amounts in the intermediate thread function."
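    A minimal sketch of that suggestion. The 64-byte line size, the
    modulus 8, and the names are illustrative assumptions, not values or
    code from the Intel paper; the only point is that each thread's real
    frames end up at a different stack offset:

```cpp
#include <alloca.h>  // glibc spelling; MSVC calls it _alloca in <malloc.h>

// Per-thread stack stagger: an unused alloca() before the real thread
// body shifts every subsequent stack frame by a thread-specific amount,
// so frames in different threads stop aliasing to the same cache sets.
constexpr unsigned stagger_bytes(unsigned thread_id) {
    return (thread_id % 8 + 1) * 64;  // 64..512 bytes, differs per thread
}

template <typename Body>
void stagger_trampoline(unsigned thread_id, Body body) {
    // The allocation is never read; only the stack-pointer offset matters.
    volatile char *pad =
        static_cast<volatile char *>(alloca(stagger_bytes(thread_id)));
    pad[0] = 0;  // touch it so the compiler cannot discard the alloca
    body();
}
```

    Each thread would then be started on stagger_trampoline() rather
    than on its body directly.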

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Bonita Montero on Sat Dec 30 05:51:51 2023
    On 2023-12-30, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 30.12.2023 um 05:56 schrieb Kaz Kylheku:

    If it's four way associative, you start to have a performance problem as
    soon as five things collide on it. ...

    With four-way associativity that's rather unlikely.

    Under four-way associativity, five isn't greater than four?

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bonita Montero on Sat Dec 30 19:27:30 2023
    On 29/12/2023 17:04, Bonita Montero wrote:
    Am 29.12.2023 um 13:56 schrieb David Brown:

    There is nothing different from alloca() and ordinary stack
    allocations.  But alloca() makes it quick and easy to make large
    allocations, and to do so with random sizes (or at least sizes that
    differ significantly between threads, even if they are running the
    same code).

    I've got my own class overflow_array<> which is like an array<> and a vector<>. If you append more than the internal array<> can handle, the
    objects are moved to an internal vector. I think Boost's small_vector
    is similar to that.

    I'm sure that's all very nice, but it is completely and utterly
    irrelevant to the issue being discussed. Perhaps you didn't understand
    your own post?


    Without  alloca(), you'd need to do something like arranging to call a
    recursive  function a random number of times before it then calls the
    next bit of  code in your thread.  alloca() is simply far easier and
    faster.

    You've got strange ideas. alloca() has been completely removed from the
    Linux kernel.

    Citation? You are usually wrong in your claims, or at least mixed-up,
    so I won't trust you without evidence. (That does not mean you are
    wrong here - I don't know either way.)

    Of course, avoiding alloca() within the kernel is, again, utterly
    irrelevant to the point under discussion.

    The point is that if there's a fixed upper limit you would
    allocate you could allocate it always statically.


    No, that would be useless.

    The idea is to have /different/ allocation sizes in different threads,
    so that the different threads have their stacks at wildly different
    address bits in their tags for the processor caches.

    I can't tell you how helpful or not this may be in practice. I am
    merely trying to explain to you what the idea is, since you said you did
    not understand it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Bonita Montero on Sat Dec 30 20:35:03 2023
    On 2023-12-30, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 30.12.2023 um 06:51 schrieb Kaz Kylheku:

    Under four-way associativity, five isn't greater than four?

    Two-way associativity would leave no space if both threads have
    synchronous stack frames. With four-way associativity there's
    not much likelihood of that happening.

    My comment makes it clear that there are two thread stacks vying
    for that cache line, plus a couple of other accesses that
    are not thread stacks.

    (By the way, set associative caches don't always have full LRU
    replacement strategies.)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From red floyd@21:1/5 to Chris M. Thomasson on Sat Dec 30 14:58:06 2023
    On 12/30/2023 11:58 AM, Chris M. Thomasson wrote:
    On 12/29/2023 8:45 PM, Bonita Montero wrote:
    Am 29.12.2023 um 18:29 schrieb Kaz Kylheku:

    I explained it. The allocation is not used. When you call alloca(n),
    the stack pointer moves by n bytes. If you then call a function,
    its stack frame will be offset by that much (plus any alignment if
    n is not aligned).

    According to the paper Scott mentioned the associativity of the
    Pentium 4's L1 data cache is four. With that it's not necessary
    to have such aliasing preventions.


    Huh? Wow, you really need to write Intel a letter about it wrt their
    older hyperthreaded processors! Although, it seems like you simply do
    not actually _understand_ what is going on here...

    Intel's suggestions for how to mitigate the problem in their earlier
    hyperthreaded processors actually worked wrt improving performance. Keep
    in mind this was a while back in 2004-2005. I am happy that the Wayback
    Machine has my older code.

    Oh, come on, Chris. It's clear that Bonita knows more about what's
    going on inside an Intel processor than Intel does.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Bonita Montero on Sun Dec 31 07:01:07 2023
    On 2023-12-31, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 30.12.2023 um 21:35 schrieb Kaz Kylheku:

    My comment makes it clear that there are two thread stacks
    vying for that cache line, plus a couple of other accesses
    that are not thread stacks.

    With four way associativity that's rather unlikely to happen.

    What? The associativity of the cache is not the determiner of
    program behavior; if the program accesses five different
    areas whose addresses are the same modulo 65536 bytes,
    that happens whether there is a direct mapped cache, a fully
    associative cache, no cache at all, or ...




    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Bonita Montero on Sun Dec 31 17:30:09 2023
    On 2023-12-31, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 31.12.2023 um 08:01 schrieb Kaz Kylheku:
    On 2023-12-31, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 30.12.2023 um 21:35 schrieb Kaz Kylheku:

    My comment makes it clear that there are two thread stacks
    vying for that cache line, plus a couple of other accesses
    that are not thread stacks.

    With four way associativity that's rather unlikely to happen.

    What? The associativity of the cache is not the determiner
    of program behavior; if the program accesses five different
    areas whose addresses are the same modulo 65536 bytes, ...

    The set size is smaller than 64kB and of course there can be aliasing,
    but with four way associativity it's unlikely that there's a need to
    control aliasing. And if the set size were 64kB, aliasing would be
    even less likely through page colouring.

    When I describe a scenario in which five items are being frequently
    accessed and collide to the same cache line, which is associative with a
    set size of four, there is necessarily a conflict, because five is
    greater than four.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Bonita Montero on Sun Dec 31 18:44:54 2023
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 30.12.2023 um 05:56 schrieb Kaz Kylheku:

    If it's four way associative, you start to have a performance problem as
    soon as five things collide on it. ...

    With four-way associativity that's rather unlikely.


    Why? The set selection is based on specific bits in the address. As soon
    as a fifth address hits the set, you lose one of the existing lines.

    Given the aligned stacks in the specified processor, the next function
    called would _always_ overwrite the elements of the set(s) used by the calling function.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Sun Dec 31 18:49:22 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 29/12/2023 17:04, Bonita Montero wrote:


    You've got strange ideas. alloca() has been completely removed from the
    Linux kernel.

    Citation? You are usually wrong in your claims, or at least mixed-up,
    so I won't trust you without evidence. (That does not mean you are
    wrong here - I don't know either way.)

    Kernel threads generally run with a small, fixed stack. Stack-based
    dynamic allocation is generally avoided. I don't know of any
    hard restrictions, but I suspect that Linus would look askance
    at it.


    Of course, avoiding alloca() within the kernel is, again, utterly
    irrelevant to the point under discussion.

    Very true

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Bonita Montero on Mon Jan 1 08:28:57 2024
    On 2024-01-01, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 31.12.2023 um 19:44 schrieb Scott Lurndal:

    Why? ...

    Because there must be set-conflicts on the same line index.
    That's rather unlikely with four-way associativity.

    It's rather likely. Suppose you have a large number of threads,
    with stacks that end on a multiple of 64 kB.

    Only two of these threads are runnable concurrently on the
    two hyperthreads. However, your application may be rapidly
    switching between them.

    If their stacks evacuate each other from the L1 cache,
    that could be bad.

    For compute-bound tasks with long quanta, it might not
    matter so much; the same threads will occupy the hyperthreads
    for tens of milliseconds at a time or whatever.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Mon Jan 1 12:46:39 2024
    On 31/12/2023 19:49, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 29/12/2023 17:04, Bonita Montero wrote:


    You've got strange ideas. alloca() has been completely removed from the
    Linux kernel.

    Citation? You are usually wrong in your claims, or at least mixed-up,
    so I won't trust you without evidence. (That does not mean you are
    wrong here - I don't know either way.)

    Kernel threads generally run with a small, fixed stack. Stack-based
    dynamic allocation is generally avoided. I don't know of any
    hard restrictions, but I suspect that Linus would look askance
    at it.


    Oh, I have no doubts that it would be frowned upon in the kernel itself.
    And I know that VLAs have been actively removed from the kernel. But Bonita's claim implies that "alloca()" was used in the kernel earlier,
    and has since been removed, presumably due to a specific decision.


    Of course, avoiding alloca() within the kernel is, again, utterly
    irrelevant to the point under discussion.

    Very true

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Bonita Montero on Sat Jan 6 08:31:06 2024
    On 2024-01-06, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 06.01.2024 um 04:21 schrieb Chris M. Thomasson:

    Are you trying to tell me that the aliasing problem on those older Intel
    hyperthreaded processors and the workaround (from Intel) was a myth?
    lol. ;^)

    Intel just made a nerd-suggestion. With four-way associativity
    there's no frequent aliasing problem in the L1 data cache of
    the Pentium 4.

    I think the L1 cache was 8K on that thing, and the blocks are 32 bytes.

    I think how it works on the P4 is that the address is structured like
    this:

     31             11 10            5 4                               0
    [  21 bit tag   ] [ 6 bit cache set ] [ 5 bit offset into 32 byte block ]

    Thus say we have an area of the stack with the address
    range nnnnFF80 to nnnnFFFF (128 bytes, 4 x 32 byte cache blocks).

    These four blocks all map to the same set: they have the same six
    bits in the "cache set" part of the address.

    So if a thread is accessing something in all four blocks, it will
    completely use that cache set, all by itself.

    If any other thread has a similar block in its stack, with the same
    cache set ID, it will cause evictions against this thread.

    Sure, if each of these threads confines itself to working with just one cacheline-sized aperture of the stack, it looks better.

    You're forgetting that the sets are very small and that groups of
    adjacent four 32 byte blocks map to the same set. Touch four adjacent
    cache blocks that are aligned on a 128 byte boundary, and you have
    hit full occupancy in the cache set corresponding to that block!

    (I suspect the references to 64K should not be kilobytes but sets.
    The 8K cache has 64 sets.)

    In memory, a 128 byte block that is aligned maps to, and precisely
    covers, a cache set. If two such blocks have addresses that are equal
    modulo 8K, they collide to the same cache set. If one of those blocks
    is fully present in the cache, the other must be fully evicted.

    It's really easy to see how things can go south under hyperthreading.
    If two hyperthreads are working with clashing 128 byte areas that each
    want to hog the same cache set, and the core is switching between them
    on a fine-grained basis, ... you get the picture.

    It's very easy for the memory mapping allocations used for thread
    stacks to produce addresses such that the delta between them is a
    multiple of 8K.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From red floyd@21:1/5 to Chris M. Thomasson on Mon Jan 8 17:14:55 2024
    On 1/8/2024 12:18 PM, Chris M. Thomasson wrote:
    On 1/7/2024 9:48 PM, Bonita Montero wrote:
    Am 07.01.2024 um 21:46 schrieb Chris M. Thomasson:

    I know that they had a problem and the provided workaround from Intel
    really did help out. ...

    Absolutely not, not with four way associativity.


    Whatever you say; Sigh. I am done with this.

    Intel just needs to call Bonita whenever they have an issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Bonita Montero on Tue Jan 9 02:02:13 2024
    On 2024-01-08, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 07.01.2024 um 21:46 schrieb Chris M. Thomasson:

    I know that they had a problem and the provided workaround from Intel
    really did help out. ...

    Absolutely not, not with four way associativity.

    You are referring to that as if it were some rescuing magic.

    Four way associativity is quite poor. It's only a step above two-way,
    which is a step above the bottom of the barrel: direct mapping.

    The cache entries are tiny on the P4: 32 bytes. An entire set is just
    128 bytes. Code that works intensely with a 128 byte block occupies
    the entire set of four cache blocks. If anything else touches memory
    that maps to the same set, an evacuation has to take place, and
    then when that code is resumed, it will take a hit.

    If two hyperthreads run the same function which works intensively
    with a 128 byte area of the stack, and the distance between the
    stacks is a multiple of 8K, they will interfere through the same
    cache set. The cache set is 128 bytes; each thread wants to keep
    its own 128 bytes in it.

    The situation seems far from improbable to me.

    We don't need more than four hyper-threads in order to blow the
    four-way set associative cache. We just need more than one
    in the above scenario.

    Needing more than four would be the case for threads that are hammering
    away on a 32 byte block. (And so since there are only two hyperthreads
    on the P4, that wouldn't be a problem.)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Sat Jan 13 21:31:42 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 24/12/2023 08:36, Tim Rentsch wrote:

    Depending on how the code is written, no modulo operations need
    to be done, because they will be optimized away and done at
    compile time. If we look at multiplying two numbers represented
    by bits in bytes i and j, the two numbers are

    i*30 + a
    j*30 + b

    for some a and b in { 1, 7, 11, 13, 17, 19, 23, 29 }.

    The values we're interested in are the index of the product and
    the residue of the product, namely

    (i*30+a) * (j*30+b) / 30 (for the index)
    (i*30+a) * (j*30+b) % 30 (for the residue)

    Any term with a *30 in the numerator doesn't contribute to
    the residue, and also can be combined with the by-30 divide
    for computing the index. Thus these expressions can be
    rewritten as

    i*30*j + i*b + j*a + (a*b/30) (for the index)
    a*b%30 (for the residue)

    When a and b have values that are known at compile time,
    neither the divide nor the remainder result in run-time
    operations being done; all of that heavy lifting is
    optimized away and done at compile time. Of course there
    are some multiplies, but they are cheaper than divides, and
    also can be done in parallel. (The multiplication a*b also
    can be done at compile time.)

    The residue needs to be turned into a bit mask to do the
    logical operation on the byte of bits, but here again that
    computation can be optimized away and done at compile time.
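    The index and residue expressions above can be sketched as follows.
    Here the divides are done at run time for clarity; in the sieve a
    and b are compile-time constants, so a*b/30 and a*b%30 fold into
    constants as described:

```cpp
#include <cstdint>

// Position of the product (i*30+a)*(j*30+b) in a mod-30 sieve:
// index = i*30*j + i*b + j*a + a*b/30, residue = a*b % 30,
// exactly as derived above.
struct Pos { std::uint64_t index; unsigned residue; };

constexpr Pos product_pos(std::uint64_t i, unsigned a,
                          std::uint64_t j, unsigned b) {
    return { i * 30 * j + i * b + j * a + a * b / 30,
             a * b % 30 };
}
```

    For example, i=1, a=7 and j=2, b=11 encode 37 and 71; their product
    2627 is 30*87 + 17, and product_pos(1, 7, 2, 11) gives index 87 and
    residue 17.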

    Does that all make sense?

    Right now, no. But that's me. I'll flag it to read again when I've
    had a better night's sleep.

    I'm posting to nudge you into looking at this again, if
    you haven't already.

    I have now had a chance to get your source and run some
    comparisons. A program along the lines I outlined can run much
    faster than the code you posted (as well as needing less memory).
    A good target is to find all primes less than 1e11, which needs
    less than 4gB of ram.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to All on Thu May 16 17:28:55 2024
    I've been playing with this. Has it really been this long? I ought to
    have more time than this...

    I put together some code, and have been playing with it far too much.

    Bonita, I don't know whether to curse you or kiss you. (Though in fact I suspect you may be a little young for me. Possibly too young for my kids...)

    I haven't written any code just for fun in years. Maybe decades!

    I missed Tim's Mod 30 trick, and that might well help. But I think I'm
    bored with this. It would save a lot of memory, but the extra
    computation might make it slower. Maybe I'll try one day ;)

    My code isn't as fast as Bonita's code. Even allowing that I didn't put
    any threads in there.

    But...

    When I compared the output there were differences. Specifically your
    program claimed that these numbers

    66049,67591,69133,69647,71189,72217,72731,75301,78899,79927,80441,81469,85067,86609,89179,
    89693,90721,92263,94319,95861,97403,98431,99973,102029,103057,105113,107683,108197,110767,
    111281,112823,113851,115393,117449,118477,118991,120019,123103,125159,126187,128243,129271,
    130813,133897,134411,139037,140579,143149,144691,146233,146747,148289,150859,152401,153943,
    154457,155999,157541,158569,159083,162167,164737,165251,166279,167821,169363,169877,172961,
    173989,175531,177587,180157,182213,184783,186839,188381,189923,190951,193007,194549,195577,
    197633,198661,202259,204829,207913,208427,210997,211511,212539,213053,215623,219221,220249,
    220763,221791,225389,226417,226931,227959,233099,234127,236183,238753,240809,241837,243379,
    244921,248519,249547,251089,252631,254687,256229,259313,260341,261883,262397,264967,265481,
    267023,269593,270107,272677,273191,274733,279359,280387,280901,281929,283471,285013,287069,
    288611,290153,295807,296321,298891,300947,303517,305059,306601,308657,311741,312769,314311,
    315853

    are all prime. They aren't. The first one is 257 squared, and the others
    all have factors too.

    I'd also suggest that you comment demonstration code. It's intended to
    show off fancy techniques, and they aren't obvious without explanation.

    I learned that lesson when I was a student, when I couldn't understand something I'd written myself!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Vir Campestris on Thu May 16 21:40:02 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:
    ...
    When I compared the output there were differences. Specifically your
    program claimed that these numbers

    66049,67591,69133,69647,71189,72217,72731,75301,78899,79927,80441,81469,85067,86609,89179,
    89693,90721,92263,94319,95861,97403,98431,99973,102029,103057,105113,107683,108197,110767,
    111281,112823,113851,115393,117449,118477,118991,120019,123103,125159,126187,128243,129271,
    130813,133897,134411,139037,140579,143149,144691,146233,146747,148289,150859,152401,153943,
    154457,155999,157541,158569,159083,162167,164737,165251,166279,167821,169363,169877,172961,
    173989,175531,177587,180157,182213,184783,186839,188381,189923,190951,193007,194549,195577,
    197633,198661,202259,204829,207913,208427,210997,211511,212539,213053,215623,219221,220249,
    220763,221791,225389,226417,226931,227959,233099,234127,236183,238753,240809,241837,243379,
    244921,248519,249547,251089,252631,254687,256229,259313,260341,261883,262397,264967,265481,
    267023,269593,270107,272677,273191,274733,279359,280387,280901,281929,283471,285013,287069,
    288611,290153,295807,296321,298891,300947,303517,305059,306601,308657,311741,312769,314311,
    315853

    are all prime. They aren't. The first one is 257 squared, and the others
    all have factors too.

    In fact they are /all/ 257 times some prime. That must be a big clue as
    to where the bug is...

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Tue May 21 19:06:07 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    I've been playing with this. Has it really been this long? I
    ought to have more time than this... [...]

    I missed Tim's Mod 30 trick, and that might well help. But I
    think I'm bored with this. It would save a lot of memory, but
    the extra computation might make it slower.

    Two comments. Using the mod 30 encoding can be speed competitive
    with simpler encodings. The second comment is, it isn't just
    that less memory is used, but that more primes can be found.
    Bonita brags about how fast some code is, but it's an apples
    and oranges comparison - that code doesn't do the job because
    it runs into a memory limit way sooner than using a mod 30
    encoding, and trying to run past that limit would make the
    code R E A L L Y S L O W.

    Maybe I'll try one day ;)

    I hope you do. If you have some difficulty seeing how to
    avoid the seeming speed penalties, feel free to ask, I am
    happy to help.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Tim Rentsch on Thu May 30 12:32:04 2024
    On 22/05/2024 03:06, Tim Rentsch wrote:
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    I've been playing with this. Has it really been this long? I
    ought to have more time than this... [...]

    I missed Tim's Mod 30 trick, and that might well help. But I
    think I'm bored with this. It would save a lot of memory, but
    the extra computation might make it slower.

    Two comments. Using the mod 30 encoding can be speed competitive
    with simpler encodings. The second comment is, it isn't just
    that less memory is used, but that more primes can be found.
    Bonita brags about how fast some code is, but it's an apples
    and oranges comparison - that code doesn't do the job because
    it runs into a memory limit way sooner than using a mod 30
    encoding, and trying to run past that limit would make the
    code R E A L L Y S L O W.

    Maybe I'll try one day ;)

    I hope you do. If you have some difficulty seeing how to
    avoid the seeming speed penalties, feel free to ask, I am
    happy to help.

    I think I've come to the end of what my code will do. I'm not going to
    post it up (unless you really want it), but I'll put in some comments.

    Firstly the C++ bit:

    std::vector<bool> is much too slow to be any use. I rolled my own.

    I played with std::vector<uint8_t>, uint32_t and uint64_t. The last was fastest.

    But... if you allocate a std::vector it insists on initialising all the elements. That takes a while. I then played with reserve and
    emplace_back - and that was slow too. So I've rolled my own store.

    Allocating a zero filled buffer of enormous size with calloc is
    surprisingly fast. I have a feeling (which I can't prove) that under the
    hood Linux is allocating one zero filled page to me, and setting up the
    page tables for the rest of the buffer as unwriteable. It then copies
    the zero buffer whenever a write fault happens.

    Of course I can when needed use malloc, not calloc - sometimes I don't
    care what is in the buffer.

    That's the end of the on-topic part.

    I don't store even numbers. That's an optimisation, and is of course
    equivalent to Tim's Mod 30 trick, but with mod 2 instead. And it doesn't require a divide, which may be slow.

    When I mask out just 3 I get a pattern like this:

    2492492492492490

    There is a single nybble at the end of that which is zero, but the rest
    is a 3-nybble repeat. You see something similar with 5,

    4210842108421080

    except this time the repeat is 5 nybbles.

    It doesn't matter what your word size is - nybble, byte, uint64_t - the
    repeat is always the prime. It also incidentally doesn't matter if you
    do store evens, the repeat is still the size of the prime.

    Now 3 is the most expensive prime to write into the output mask. I have
    to write every third bit, which is a lot of operations. 5 is nearly as
    bad, and as the prime gets bigger the cost falls.

    But... I don't have to write all those bits. I can just copy the
    repeating part of the mask. With uint64_t the mask for 3 is

    2492492492492490,[9249249249249249,4924924924924924,2492492492492492]

    I don't copy that first word, only the three repeats.

    for 5 it's

    4210842108421080,[8421084210842108,0842108421084210,1084210842108421,2108421084210842,4210842108421084]

    I then realised that running an iterator over just 3 words was fairly
    expensive too - not as expensive as every bit, but it still hurts.

    However I can build a mask from the ones for 3 and 5, which has a repeat
    of 15 (3*5)
    6692cd259a4b3490,
    [96692cd259a4b349,496692cd259a4b34,3496692cd259a4b3,b3496692cd259a4b,
    4b3496692cd259a4,a4b3496692cd259a,9a4b3496692cd259,59a4b3496692cd25,
    259a4b3496692cd2,d259a4b3496692cd,cd259a4b3496692c,2cd259a4b3496692,
    92cd259a4b349669,692cd259a4b34966,6692cd259a4b3496]

    and copy that.

    In fact I can do that with any number of single prime masks I want.
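    A sketch of building one of these single-prime masks, assuming the
    odds-only layout described (bit k stands for the odd number 2k+1, and
    marking starts at 3p so p itself stays prime); the function name is
    mine, not the poster's:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// First n 64-bit words of the odds-only sieve mask for odd prime p.
// The bit stride between consecutive odd multiples of p is p, so after
// the first word the pattern repeats with a period of p words (64*p
// bits is an exact multiple of p).
std::vector<std::uint64_t> prime_mask(unsigned p, std::size_t n) {
    std::vector<std::uint64_t> w(n, 0);
    for (std::uint64_t bit = (3ull * p - 1) / 2; bit < 64 * n; bit += p)
        w[bit / 64] |= 1ull << (bit % 64);
    return w;
}
```

    prime_mask(3, 4) reproduces the words quoted above: the word
    2492492492492490 followed by the three-word repeat.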

    It gets big quite quickly - for 3 to 23 the repeat length is
    111,546,435. I found that it was best to:

    - Make a mask for the first 5 primes, 3 5 7 11 13.
    - Make a mask from all of these. That has a repeat of 15015 words.
    - Make masks for the next few primes, up to 31.
    - Combine the mask from the 15015 and the others up to 31. That's about
    100E9 in size. (It's better to do that in two stages, otherwise we reset
    the index on the small ones really often.)
    - Work out the biggest prime we have stored accurately. That will be
    just under 37 squared, 37 being the next one we haven't masked in.
    - Collect all the primes between 31 and 37^2
    - Go though the buffer in chunks of about 16k words, marking the
    multiples of all these primes.
    - We can now work out all the rest of the primes we need. Collect them,
    then go through the buffer again.
    - It's not necessary to do this a third time.

    Note that although this has been quite a fun thing to play with it's had
    almost no C++ content. In fact I found that to get best performance I
    had to drop down to C type code. I kept a class to manage my buffers
    though, it's much easier to drop an object that manages a malloc-ed
    buffer than to remember exactly when I need to call free. Correctness
    trumps speed.
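    A minimal sketch of such a buffer-owning class; the name and the
    interface here are illustrative, not the actual code being described:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// RAII owner of a calloc'd (zeroed) or malloc'd (uninitialised) word
// buffer: the destructor frees it on every path, so correctness does
// not depend on remembering to call free().
class Buffer {
    std::uint64_t *p_;
    std::size_t words_;
public:
    explicit Buffer(std::size_t words, bool zeroed = true)
        : p_(static_cast<std::uint64_t *>(
              zeroed ? std::calloc(words, sizeof(std::uint64_t))
                     : std::malloc(words * sizeof(std::uint64_t)))),
          words_(words) {}
    ~Buffer() { std::free(p_); }
    Buffer(const Buffer &) = delete;             // owning, so no copies
    Buffer &operator=(const Buffer &) = delete;
    std::uint64_t *data() { return p_; }
    std::size_t size() const { return words_; }
    std::uint64_t &operator[](std::size_t i) { return p_[i]; }
};
```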

    Now Mod 30 - by storing the odd numbers only I'm halving the storage,
    and I don't need to do an expensive divide. It's only shifts and ANDs.

    If I use mod 30 (30 = 2*3*5) conveniently there are 8 candidate numbers
    I need to store a mask for. Sadly my computer likes 64 bit operations
    better. But the required storage is now only 27% of the original mask if
    all numbers are put in it.

    I can carry on with this, but the storage doesn't have any more nice
    sizes. The number of bits in each page are:
    Prime    Modulus      Bits per page    Saving
        2            2                1       50%
        3            6                2       33%
        5           30                8       27%
        7          210               48       23%
       11         2310              480       21%
       13        30030             5760       19%
       17       510510            92160       18%
       19      9699690          1658880       17%
       23    223092870         36495360       16%
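    The bits-per-page column is just the count of residues coprime to
    the modulus (the totient of the primorial), which a brute-force loop
    confirms for the small moduli here:

```cpp
#include <cstdint>
#include <numeric>  // std::gcd (C++17)

// Residues modulo m that are coprime to m: the candidates a mod-m
// wheel must keep a bit for. Brute force is adequate at these sizes.
std::uint64_t coprime_count(std::uint64_t m) {
    std::uint64_t n = 0;
    for (std::uint64_t r = 1; r < m; ++r)
        if (std::gcd(r, m) == 1) ++n;
    return n;
}
```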

    Maybe I'll play with it. But the rain stopped, and I must get on with
    the weeding!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paavo Helde@21:1/5 to All on Thu May 30 19:55:44 2024

    Allocating a zero filled buffer of enormous size with calloc is
    surprisingly fast. I have a feeling (which I can't prove) that under the
    hood Linux is allocating one zero filled page to me, and setting up the
    page tables for the rest of the buffer as unwriteable. It then copies
    the zero buffer whenever a write fault happens.

    This might well be so (a standard COW page mapped initially to a special
    zeroed page), but from what google tells me there are also special
    kernel threads whose task is to zero out unused memory pages so they can
    be given to new processes. This is needed for security, but would also
    speed up calloc().

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Thu May 30 22:17:00 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 22/05/2024 03:06, Tim Rentsch wrote:

    Vir Campestris <vir.campestris@invalid.invalid> writes:

    I've been playing with this. Has it really been this long? I
    ought to have more time than this... [...]

    I missed Tim's Mod 30 trick, and that might well help. But I
    think I'm bored with this. It would save a lot of memory, but
    the extra computation might make it slower.

    Two comments. Using the mod 30 encoding can be speed competitive
    with simpler encodings. The second comment is, it isn't just
    that less memory is used, but that more primes can be found.
    Bonita brags about how fast some code is, but it's an apples
    and oranges comparison - that code doesn't do the job because
    it runs into a memory limit way sooner than using a mod 30
    encoding, and trying to run past that limit would make the
    code R E A L L Y S L O W.

    Maybe I'll try one day ;)

    I hope you do. If you have some difficulty seeing how to
    avoid the seeming speed penalties, feel free to ask, I am
    happy to help.

    I think I've come to the end of what my code will do. [...]

    A few miscellaneous responses...

    Firstly the C++ bit:

    std::vector<bool> is much too slow to be any use.
    [various alternatives]

    I simply used a static array, more accurately a union of
    two arrays (code here is approximate only):

    #define LARGEST_N 2400000000000 // primes less than 2.4e12

        union {
            unsigned char       bytes[ LARGEST_N / 30 ];
            unsigned long long  ulls[  LARGEST_N / 30 / 8 ];
        } all;

    That's the end of the on-topic part.

    I don't store even numbers. That's an optimisation, and is of course equivalent to Tim's Mod 30 trick, but with mod 2 instead. And it
    doesn't require a divide which may be slow.

    Like I keep saying, keeping the bits in a mod 30 representation
    doesn't need any divides (at least not in the main loop, where
    multiples are being computed).

    When I mask out just 3 I get a pattern like this:

    2492492492492490

    There is a single nybble at the end of that which is zero, but the
    rest is a 3-nybble repeat. [similarly for 3 and 5, etc...]

    In a mod 30 representation, the multiples of 7 repeat after 7
    bytes. If we replicate those 7 bytes 8 times, we get 7 unsigned
    long longs. Those 7 ULLs can be combined with logic operations into
    the all.ulls[] array. Similarly for 11, 13, etc, for as high as is
    reasonable (I think I topped out at 29).

    However I can build a mask from the ones for 3 and 5, which has a
    repeat of 15 (3*5)
    6692cd259a4b3490,
    96692cd259a4b349,496692cd259a4b34,3496692cd259a4b3,b3496692cd259a4b,
    4b3496692cd259a4,a4b3496692cd259a,9a4b3496692cd259,59a4b3496692cd25,
    259a4b3496692cd2,d259a4b3496692cd,cd259a4b3496692c,2cd259a4b3496692,
    92cd259a4b349669,692cd259a4b34966,6692cd259a4b3496

    and copy that.

    In fact I can do that with any number of single prime masks I want.
    [...]
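    The first word of that combined 3-and-5 mask can be reproduced directly (illustrative sketch; as above, bit k of a word stands for the odd number 2*k + 1, and 3 and 5 themselves are left unmarked):

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstdio>

    // Build word 0 of the odds-only composite mask for the primes 3 and 5.
    int main() {
        uint64_t word0 = 0;
        for (int k = 0; k < 64; ++k) {
            uint64_t n = 2 * (uint64_t)k + 1;     // the odd number for bit k
            if ((n % 3 == 0 && n > 3) || (n % 5 == 0 && n > 5))
                word0 |= uint64_t(1) << k;        // mark composite
        }
        printf("%016llx\n", (unsigned long long)word0);
        assert(word0 == 0x6692cd259a4b3490ull);   // matches the quoted word
        return 0;
    }
    ```
    
    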

    I thought about doing something along these lines but decided it
    was too much work. So I just did the blocks of 7*8 bytes, 11*8
    bytes, 13*8 bytes, etc, up to some fairly small prime, then
    switched to an alternate method where individual bytes are
    addressed and modified.

    Note that although this has been quite a fun thing to play with it's
    had almost no C++ content. In fact I found that to get best
    performance I had to drop down to C type code.

    Right. I used straight C, and I don't see any part of the
    program that would benefit from having C++.

    I kept a class to
    manage my buffers though, it's much easier to drop an object that
    manages a malloc-ed buffer than to remember exactly when I need to
    call free. Correctness trumps speed.

    I used static allocation for the big bitmap, and small local
    arrays (less than 50 * 8 bytes) for the small temporaries.
    No malloc()s or free()s to worry about. Initialized to zero
    without any code needed.

    Now Mod 30 - by storing the odd numbers only I'm halving the storage,
    and I don't need to do an expensive divide. It's only shifts and ANDs.

    If I use mod 30 (30 = 2*3*5) conveniently there are 8 candidate
    numbers I need to store a mask for.

    By special casing each of the possibilities (there are 8*8 == 64 of
    them), only multiplies and adds are needed, and no shifts -- masks
    are evaluated to compile-time constants.

    Sadly my computer likes 64 bit
    operations better. But the required storage is now only 27% of the
    original mask if all numbers are put in it.

    Once the primes get above a fairly small number, the density of
    places to zap is low enough so that working in bytes is okay.
    In particular once we are looking at primes above 240 there is
    less than one bit per 64 bit ULL, which makes using bytes
    rather than ULLs less painful

    I can carry on with this, but the storage doesn't have any more nice
    sizes. [...]

    Yeah. It's a happy coincidence that mod 30 needs 8 bits
    per block-of-30, and unfortunately no other nice size in
    sight for larger prime-set mods.

    Maybe I'll play with it. But the rain stopped, and I must get on
    with the weeding!

    Now you have something to look forward to the next time
    it rains. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paavo Helde@21:1/5 to Bonita Montero on Fri May 31 20:52:25 2024
    On 31.05.2024 11:17, Bonita Montero wrote:
    Am 30.05.2024 um 18:55 schrieb Paavo Helde:

    This might well be so (a standard COW page mapped initially to a
    special zeroed page), but from what google tells me there are also
    special kernel threads whose task is to zero out unused memory pages
    so they can be given to new processes. ...

    Any reference on that ?


    "https://answers.microsoft.com/en-us/windows/forum/all/physical-and-virtual-memory-in-windows-10/e36fb5bc-9ac8-49af-951c-e7d39b979938"

    "The memory manager maintains a thread that wakes up periodically to
    initialize pages on the Free page list so that it can move them to the
    Zero page list. The Zero page list contains pages that have been
    initialized to zero, ready for use when the memory manager needs a new
    page."

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Tim Rentsch on Sat Jun 1 21:07:56 2024
    On 31/05/2024 06:17, Tim Rentsch wrote:
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 22/05/2024 03:06, Tim Rentsch wrote:

    Vir Campestris <vir.campestris@invalid.invalid> writes:

    I've been playing with this. Has it really been this long? I
    ought to have more time than this... [...]

    I missed Tim's Mod 30 trick, and that might well help. But I
    think I'm bored with this. It would save a lot of memory, but
    the extra computation might make it slower.

    Two comments. Using the mod 30 encoding can be speed competitive
    with simpler encodings. The second comment is, it isn't just
    that less memory is used, but that more primes can be found.
    Bonita brags about how fast some code is, but it's an apples
    and oranges comparison - that code doesn't do the job because
    it runs into a memory limit way sooner than using a mod 30
    encoding, and trying to run past that limit would make the
    code R E A L L Y S L O W.

    Maybe I'll try one day ;)

    I hope you do. If you have some difficulty seeing how to
    avoid the seeming speed penalties, feel free to ask, I am
    happy to help.

    I think I've come to the end of what my code will do. [...]

    A few miscellaneous responses...

    Firstly the C++ bit:

    std::vector<bool> is much too slow to be any use.
    [various alternatives]

    I simply used a static array, more accurately a union of
    two arrays (code here is approximate only):

    #define LARGEST_N 2400000000000 // primes less than 2.4e12

        union {
            unsigned char       bytes[ LARGEST_N / 30 ];
            unsigned long long  ulls[  LARGEST_N / 30 / 8 ];
        } all;

    That's the end of the on-topic part.

    I don't store even numbers. That's an optimisation, and is of course
    equivalent to Tim's Mod 30 trick, but with mod 2 instead. And it
    doesn't require a divide which may be slow.

    Like I keep saying, keeping the bits in a mod 30 representation
    doesn't need any divides (at least not in the main loop, where
    multiples are being computed).

    When I mask out just 3 I get a pattern like this:

    2492492492492490

    There is a single nybble at the end of that which is zero, but the
    rest is a 3-nybble repeat. [similarly for 3 and 5, etc...]

    In a mod 30 representation, the multiples of 7 repeat after 7
    bytes. If we replicate those 7 bytes 8 times, we get 7 unsigned
    long longs. Those 7 ULLs can be combined with logic operations into
    the all.ulls[] array. Similarly for 11, 13, etc, for as high as is reasonable (I think I topped out at 29).

    However I can build a mask from the ones for 3 and 5, which has a
    repeat of 15 (3*5)
    6692cd259a4b3490,
    96692cd259a4b349,496692cd259a4b34,3496692cd259a4b3,b3496692cd259a4b,
    4b3496692cd259a4,a4b3496692cd259a,9a4b3496692cd259,59a4b3496692cd25,
    259a4b3496692cd2,d259a4b3496692cd,cd259a4b3496692c,2cd259a4b3496692,
    92cd259a4b349669,692cd259a4b34966,6692cd259a4b3496

    and copy that.

    In fact I can do that with any number of single prime masks I want.
    [...]

    I thought about doing something along these lines but decided it
    was too much work. So I just did the blocks of 7*8 bytes, 11*8
    bytes, 13*8 bytes, etc, up to some fairly small prime, then
    switched to an alternate method where individual bytes are
    addressed and modified.

    Note that although this has been quite a fun thing to play with it's
    had almost no C++ content. In fact I found that to get best
    performance I had to drop down to C type code.

    Right. I used straight C, and I don't see any part of the
    program that would benefit from having C++.

    I kept a class to
    manage my buffers though, it's much easier to drop an object that
    manages a malloc-ed buffer than to remember exactly when I need to
    call free. Correctness trumps speed.

    I used static allocation for the big bitmap, and small local
    arrays (less than 50 * 8 bytes) for the small temporaries.
    No malloc()s or free()s to worry about. Initialized to zero
    without any code needed.

    Now Mod 30 - by storing the odd numbers only I'm halving the storage,
    and I don't need to do an expensive divide. It's only shifts and ANDs.

    If I use mod 30 (30 = 2*3*5) conveniently there are 8 candidate
    numbers I need to store a mask for.

    By special casing each of the possibilities (there are 8*8 == 64 of
    them), only multiplies and adds are needed, and no shifts -- masks
    are evaluated to compile-time constants.

    Sadly my computer likes 64 bit
    operations better. But the required storage is now only 27% of the
    original mask if all numbers are put in it.

    Once the primes get above a fairly small number, the density of
    places to zap is low enough so that working in bytes is okay.
    In particular once we are looking at primes above 240 there is
    less than one bit per 64 bit ULL, which makes using bytes
    rather than ULLs less painful


    One C++ bit I did use was to template all my classes with the storetype,
    so I could run with 8, 32 or 64 bits. 64 was fastest. I didn't bother
    with 16. The store access is of course, under the hood, a cache line.

    I can carry on with this, but the storage doesn't have any more nice
    sizes. [...]

    Yeah. It's a happy coincidence that mod 30 needs 8 bits
    per block-of-30, and unfortunately no other nice size in
    sight for larger prime-set mods.

    Maybe I'll play with it. But the rain stopped, and I must get on
    with the weeding!

    Now you have something to look forward to the next time
    it rains. :)

    I'm missing something.

    When setting the bits for a multiple of a reasonably large prime what I
    do is:
    (1) Shift right 1 to drop the last bit.
    (2) The remaining top bits tell me which word needs to be accessed. I
    shift the multiple right to get the index of the word.
    (3) The remaining bottom bits tell me which bit in the word needs to be
    set, so I shift 1 left by those bottom bits to get a mask.

    My operation (2) is a divide, by a power of 2.

    Take 997 as an example:
    multiple mod30 div30
    997 7 33
    1994 14 66
    2991 21 99
    3988 28 132
    4985 5 166
    5982 12 199
    6979 19 232
    7976 26 265
    8973 3 299
    9970 10 332
    10967 17 365
    11964 24 398
    12961 1 432
    13958 8 465
    14955 15 498
    15952 22 531
    16949 29 564
    17946 6 598
    18943 13 631
    19940 20 664
    20937 27 697
    21934 4 731
    22931 11 764
    23928 18 797
    24925 25 830
    25922 2 864
    26919 9 897
    27916 16 930
    28913 23 963
    29910 0 997
    30907 7 1030

    I don't see how you get that last column without a divide by 30.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Damon@21:1/5 to Vir Campestris on Sat Jun 1 20:43:49 2024
    On 6/1/24 4:07 PM, Vir Campestris wrote:
    On 31/05/2024 06:17, Tim Rentsch wrote:
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 22/05/2024 03:06, Tim Rentsch wrote:

    Vir Campestris <vir.campestris@invalid.invalid> writes:

    I've been playing with this.  Has it really been this long?  I
    ought to have more time than this...  [...]

    I missed Tim's Mod 30 trick, and that might well help.  But I
    think I'm bored with this.  It would save a lot of memory, but
    the extra computation might make it slower.

    Two comments.  Using the mod 30 encoding can be speed competitive
    with simpler encodings.  The second comment is, it isn't just
    that less memory is used, but that more primes can be found.
    Bonita brags about how fast some code is, but it's an apples
    and oranges comparison - that code doesn't do the job because
    it runs into a memory limit way sooner than using a mod 30
    encoding, and trying to run past that limit would make the
    code R E A L L Y  S L O W.

    Maybe I'll try one day ;)

    I hope you do.  If you have some difficulty seeing how to
    avoid the seeming speed penalties, feel free to ask, I am
    happy to help.

    I think I've come to the end of what my code will do. [...]

    A few miscellaneous responses...

    Firstly the C++ bit:

    std::vector<bool> is much too slow to be any use.
    [various alternatives]

    I simply used a static array, more accurately a union of
    two arrays (code here is approximate only):

         #define LARGEST_N 2400000000000  // primes less than 2.4e12

         union {
             unsigned char       bytes[ LARGEST_N / 30 ];
             unsigned long long  ulls[  LARGEST_N / 30 / 8 ];
         } all;

    That's the end of the on-topic part.

    I don't store even numbers.  That's an optimisation, and is of course
    equivalent to Tim's Mod 30 trick, but with mod 2 instead.  And it
    doesn't require a divide which may be slow.

    Like I keep saying, keeping the bits in a mod 30 representation
    doesn't need any divides (at least not in the main loop, where
    multiples are being computed).

    When I mask out just 3 I get a pattern like this:

    2492492492492490

    There is a single nybble at the end of that which is zero, but the
    rest is a 3-nybble repeat.  [similarly for 3 and 5, etc...]

    In a mod 30 representation, the multiples of 7 repeat after 7
    bytes.  If we replicate those 7 bytes 8 times, we get 7 unsigned
    long longs.  Those 7 ULLs can be combined with logic operations into
    the all.ulls[] array.  Similarly for 11, 13, etc, for as high as is
    reasonable (I think I topped out at 29).

    However I can build a mask from the ones for 3 and 5, which has a
    repeat of 15 (3*5)
    6692cd259a4b3490,
    96692cd259a4b349,496692cd259a4b34,3496692cd259a4b3,b3496692cd259a4b,
    4b3496692cd259a4,a4b3496692cd259a,9a4b3496692cd259,59a4b3496692cd25,
    259a4b3496692cd2,d259a4b3496692cd,cd259a4b3496692c,2cd259a4b3496692,
    92cd259a4b349669,692cd259a4b34966,6692cd259a4b3496

    and copy that.

    In fact I can do that with any number of single prime masks I want.
    [...]

    I thought about doing something along these lines but decided it
    was too much work.  So I just did the blocks of 7*8 bytes, 11*8
    bytes, 13*8 bytes, etc, up to some fairly small prime, then
    switched to an alternate method where individual bytes are
    addressed and modified.

    Note that although this has been quite a fun thing to play with it's
    had almost no C++ content.  In fact I found that to get best
    performance I had to drop down to C type code.

    Right.  I used straight C, and I don't see any part of the
    program that would benefit from having C++.

    I kept a class to
    manage my buffers though, it's much easier to drop an object that
    manages a malloc-ed buffer than to remember exactly when I need to
    call free.  Correctness trumps speed.

    I used static allocation for the big bitmap, and small local
    arrays (less than 50 * 8 bytes) for the small temporaries.
    No malloc()s or free()s to worry about.  Initialized to zero
    without any code needed.

    Now Mod 30 - by storing the odd numbers only I'm halving the storage,
    and I don't need to do an expensive divide.  It's only shifts and ANDs.
    If I use mod 30 (30 = 2*3*5) conveniently there are 8 candidate
    numbers I need to store a mask for.

    By special casing each of the possibilities (there are 8*8 == 64 of
    them), only multiplies and adds are needed, and no shifts -- masks
    are evaluated to compile-time constants.

    Sadly my computer likes 64 bit
    operations better.  But the required storage is now only 27% of the
    original mask if all numbers are put in it.

    Once the primes get above a fairly small number, the density of
    places to zap is low enough so that working in bytes is okay.
    In particular once we are looking at primes above 240 there is
    less than one bit per 64 bit ULL, which makes using bytes
    rather than ULLs less painful


    One C++ bit I did use was to template all my classes with the storetype,
    so I could run with 8, 32 or 64 bits. 64 was fastest. I didn't bother
    with 16. The store access is of course, under the hood, a cache line.

    I can carry on with this, but the storage doesn't have any more nice
    sizes.  [...]

    Yeah.  It's a happy coincidence that mod 30 needs 8 bits
    per block-of-30, and unfortunately no other nice size in
    sight for larger prime-set mods.

    Maybe I'll play with it.  But the rain stopped, and I must get on
    with the weeding!

    Now you have something to look forward to the next time
    it rains. :)

    I'm missing something.

    When setting the bits for a multiple of a reasonably large prime what I
    do is:
    (1) Shift right 1 to drop the last bit.
    (2) The remaining top bits tell me which word needs to be accessed. I
    shift the multiple right to get the index of the word.
    (3) The remaining bottom bits tell me which bit in the word needs to be
    set, so I shift 1 left by those bottom bits to get a mask.

    My operation (2) is a divide, by a power of 2.

    Take 997 as an example:
    multiple    mod30    div30
    997    7    33
    1994    14    66
    2991    21    99
    3988    28    132
    4985    5    166
    5982    12    199
    6979    19    232
    7976    26    265
    8973    3    299
    9970    10    332
    10967    17    365
    11964    24    398
    12961    1    432
    13958    8    465
    14955    15    498
    15952    22    531
    16949    29    564
    17946    6    598
    18943    13    631
    19940    20    664
    20937    27    697
    21934    4    731
    22931    11    764
    23928    18    797
    24925    25    830
    25922    2    864
    26919    9    897
    27916    16    930
    28913    23    963
    29910    0    997
    30907    7    1030

    I don't see how you get that last column without a divide by 30.

    Andy



    Incrementally.

    Each line increases the div30 by a fixed amount, with a +1 every time the
    mod30 accumulator gets to 30 or above. So you can take the prime and
    compute the mod30 / div30 for it, and then move to the next multiple
    with only adds and a compare.
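    That incremental scheme can be checked against Andy's 997 table (illustrative sketch, not Richard's code): 997 = 33*30 + 7, so each multiple adds 33 to div30 and 7 to mod30, with a carry when the accumulator reaches 30.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstdio>

    int main() {
        const uint32_t p = 997;
        uint32_t div30 = p / 30, mod30 = p % 30;   // one divide, outside the loop
        const uint32_t dDiv = p / 30, dMod = p % 30;
        for (uint64_t mult = p; mult <= 30907; mult += p) {
            // the running pair always matches a true divide/remainder
            assert(div30 == mult / 30 && mod30 == mult % 30);
            div30 += dDiv;                         // step: adds only
            mod30 += dMod;
            if (mod30 >= 30) { mod30 -= 30; ++div30; }  // the carry
        }
        printf("ok\n");
        return 0;
    }
    ```
    
    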

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Sun Jun 2 03:23:24 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 31/05/2024 06:17, Tim Rentsch wrote:

    Vir Campestris <vir.campestris@invalid.invalid> writes:

    [...]

    I don't store even numbers. That's an optimisation, and is of
    course equivalent to Tim's Mod 30 trick, but with mod 2 instead.
    And it doesn't require an divide which may be slow.

    Like I keep saying, keeping the bits in a mod 30 representation
    doesn't need any divides (at least not in the main loop, where
    multiples are being computed).

    [...]

    I'm missing something.

    When setting the bits for a multiple of a reasonably large prime
    what I do is:

    (1) Shift right 1 to drop the last bit.

    (2) The remaining top bits tell me which word needs to be accessed.
    I shift the multiple right to get the index of the word.

    (3) The remaining bottom bits tell me which bit in the word needs
    to be set, so I shift 1 left by those bottom bits to get a mask.

    My operation (2) is a divide, by a power of 2.

    Take 997 as an example:
    multiple mod30 div30
    997 7 33
    1994 14 66
    2991 21 99
    3988 28 132
    4985 5 166
    5982 12 199
    6979 19 232
    7976 26 265
    8973 3 299
    9970 10 332
    [etc]

    I don't see how you get that last column without a divide by 30.

    Good example. Let's look at this example more closely, first in
    the a-bit-for-only-odd-numbers scheme. Forgive me if any of the
    following seems obvious. Let me redo your table slightly:

    multiplier product mod 2
    1 997 1
    2 1994 0
    3 2991 1
    4 3988 0
    5 4985 1
    6 5982 0
    7 6979 1
    8 7976 0
    9 8973 1
    10 9970 0
    11 10967 1
    [etc]

    Which multipliers do we need to consider? Obviously not 1.
    After that, we don't need any even multipliers, since they
    are all divisible by two, and there aren't even places in
    the table where the product could go. So we quickly get to
    this:

    multiplier product mod 2
    3 2991 1
    5 4985 1
    7 6979 1
    9 8973 1
    11 10967 1
    [etc]

    Now which multipliers do we need? We don't need 3 - we have
    already crossed out all multiples of 3. Similarly we don't
    need 5, 7, or 11. We don't need 9, because it has a factor
    of 3 - any multiple of 9 is also a multiple of 3 and has already
    been crossed out. And so forth for 13, 15, 17, 19, 21, ...
    A little thought should convince you we don't need to consider
    any multiplier less than 997. Let's pick up the table from
    there (I have taken out 'product' and put in 'factors'):

    multiplier factors
    997 997
    999 3 3 3 37
    1001 7 11 13
    1003 17 59
    1005 3 5 67
    1007 19 53
    1009 1009
    1011 3 337

    We need to eliminate 997*997, as it certainly has not been
    crossed out before. We don't need to consider 999, 1001,
    1003, 1005, 1007, or 1011, because they all have prime
    factors less than 997, and so these products have already
    been eliminated. For 1009 though, it's prime, and has no
    smaller prime factors, so we need to cross out 997*1009.
    And so forth. Speaking more generally, we need to consider
    multipliers greater than 997 (as well as 997 itself) that
    have not already been marked as composite. We discover what
    those numbers are by scanning the bitmap. In pseudocode:

    for each prime p :
        starting at the bitmap location for p, and up
        for each not-yet-marked-composite m :
            mark as composite p*m
        end for
    end for

    (I need to put in an asterisk here to say that this code
    mostly works but misses some composites, and if you play
    around with it you will see why. But that isn't important
    to my explanation so I'd like to pretend for now that
    this code is what we need to look at (and in fact it
    isn't far from code that actually does work).)

    Now let's think about how things change when we are storing
    bits only for numbers that are nonzero mod 30, so one block
    of 30 per byte. First it should be obvious that we need to
    consider only those multipliers that are nonzero mod 30, or
    in other words those numbers that correspond to a bit in the
    table. Pseudocode now looks something like this:

    for each index i in the table :
        for each bit in table[i] :
            if that bit has been marked composite : continue
            // we have found the next prime
            for each index j > i in the table :
                for each bit in table[j] :
                    if that bit has been marked composite : continue
                    // we have found a multiplier m and ..
                    // need to mark as composite p*m
                end
            end
        end
    end

    The relevant values are:

    p == i*30 + x
    m == j*30 + y

    where x and y are constant values that are nonzero mod 30,
    that is, from the set { 1, 7, 11, 13, 17, 19, 23, 29 },
    depending only on the respective bit positions for p and m.

    The product p*m is what we're looking for, and in particular
    p*m / 30 and p*m % 30. Doing a little algebra

    p*m = (i*30 + x) * (j*30 + y)
    = i*30 * j*30 + i*30*y + j*30*x + x*y

    The components of this last expression are

    p*m / 30 = 30*i*j + i*y + j*x + (x*y/30)
    p*m % 30 = (x*y%30)

    In the pseudocode we said 'for each bit in table[...]', but
    rather than use a for loop we write 8 specific tests, so the
    values x and y are specific constants in each of the 8 cases.
    Since they are constants, the values (x*y/30) and (x*y%30) can in
    each of the 64 cases be computed as compile-time constants, and
    so we eliminate the divide and remainder operations from run-time
    by computing them at compile time. Similarly the value (x*y%30)
    can be turned into a bit mask at compile-time, because it's just
    a long series of ?: expressions, one for each of the 8 values of
    nonzero residues mod 30.

    Note that the expression for p*m/30 gives the index in the
    table for where to zap the bit for the product.
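    A sketch of those precomputed tables (illustrative only; the actual code spells the 64 cases out so x*y/30 and the mask for x*y%30 become compile-time constants):

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstdio>

    // The 8 nonzero residues mod 30, one per bit of a byte.
    static const int R[8] = {1, 7, 11, 13, 17, 19, 23, 29};

    // Bit position of a residue coprime to 30; usable at compile time.
    constexpr int bitFor(int residue) {
        const int r[8] = {1, 7, 11, 13, 17, 19, 23, 29};
        for (int i = 0; i < 8; ++i)
            if (r[i] == residue) return i;
        return -1;   // never hit: residues coprime to 30 are closed under *
    }

    int main() {
        // For all 64 (x, y) pairs, the quotient and the bit both exist.
        for (int a = 0; a < 8; ++a)
            for (int b = 0; b < 8; ++b) {
                int x = R[a], y = R[b];
                int div = x * y / 30;            // index adjustment
                int bit = bitFor(x * y % 30);    // mask position
                assert(div >= 0 && bit >= 0);
            }
        // spot check: 7*7 = 49 = 1*30 + 19, and 19 is bit 5
        assert(7 * 7 / 30 == 1 && bitFor(7 * 7 % 30) == 5);
        printf("ok\n");
        return 0;
    }
    ```
    
    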

    Were you able to follow all that? Do you see how the scheme
    works now?

    I have deliberately left out explaining the problem of missing
    some composite values and what to do about it, but I can
    explain further if someone has trouble with that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Tim Rentsch on Sun Jun 2 19:50:36 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Pseudocode now looks something like this:

    for each index i in the table :
        for each bit in table[i] :
            if that bit has been marked composite : continue
            // we have found the next prime
            for each index j > i in the table :
                for each bit in table[j] :
                    if that bit has been marked composite : continue
                    // we have found a multiplier m and ..
                    // need to mark as composite p*m
                end
            end
        end
    end

    Correction: that should be

    for each index j >= i in the table :

    Also I have left out some possible optimizations, in the
    interest of simplifying the explanation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to All on Tue Jun 18 20:56:04 2024
    OK, I'm getting some weird results.

    Firstly, whatever I do my original algorithm storing odds only is faster
    than using mod30. Quite a bit faster. I think this is because mod30
    requires it to access each byte in the map individually, rather than
    eight at a time.

    How much is when I got weird.

    Going through the large primes I've got a cache size.

    I go through each of the primes, adding a modulo and a word index to get
    the bit that needs to be set next. When the modulo overflows the word
    index gets incremented too.

    I do that for a block of memory, then repeat for the next prime, and so
    on until I've done all the primes. I then go onto the next block of
    memory - which is tuned to fit in the cache.
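    The blocked pass described above can be sketched roughly like this (my reconstruction for illustration, not Andy's code): each prime carries its next bit index across blocks, so every cache-sized block is finished before the sieve moves on.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        const uint64_t limit = 1000000;          // sieve odd numbers up to this
        const uint64_t nBits = (limit + 1) / 2;  // bit k <-> odd number 2k+1
        const uint64_t block = 1 << 15;          // bits per block; tune to the cache
        std::vector<uint8_t> map(nBits, 0);

        // base primes up to sqrt(limit), by trial division (small one-off cost)
        std::vector<uint64_t> primes, next;
        for (uint64_t p = 3; p * p <= limit; p += 2) {
            bool isPrime = true;
            for (uint64_t q : primes) {
                if (q * q > p) break;
                if (p % q == 0) { isPrime = false; break; }
            }
            if (isPrime) { primes.push_back(p); next.push_back(p * p / 2); }
        }

        // blocked marking: per-prime state survives from one block to the next
        for (uint64_t lo = 0; lo < nBits; lo += block) {
            const uint64_t hi = std::min(lo + block, nBits);
            for (size_t i = 0; i < primes.size(); ++i) {
                uint64_t k = next[i];
                for (; k < hi; k += primes[i])   // step of p bits == 2p in numbers
                    map[k] = 1;
                next[i] = k;                     // resume here in the next block
            }
        }

        uint64_t count = 1;                      // the prime 2
        for (uint64_t k = 1; k < nBits; ++k)
            if (!map[k]) ++count;
        printf("%llu\n", (unsigned long long)count);
        assert(count == 78498);                  // pi(1000000)
        return 0;
    }
    ```
    
    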

    Rather than do some arithmetic I thought I'd just run through a loop
    with different sizes of that memory block, and pick the best one. That's
    a bit under a megabyte.

    I then remove the loop, instead having a constexpr for the size.

    The code now runs a third slower.

    I can have the loop go around exactly once, and it is still way faster
    than the hard coded constant!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Tue Jun 18 17:34:39 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    OK, I'm getting some weird results.

    Firstly, whatever I do my original algorithm storing odds only is
    faster than using mod30. Quite a bit faster. I think this is because
    mod30 requires it to access each byte in the map individually, rather
    than eight at a time.

    How much is when I got weird.

    Going through the large primes I've got a cache size.

    I go through each of the primes, adding a modulo and a word index to
    get the bit that needs to be set next. When the modulo overflows the
    word index gets incremented too.

    I do that for a block of memory, then repeat for the next prime, and
    so on until I've done all the primes. I then go onto the next block of memory - which is tuned to fit in the cache.

    Rather than do some arithmetic I thought I'd just run through a loop
    with different sizes of that memory block, and pick the best
    one. That's a bit under a megabyte.

    I then remove the loop, instead having a constexpr for the size.

    The code now runs a third slower.

    I can have the loop go around exactly once, and it is still way faster
    than the hard coded constant!

    Would you mind if I asked you to post the code so I can
    take a look at it? Or if you would rather email it to
    me that also works.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Tim Rentsch on Sun Jun 30 21:47:54 2024
    On 19/06/2024 01:34, Tim Rentsch wrote:
    Would you mind if I asked you to post the code so I can
    take a look at it? Or if you would rather email it to
    me that also works.

    I've hacked out all my include files, added a GPL licence, and included
    build instructions. It follows in the signature block. 600 lines :(

    I bet the line wrapping breaks it too.

    Andy
    --
    /*
    Programme to calculate primes using the Sieve of Eratosthenes
    Copyright (C) 2024 the person known as Vir Campestris

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <https://www.gnu.org/licenses/>.
    */

    /*
    I build this programme in two ways on Linux. These commands

    g++ -O3 --std=c++20 usenet.cpp -o usenet && time ./usenet
    g++ -O3 --std=c++20 -D FAST usenet.cpp -o usenet && time ./usenet

    compile and run it with, and without, a loop that goes around once.

    It is faster when the loop is present.
    */

    #include <algorithm>
    #include <cassert>
    #include <chrono>
    #include <climits>
    #include <cmath>
    #include <csignal>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <list>
    #include <sstream>
    #include <string>
    #include <thread>
    #include <vector>

    // If these trigger you have a really odd architecture.
    static_assert(sizeof(uint8_t) == 1);
    static_assert(CHAR_BIT == 8);

    // The type of a prime number.
    // size_t isn't big enough on a 32-bit machine with 2GB memory
    // allocated and 16 primes per byte.
    typedef uint64_t PRIME;

    // Internal class used to store the mask data for one or more
    // primes rather than all primes. This has an initial block,
    // followed by a block which contains data which is repeated
    // for all higher values.
    // Example: StoreType == uint8_t, prime == 3; the mask should be
    // 0x90, meaning 1,3,5,7,11,13 are prime but not 9, 15.
    // 0x24,0x49,0x92 covers odds 17-63;
    // 0x24,0x49,0x92 covers 65-111, with an identical mask.
    // As the mask is identical we only need to store the
    // non-repeat, plus one copy of the repeated block and data to
    // show where it starts and ends. Note that for big primes the
    // initial block may be more than one StoreType.
    // As there are no common factors between any prime and the
    // number of bits in StoreType the repeat is the size of the
    // prime.
    // If you have an unusual architecture,
    // such as ICL 1900 (24 bit word)
    // or DECSystem10 (36 bit word)
    // this repeat could be smaller for 3,
    // but that is too much of a corner case for me to care.
    // I haven't seen one of them since the 1970s.
    template <typename StoreType, PRIME modulus=30> class Masker
    {
    // The length of the initial block.
    size_t initSize;

    // The size of the repeated part. The entire store size is the
    // sum of these two.
    size_t repSize;

    // The actual buffer
    // Allocates a zero filled buffer
    class Buffer
    {
    StoreType* mpBuffer; // base of the data referred to
    size_t mSize; // size of the buffer in StoreType-s.

    public:
    // Normal constructor
    Buffer(size_t size = 0) : mpBuffer(nullptr), mSize(size)
    {
    if (size > 0)
    {
    mpBuffer = static_cast<StoreType*>(std::malloc(size * sizeof(StoreType)));

    if (!mpBuffer)
    throw std::bad_alloc();
    }
    }

    // Move constructor
    Buffer(Buffer&& source) : mpBuffer(source.mpBuffer), mSize(source.mSize)
    {
    source.mSize = 0;
    source.mpBuffer = nullptr;
    }

    // Move assignment
    Buffer& operator=(Buffer&& source)
    {
    if (mpBuffer)
    std::free(mpBuffer);
    mpBuffer = source.mpBuffer;
    mSize = source.mSize;
    source.mSize = 0;
    source.mpBuffer = nullptr;
    return *this;
    }

    void resize(size_t newSize)
    {
    mpBuffer = static_cast<StoreType*>(std::realloc(mpBuffer, newSize *
    sizeof(StoreType)));
    if (!mpBuffer)
    throw std::bad_alloc();
    mSize = newSize;
    }

    // Get the data start
    inline StoreType* operator*() const
    {
    return mpBuffer;
    }

    // Get the owned size
    inline size_t const & size() const
    {
    return mSize;
    }

    // clean up.
    ~Buffer()
    {
    if (mpBuffer)
    std::free(mpBuffer);
    }
    } mBuffer;

    // Raw pointer to the store for speed.
    StoreType * mStorePtr;

    // Lookup table for the mask bit from the modulus
    static StoreType const mMaskLookup[modulus];

    public:
    class MaskIterator
    {
    StoreType const * index, *repStart, *repEnd;
    public:
    MaskIterator(Masker const * origin)
    : index(origin->mStorePtr)
    , repStart(index + origin->initSize)
    , repEnd(repStart + origin->repSize)
    {}

    inline StoreType const & next() // returns value, with iterator post-increment
    {
    auto const & retval = *index;
    if (++index >= repEnd)
    index = repStart;
    return retval;
    }
    };

    // Default constructor. Makes a dummy mask for overwriting.
    Masker()
    : initSize(0)
    , repSize(1)
    , mBuffer(1)
    , mStorePtr(*mBuffer)
    {
    *mStorePtr = 0; // nothing marked unprime
    }

    // Move constructor (destroys source)
    Masker(Masker&& source)
    : initSize(source.initSize)
    , repSize(source.repSize)
    , mBuffer(std::move(source.mBuffer))
    , mStorePtr(*mBuffer)
    {
    }

    // Copy constructor (preserves source)
    Masker(Masker const & source)
    : initSize(source.initSize)
    , repSize(source.repSize)
    , mBuffer(initSize + repSize)
    , mStorePtr(*mBuffer)
    {
    memcpy(*mBuffer, *source.mBuffer, mBuffer.size() * sizeof(StoreType));
    }

    // move (destroys source)
    Masker& operator=(Masker&& source)
    {
    initSize = source.initSize;
    repSize = source.repSize;
    mStorePtr = source.mStorePtr;
    mBuffer = std::move(source.mBuffer);
    return *this;
    }

    // Construct for a single prime
    Masker(PRIME prime)
    : initSize((prime/modulus)+1)
    , repSize(prime)
    , mBuffer(initSize + repSize)
    , mStorePtr(*mBuffer)
    {
    // Pre-zero the buffer. For bigger primes we don't
    // write every word.
    // This is fast enough not to care about excess writes.
    memset(mStorePtr, 0, mBuffer.size() * sizeof(StoreType));

    // The max prime we'll store
    PRIME const maxPrime = (initSize + repSize) * modulus;

    // Filter all the bits. Note that although we
    // only strictly need to filter from prime**2,
    // not from prime*3, starting from prime*3 makes
    // the init block smaller.
    PRIME const prime2 = prime * 2;
    PRIME const prime3 = prime * 3;
    PRIME const modstep = prime % modulus;
    PRIME const wordstep = prime / modulus;
    PRIME mod = prime3 % modulus;
    PRIME word = prime3 / modulus;

    do
    {
    auto const & mask = mMaskLookup[mod];
    if (mask)
    mStorePtr[word] |= mask;
    mod += modstep;
    if (mod >= modulus)
    {
    mod -= modulus;
    word += wordstep + 1;
    }
    else word += wordstep;
    }
    while (word < mBuffer.size());
    }

    // Construct from two others, making a mask that excludes all
    // numbers marked not prime in either of them. The output init
    // block will be the largest from either of the inputs; the
    // repeat block size will be the product of the input repeat
    // sizes.
    // The output could get quite large - 3.7E12 for all primes up
    // to 37.
    Masker(const Masker& left, const Masker& right, size_t storeLimit = 0)
    : initSize(std::max(left.initSize, right.initSize))
    , repSize (left.repSize * right.repSize)
    , mBuffer()
    {
    if (storeLimit)
    repSize = std::min(initSize + repSize, storeLimit) - initSize;

    auto storeSize = initSize + repSize;

    // Actually construct the store with the desired size
    mBuffer = storeSize;
    mStorePtr = *mBuffer;


    auto li = left.begin();
    auto ri = right.begin();

    for (size_t word = 0; word < storeSize; ++word)
    {
    mStorePtr[word] = li.next() | ri.next();
    }
    }

    // Construct from several others, making a mask that excludes
    // all numbers marked not prime in any of them. The output init
    // block will be the largest from any of the inputs; the repeat
    // size will be the product of all the input repeat sizes.
    // The output could get quite large - 3.7E12 for all primes up
    // to 37.
    // The type iterator should be an iterator into a collection
    // of Masker-s.
    template <typename iterator> Masker(const iterator& begin, const iterator& end, size_t storeLimit = 0)
    : initSize(0)
    , repSize(1)
    , mBuffer() // empty for now
    {
    // Iterate over the inputs. We will
    // * Determine the maximum init size
    // * Determine the product of all the repeat sizes.
    // * Record the number of primes we represent.
    size_t nInputs = std::distance(begin, end);
    std::vector<MaskIterator> iterators;
    iterators.reserve(nInputs);

    for (auto i = begin; i != end; ++i)
    {
    initSize = std::max(initSize, i->initSize);
    repSize *= i->repSize;
    iterators.push_back(i->begin());
    }
    if (storeLimit)
    repSize = std::min(initSize + repSize, storeLimit) - initSize;
    auto storeSize = initSize + repSize;
    // Actually construct the store with the desired size
    mBuffer = storeSize;
    mStorePtr = *mBuffer;

    // take the last one off (most efficient to remove)
    // and use it as the initial mask value
    auto last = iterators.back();
    iterators.pop_back();
    for (size_t word = 0; word < storeSize; ++word)
    {
    StoreType mask = last.next();
    for(auto &i: iterators)
    {
    mask |= i.next();
    }
    mStorePtr[word] = mask;
    }
    }

    template <typename iterator> Masker& andequals(size_t cachesize, iterator begin, iterator end)
    {
    // first is the prime itself; second is the current multiple
    // of it _without_ the last bit
    struct value
    {
    PRIME const modstep;
    size_t const wordstep;
    PRIME mod;
    size_t word;
    value(PRIME prime)
    : modstep(prime % modulus)
    , wordstep(prime / modulus)
    {
    PRIME const primeS = prime * prime;
    mod = primeS % modulus;
    word = primeS / modulus;
    }
    bool operator<(size_t limit) const
    {
    return word < limit;
    }
    void next(StoreType * store)
    {
    auto const & mask = mMaskLookup[mod];
    if (mask)
    store[word] |= mask;
    mod += modstep;
    if (mod >= modulus)
    {
    mod -= modulus;
    word += wordstep + 1;
    }
    else word += wordstep;
    }
    };
    std::vector<value> values;
    values.reserve(std::distance(begin, end));
    for (auto prime = begin; prime != end; ++prime)
    {
    values.emplace_back(*prime);
    }

    size_t limit = 0;
    do
    {
    limit = std::min(limit + cachesize, initSize + repSize);
    for (auto & value: values)
    {
    while (value < limit)
    value.next(mStorePtr);
    }
    }
    while (limit < initSize + repSize);

    return *this;
    }

    // duplicates the data up to the amount needed for a prime newSize
    void resize(PRIME newSize)
    {
    size_t sizeWords = (newSize / modulus) + 1;
    assert(sizeWords > (initSize + repSize)); // not allowed to shrink!
    mBuffer.resize(sizeWords);
    mStorePtr = *mBuffer;

    auto copySource = mStorePtr + initSize;
    do
    {
    auto copySize = std::min(sizeWords - repSize - initSize, repSize);
    auto dest = copySource + repSize;
    memcpy(dest, copySource, copySize * sizeof(StoreType));
    repSize += copySize;
    } while ((initSize + repSize) < sizeWords);
    }

    // returns true if this mask thinks value is prime.
    bool get(PRIME value) const
    {
    assert(value < modulus * mBuffer.size());

    if ((value <= 3) || (value == 5))
    return true;

    auto mod = value % modulus;
    auto mask = mMaskLookup[mod];
    if (mask == 0)
    return false;

    auto word = value / modulus;
    auto ret = mStorePtr[word] & mask;
    return !ret;
    }

    // Get the beginning of the bitmap.
    // Incrementing this iterator can continue indefinitely,
    // regardless of the bitmap size.
    MaskIterator begin() const
    {
    return MaskIterator(this);
    }

    size_t size() const
    {
    return initSize + repSize;
    }
    };

    template<> uint8_t const Masker<uint8_t, 30>::mMaskLookup[30] =
    {
    0, // 0
    0x01, // 1
    0, // 2
    0, // 3
    0, // 4
    0, // 5
    0, // 6
    0x02, // 7
    0, // 8
    0, // 9
    0, // 10
    0x04, // 11
    0, // 12
    0x08, // 13
    0, // 14
    0, // 15
    0, // 16
    0x10, // 17
    0, // 18
    0x20, // 19
    0, // 20
    0, // 21
    0, // 22
    0x40, // 23
    0, // 24
    0, // 25
    0, // 26
    0, // 27
    0, // 28
    0x80 // 29
    };

    typedef uint8_t StoreType;
    typedef Masker<StoreType> Mask;
    #ifndef tinyCount
    constexpr size_t tinyCount = 7;
    #endif
    #ifndef smallCount
    constexpr size_t smallCount = 10;
    #endif
    #ifndef bigCache
    constexpr size_t bigCache = 0x80000;
    #endif

    int main(int argc, char**argv)
    {
    // find out if the user asked for a different number of primes
    PRIME PrimeCount = 1e9;
    if (argc >= 2)
    {
    std::istringstream arg1(argv[1]);
    std::string crud;
    arg1 >> PrimeCount >> crud;
    if (!crud.empty() || PrimeCount < 3)
    {
    double isitfloat;
    arg1.seekg(0, arg1.beg);
    crud.clear();
    arg1 >> isitfloat >> crud;
    if (!crud.empty() || isitfloat < 3)
    {
    std::cerr << "Invalid first argument \"" << argv[1]
    << "\" Should be decimal number of primes required\n";
    exit(42);
    }
    else PrimeCount = isitfloat;
    }
    }

    std::string outName;
    if (argc >= 3)
    {
    outName = argv[2];
    }

    #ifdef FAST
    for (size_t bigCache = 0x80000; bigCache < 0x80001; ++bigCache)
    #endif
    {
    Mask primes;
    PRIME nextPrime;
    {
    nextPrime = 7;
    Mask tiny(nextPrime);
    std::vector<Mask> smallMasks;
    smallMasks.emplace_back(tiny);

    // Make a mask for the really small primes
    size_t count = 1; // allows for the 3
    for ( ; count < tinyCount; ++count)
    {
    do
    {
    nextPrime += 2;
    } while (!tiny.get(nextPrime));
    tiny = Mask(tiny, nextPrime);
    }

    // this limit is too small, but is easy to
    // calculate and big enough
    auto limit = nextPrime*nextPrime;

    smallMasks.clear();
    for (; count < smallCount; ++count)
    {
    do
    {
    nextPrime += 2;
    } while (!tiny.get(nextPrime));
    smallMasks.emplace_back(nextPrime);
    }
    assert(nextPrime <= limit && "If this triggers smallCount is too big and the code will give wrong results");

    // Make a single masker for all the small
    // primes. Don't make it bigger than needed.
    auto small = Mask(smallMasks.begin(), smallMasks.end(), PrimeCount / 30 + 1);
    primes = Mask(tiny, small, PrimeCount / 30 + 1);
    }

    // if the buffer isn't big enough yet expand it by
    // duplicating early entries.
    if (primes.size() < PrimeCount / 30 + 1)
    {
    primes.resize(PrimeCount);
    }

    do
    {
    std::vector<PRIME> quickies;
    PRIME quickMaxPrime = std::min(nextPrime*nextPrime, PRIME(sqrt(PrimeCount) + 1));
    do
    {
    do
    {
    nextPrime += 2;
    } while (!primes.get(nextPrime));
    quickies.emplace_back(nextPrime);
    } while (nextPrime < quickMaxPrime);

    // this function adds all the quickies into the
    // mask, but does all of them in one area of memory
    // before moving onto the next. The area of memory,
    // and the list of primes, both fit into the L1 cache.
    // You may need to adjust bigCache for best performance.
    primes.andequals(bigCache, quickies.begin(), quickies.end());
    } while (nextPrime*nextPrime < PrimeCount);

    std::cerr << "tinyCount " << tinyCount
    << " smallCount " << smallCount << " bigCache "
    << std::hex << "0x" << bigCache << std::endl;
    if (outName.length())
    {
    std::ofstream outfile(outName);
    outfile << "2\n";
    for (PRIME prime = 3; prime < PrimeCount; prime += 2)
    {
    if (primes.get(prime))
    {
    outfile << prime << '\n';
    }
    }
    }
    }
    return 0;
    }

  • From Tim Rentsch@21:1/5 to Vir Campestris on Mon Jul 1 23:20:27 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 19/06/2024 01:34, Tim Rentsch wrote:

    Would you mind if I asked you to post the code so I can
    take a look at it? Or if you would rather email it to
    me that also works.

    I've hacked out all my include files, added a GPL licence, and
    included build instructions. It follows in the signature block.
    600 lines :(

    I bet the line wrapping breaks it too.

    I got the code. There were some line wrappings but I just went
    through and fixed those by hand. After fixing the problem line
    wraps I was able to compile and run the code (I used std=c++17,
    which apparently works okay). I did a simple check to make sure
    the primes being found are right, which they do indeed seem to
    be. I haven't really experimented with changing the code; I
    only just ran it with various bounds on what primes to find,
    mostly to get a sense of the performance.

    I admit I don't understand the scheme the code is using. It
    looks to me like the code is doing divisions and remainders a
    lot, which my code basically doesn't do at all (it does do one
    division for each prime up to the square root of the limit of
    primes being sought, but that's all). I know the big template
    near the beginning is crucial to understanding what the code is
    doing, but I haven't been able to figure out what abstraction
    that template is supposed to represent. The core of my program
    is between 50 and 60 lines, and all pretty straightforward.

    Is there some suggestion you could make or a question you would
    like to ask? What can I do that might be helpful?

  • From Vir Campestris@21:1/5 to Tim Rentsch on Tue Jul 2 21:24:48 2024
    On 02/07/2024 07:20, Tim Rentsch wrote:
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 19/06/2024 01:34, Tim Rentsch wrote:

    Would you mind if I asked you to post the code so I can
    take a look at it? Or if you would rather email it to
    me that also works.

    I've hacked out all my include files, added a GPL licence, and
    included build instructions. It follows in the signature block.
    600 lines :(

    I bet the line wrapping breaks it too.

    I got the code. There were some line wrappings but I just went
    through and fixed those by hand. After fixing the problem line
    wraps I was able to compile and run the code (I used std=c++17,
    which apparently works okay). I did a simple check to make sure
    the primes being found are right, which they do indeed seem to
    be. I haven't really experimented with changing the code; I
    only just ran it with various bounds on what primes to find,
    mostly to get a sense of the performance.

    I admit I don't understand the scheme the code is using. It
    looks to me like the code is doing divisions and remainders a
    lot, which my code basically doesn't do at all (it does do one
    division for each prime up to the square root of the limit of
    primes being sought, but that's all). I know the big template
    near the beginning is crucial to understanding what the code is
    doing, but I haven't been able to figure out what abstraction
    that template is supposed to represent. The core of my program
    is between 50 and 60 lines, and all pretty straightforward.

    Is there some suggestion you could make or a question you would
    like to ask? What can I do that might be helpful?

    The mods and divs should _not_ be in the main path. It would be
    interesting to see your code. Or for that matter to see a performance
    report from it.

    The template is mostly just to allow me to experiment with the size of
    the stored data. That's not easy with the mod30 scheme, but with the
    original one I found the code ran faster when I did everything with
    uint64_t data.

    Perhaps I could use mod (8*30)!

    Something that has come to mind as an enhancement - at the moment for
    larger primes it thinks about all the odd numbers that are a multiple of
    the prime, and sets them in the mask. It doesn't need to do them all -
    for each prime there's a set of steps that can be done. For 7, for
    example, you start at 49, then add at each step 7 multiplied by 4 2 4 2
    4 6 2 6. That pattern of 8 repeats infinitely. It's the same for all the
    other primes, a pattern of 8.

    Andy

  • From Vir Campestris@21:1/5 to Tim Rentsch on Wed Jul 3 11:25:15 2024
    On 02/07/2024 07:20, Tim Rentsch wrote:
    I got the code. There were some line wrappings but I just went
    through and fixed those by hand. After fixing the problem line
    wraps I was able to compile and run the code (I used std=c++17,
    which apparently works okay). I did a simple check to make sure
    the primes being found are right, which they do indeed seem to
    be. I haven't really experimented with changing the code; I
    only just ran it with various bounds on what primes to find,
    mostly to get a sense of the performance.

    I admit I don't understand the scheme the code is using. It
    looks to me like the code is doing divisions and remainders a
    lot, which my code basically doesn't do at all (it does do one
    division for each prime up to the square root of the limit of
    primes being sought, but that's all). I know the big template
    near the beginning is crucial to understanding what the code is
    doing, but I haven't been able to figure out what abstraction
    that template is supposed to represent. The core of my program
    is between 50 and 60 lines, and all pretty straightforward.

    Is there some suggestion you could make or a question you would
    like to ask? What can I do that might be helpful?

    I checked. The modulus operations (%) are easy to check for.

    There are two for each prime number. Not for each multiple of it, but
    for the primes themselves.

    And there's one in the "get" function that checks to see if a number is
    prime.

    The complexity is at least partly due to an optimisation.

    You'll recall perhaps that Bonita's original code preset the entire
    bitmap to aa - which is faster than setting all the numbers divisible by
    2. I pointed out that you don't even need to set them, and set off to
    write code that stored odd numbers only.

    But then I realised that if you store the mask for 3 only you get a
    pattern. It has one odd word at the beginning, then a repeating pattern
    of 3 words. It doesn't matter what the word size is.

    This is equally true of 5, 7, 11 and the first few primes. It's also
    true if you set the mask up for 3 and 5 together - except this time the
    repeat length is 15.

    And it's far faster to duplicate those 3, or 15, or 3*5*7, words than to
    do each prime individually.

    There comes a point where it isn't worth it though - the repeat length
    for all the primes up to 31 is 1.5E12.

    Exactly when it stops being worthwhile isn't obvious.

    Then you came along with the mod30 suggestion. There is still a repeat
    length though, and it's still worth merging masks for small primes
    together rather than doing each one individually. But the code is
    bigger. MUCH bigger.

    Andy

  • From Tim Rentsch@21:1/5 to Vir Campestris on Mon Jul 15 06:15:06 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 02/07/2024 07:20, Tim Rentsch wrote:

    I got the code. There were some line wrappings but I just went
    through and fixed those by hand. After fixing the problem line
    wraps I was able to compile and run the code (I used std=c++17,
    which apparently works okay). I did a simple check to make sure
    the primes being found are right, which they do indeed seem to
    be. I haven't really experimented with changing the code; I
    only just ran it with various bounds on what primes to find,
    mostly to get a sense of the performance.

    I admit I don't understand the scheme the code is using. It
    looks to me like the code is doing divisions and remainders a
    lot, which my code basically doesn't do at all (it does do one
    division for each prime up to the square root of the limit of
    primes being sought, but that's all). I know the big template
    near the beginning is crucial to understanding what the code is
    doing, but I haven't been able to figure out what abstraction
    that template is supposed to represent. The core of my program
    is between 50 and 60 lines, and all pretty straightforward.

    Is there some suggestion you could make or a question you would
    like to ask? What can I do that might be helpful?

    I checked. The modulus operations (%) are easy to check for.

    There are two for each prime number. Not for each multiple of it, but
    for the primes themselves.

    And there's one in the "get" function that checks to see if a number
    is prime.

    The complexity is at least partly due to an optimisation.

    It's taken me a while to get into this. I think I understand
    your code fairly well now. My idea is to respond in pieces,
    starting with the simpler aspects first.

    You'll recall perhaps that Bonita's original code preset the entire
    bitmap to aa - which is faster than setting all the numbers divisible
    by 2. I pointed out that you don't even need to set them, and set off
    to write code that stored odd numbers only.

    But then I realised that if you store the mask for 3 only you get a
    pattern. It has one odd word at the beginning, then a repeating
    pattern of 3 words. It doesn't matter what the word size is.

    This is equally true of 5, 7, 11 and the first few primes. It's also
    true if you set the mask up for 3 and 5 together - except this time
    the repeat length is 15.

    And it's far faster to duplicate those 3, or 15, or 3*5*7, words than
    to do each prime individually.

    There comes a point where it isn't worth it though - the repeat length
    for all the primes up to 31 is 1.5E12.

    Exactly when it stops being worthwhile isn't obvious.

    Then you came along with the mod30 suggestion. There is still a repeat length though, and it's still worth merging masks for small primes
    together rather than doing each one individually. But the code is
    bigger. MUCH bigger.

    I am assuming a mod30 representation in all of my followup
    comments. Just to be explicit, there is an array of 8-bit
    bytes, with bytes[i] indicating primeness or compositeness
    for the 8 numbers 30*i + { 1, 7, 11, 13, 17, 19, 23, 29 },
    one bit for each of those 8 possibilities. (There may be
    optimizations if the array elements are 64-bit quantities
    rather than 8-bit quantities, but let's ignore that for
    now.)

    There are two plausible strategies for "pre-setting" the
    multiples of small primes. The first strategy uses two stages.
    The first stage makes a mask for the four primes 7, 11, 13, and
    17, which has 17017 elements, and simply replicates that mask
    over and over through the entire array. The second stage makes a
    mask for the three primes 19, 23, and 29, which has 12673
    elements, and uses bitwise operations to combine that mask
    throughout the primes array. Incidentally the convention I chose
    is that a 1 bit means the number is definitely composite, so the
    array could be initialized with all zero bits.

    The second strategy makes a mask for all seven of 7, 11, 13, 17,
    19, 23, and 29, which has 215656441 elements. This mask is then
    copied over and over throughout the array. Because 215656441 is
    quite a bit larger than your L1 cache number, it might be a win
    to divide the mask into strips of size (L1 cache / 2), and copy
    each strip separately, to get better cache performance.

    Note that for the "setting" step (that is, the second strategy
    or the first stage of the first strategy), the mask can be formed
    in the initial part of the prime/composite array. After taking
    care of all multiples of the seven primes in the initial byte,
    the initial byte should be set to indicate that those numbers
    are prime, since otherwise they would be set wrongly.

    I haven't experimented to see which of the above ideas has better
    performance. Like you say it's the small primes that matter,
    because they have so many multiples, and it isn't obvious where
    the cutoff should be. But it's clear that every prime in the
    first byte is worth doing, so you might try playing around with
    just that part to see which approach does better. I suggest
    using a primes array that is as large as the memory on your test
    system can accommodate without swapping.

    I might be able to write followon segments at the rate of one per
    day or so. I'm not expecting to do them any faster than that,
    and very likely will miss some days because of other activities.

  • From Tim Rentsch@21:1/5 to Vir Campestris on Sat Jul 20 07:41:18 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    On 02/07/2024 07:20, Tim Rentsch wrote:

    I got the code. [...]


    I checked. The modulus operations (%) are easy to check for.

    There are two for each prime number. Not for each multiple of it, but
    for the primes themselves.

    And there's one in the "get" function that checks to see if a number
    is prime. [...]

    And one divide for each modulus.

    Both divide and mod can be simplified if we use a different
    representation for numbers. The idea is to separate the
    number into a mod 30 part and divided by 30 part. An
    arbitrary number N can be split into an ordered pair

    N/30 , N%30

    stored in, let's say, an 8 byte quantity with seven bytes
    for N/30 and one byte for N%30. Let me call these two
    parts i and a for one number N, and j and b for a second
    number M. Then to form N*M we need to find

    (i*30+a) * (j*30+b)

    which is

    (i*30*j*30) + (i*30*b) + (j*30*a) + a*b

    Of course we want to take the product and express it in
    the form (product/30, product%30), which can be done by

    30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

    The two numbers a and b are both under 30, so a*b/30 and
    a*b%30 can be done by lookup in a small array.

    30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

    Now all operations can be done using just multiplies and
    table lookups. Dividing by 30 is just taking the first
    component; remainder 30 is just taking the second component.

    One further refinement: for numbers relatively prime to 2,
    3, and 5, there are only 8 possibles, so the modulo 30
    component can be reduced to just three bits. That makes the
    tables smaller, at a cost of needing a table lookup to find
    an 'a' or 'b' part.

    Does this all make sense?

  • From Tim Rentsch@21:1/5 to All on Tue Jul 23 07:34:14 2024
    Vir Campestris <vir.campestris@invalid.invalid> writes:

    [...]

    I was hoping for some feedback, even if very brief,
    before I continue.
