I've parallelized my Sieve of Eratosthenes across all cores. The parallel
threads do all the punching of the non-prime multiples beyond sqrt( max
). I found that the algorithm is limited by memory bandwidth, since
the bit-range between sqrt( max ) and max is too large to fit inside
the cache. So I partitioned each thread's work into a number of bit-ranges
that fit inside the L2 cache. With that I got a speedup of a factor of 28
when calculating about the first 210 million primes, i.e. all primes
<= 2^32.
On my Zen4 16-core PC under Windows with clang-cl 16.0.5, producing the
primes without any file output takes only 0.28s. On my Linux PC, a Zen2
64-core CPU, producing the same number of primes takes about 0.09s.
The file output is done with my own buffering. With that, writing the
mentioned prime number range results in a 2.2GB file, which is written
in about two seconds with a PCIe 4.0 SSD.
The first parameter is the highest prime candidate to be generated;
it can be prefixed with 0x for hex values. The second parameter is the
name of the output file; it can be "" to suppress file output. The
third parameter is the number of punching threads; if it is left out,
it defaults to the number of threads reported by the runtime.
On 10.12.2023 at 22:48, Vir Campestris wrote:
- You don't need to store the odd numbers
All primes except 2 are odd.
- You only need a bitmask to save the sieve.
I'm using the "scan" lambda, which searches through the sieve
as fast as possible with countr_zero, which maps to a single CPU
instruction on modern processors.
... reading that right you've assumed that unsigned is the same
size as size_t, and that the size is 64 bits.
I've got a type word_t that is a size_t, but it can also
be a uint8_t to test what's the performance difference.
Well, I went away and tried it. I didn't write anything nearly as sophisticated as yours - I didn't worry about threads and caching. The
code follows. You'll have to unwrap it in a few places.
It occurred to me I could go further - not just optimise it out to store
odd numbers only, but also miss out the ones divisible by 3, or other
higher primes. The results were:
- Storing all numbers: 9.494 seconds to run the sieve
- Storing odd numbers only: 4.455. That's over twice as fast.
- Excluding also the ones divisible by 3: 3.468. That's an improvement,
but not as much as I expected. Perhaps it's got too complicated. I don't
have a high powered PC.
On 11.12.2023 at 18:12, Vir Campestris wrote:
Because all primes except 2 are odd you don't need to store the _even_
numbers, which is what I meant to say. Or else you only need to store
the odd ones.
I'll try that in the next few days; but I don't expect a significant change.
Your constants are 64 bit hexadecimal numbers (from counting the
digits) and are unsigned (the u suffix) but there's no guarantee that
size_t will be the same size as unsigned, nor that it will be 64 bit.
The constants are wide enough for 64 bit size_t-s, but it won't make
a functional difference if you define word_t as a uint8_t. With a
uint8_t word_t the code runs about 1/6 slower; that's less than
I expected.
On 11/12/2023 17:19, Bonita Montero wrote:
On 11.12.2023 at 18:12, Vir Campestris wrote:
Because all primes except 2 are odd you don't need to store the
_even_ numbers, which is what I meant to say. Or else you only need
to store the odd ones.
I'll try that in the next few days; but I don't expect a significant change.
Well, I went away and tried it. I didn't write anything nearly as
sophisticated as yours - I didn't worry about threads and caching. The
code follows. You'll have to unwrap it in a few places.
It occurred to me I could go further - not just optimise it out to store
odd numbers only, but also miss out the ones divisible by 3, or other
higher primes. The results were:
- Storing all numbers: 9.494 seconds to run the sieve
- Storing odd numbers only: 4.455. That's over twice as fast.
- Excluding also the ones divisible by 3: 3.468. That's an improvement,
but not as much as I expected. Perhaps it's got too complicated. I don't
have a high powered PC.
Now the code runs in 0.15s on the 16 core Zen4 system; that's 87% faster.
On the 64 core Zen2 system the time to produce the first 210 million
primes is about 0.05 seconds.
On 12/13/2023 7:16 AM, Vir Campestris wrote:
On 11/12/2023 17:19, Bonita Montero wrote:
On 11.12.2023 at 18:12, Vir Campestris wrote:
Because all primes except 2 are odd you don't need to store the
_even_ numbers, which is what I meant to say. Or else you only
need to store the odd ones.
I'll try that in the next few days; but I don't expect a significant change.
Well, I went away and tried it. I didn't write anything nearly as
sophisticated as yours - I didn't worry about threads and
caching. The code follows. You'll have to unwrap it in a few places.
It occurred to me I could go further - not just optimise it out to
store odd numbers only, but also miss out the ones divisible by 3,
or other higher primes. The results were:
- Storing all numbers: 9.494 seconds to run the sieve
- Storing odd numbers only: 4.455. That's over twice as fast.
- Excluding also the ones divisible by 3: 3.468. That's an
improvement, but not as much as I expected. Perhaps it's got too
complicated. I don't have a high powered PC.
Another way to attempt to optimize is that except for 2 and 3, all
primes are of the form 6n+1 or 6n-1.
On my system my code takes about 20 seconds to produce 1e9
primes. [...]
A more compact form is to consider numbers mod 30, in which case
numbers that are 0 mod {2, 3, or 5} can be ignored. Conveniently
there are just 8 such values - 1, 7, 11, 13, 17, 19, 23, and 29 -
so each block of 30 can be represented in one 8-bit byte. Doing
that gives a 25% reduction in space compared to a 6n+/-1 scheme.
Vir Campestris <vir.campestris@invalid.invalid> writes:
On my system my code takes about 20 seconds to produce 1e9
primes. [...]
Do you mean 1e9 primes or just the primes less than 1e9? To
do the first 1e9 primes a sieve would need to go up to about
23.9e9 (so half that many bits if only odd numbers were
represented).
On 23/12/2023 18:30, Tim Rentsch wrote:
A more compact form is to consider numbers mod 30, in which case
numbers that are 0 mod {2, 3, or 5} can be ignored. Conveniently
there are just 8 such values - 1, 7, 11, 13, 17, 19, 23, and 29 -
so each block of 30 can be represented in one 8-bit byte. Doing
that gives a 25% reduction in space compared to a 6n+/-1 scheme.
I found that on my system the modulo operation was so slow this wasn't
worth doing.
On 23/12/2023 18:21, Tim Rentsch wrote:
Vir Campestris <vir.campestris@invalid.invalid> writes:
On my system my code takes about 20 seconds to produce 1e9
primes. [...]
Do you mean 1e9 primes or just the primes less than 1e9? To
do the first 1e9 primes a sieve would need to go up to about
23.9e9 (so half that many bits if only odd numbers were
represented).
Primes up to 1e9.
I have another idea though, watch this space...
On 12/26/2023 1:27 AM, Bonita Montero wrote:
On 26.12.2023 at 06:39, Chris M. Thomasson wrote:
Remember when Intel first started hyperthreading and god damn threads
could false share with each other (on the stacks!) the low and high 64
byte parts of the 128 byte cache lines? A workaround was to
artificially offset the threads stacks via alloca.
If you have one core and two threads there's no false sharing.
So, are you familiar with Intel's early hyper threading problem? There
was false sharing between the hyperthreads. The workaround did improve
performance by quite a bit. IIRC, my older appcore project had this
workaround incorporated into its logic. I wrote that sucker back in the
very early 2000s. Humm... I will try to find the exact line.
On 27.12.2023 at 06:59, Chris M. Thomasson wrote:
Something about false interference between threads that should not even
be interacting with one another to begin with. It was a problem. The fix
was based on alloca, so that is something to ponder on.
Like with any SMT core there could be cache-thrashing between the
threads. The L1 data cache was only 8kB, and maybe only two-way
associative; that could be thrashing between the threads. But I've no
clue what this would have to do with alloca().
On 29.12.2023 at 05:58, Chris M. Thomasson wrote:
On 12/28/2023 7:17 PM, Bonita Montero wrote:
On 29.12.2023 at 00:38, Chris M. Thomasson wrote:
The use of alloca to try to get around the problem in their
(Intel's) early hyperthreaded processors was real, and actually
helped. It was in the Intel docs.
I don't believe it.
Why not?
Because I don't understand what's different with the access pattern
of alloca() and usual stack allocation. And if I google for "alloca
Pentium 4 site:intel.com" I can't find anything that fits.
Depending on how the code is written, no modulo operations need
to be done, because they will be optimized away and done at
compile time. If we look at multiplying two numbers represented
by bits in bytes i and j, the two numbers are
i*30 + a
j*30 + b
for some a and b in { 1, 7, 11, 13, 17, 19, 23, 29 }.
The values we're interested in are the index of the product and
the residue of the product, namely
(i*30+a) * (j*30+b) / 30 (for the index)
(i*30+a) * (j*30+b) % 30 (for the residue)
Any term with a *30 in the numerator doesn't contribute to
the residue, and also can be combined with the by-30 divide
for computing the index. Thus these expressions can be
rewritten as
i*30*j + i*b + j*a + (a*b/30) (for the index)
a*b%30 (for the residue)
When a and b have values that are known at compile time,
neither the divide nor the remainder result in run-time
operations being done; all of that heavy lifting is
optimized away and done at compile time. Of course there
are some multiplies, but they are cheaper than divides, and
also can be done in parallel. (The multiplication a*b also
can be done at compile time.)
The residue needs to be turned into a bit mask to do the
logical operation on the byte of bits, but here again that
computation can be optimized away and done at compile time.
Does that all make sense?
On 12/29/2023 2:09 PM, Chris M. Thomasson wrote:
On 12/29/2023 8:01 AM, Scott Lurndal wrote:
page 2-35 in the _Intel Pentium 4 Processor Optimization_
manual.
I think it was in chapter 5 of Developing Multithreaded Applications: A
Platform Consistent Approach; I cannot remember the damn section right now.
Wait a minute! I might have found it, let's see:
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Ahhh section 5.3! Nice! I read this a while back, before 2005.
On 29.12.2023 at 23:12, Chris M. Thomasson wrote:
Wait a minute! I might have found it, let's see:
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Ahhh section 5.3! Nice! I read this a while back, before 2005.
What it says is nonsense if the cache is really four-way
associative like in the other paper mentioned here. And of
course that has nothing to do with false sharing.
On 30.12.2023 at 05:56, Kaz Kylheku wrote:
If it's four way associative, you start to have a performance problem as
soon as five things collide on it. ...
With four-way associativity that's rather unlikely.
On 29.12.2023 at 13:56, David Brown wrote:
There is nothing different from alloca() and ordinary stack
allocations. But alloca() makes it quick and easy to make large
allocations, and to do so with random sizes (or at least sizes that
differ significantly between threads, even if they are running the
same code).
I've got my own class overflow_array<>, which is like an array<> and a
vector<>. If you append more than the internal array<> can handle, the
objects are moved to an internal vector. I think Boost's small_vector
is similar to that.
Without alloca(), you'd need to do something like arranging to call a
recursive function a random number of times before it then calls the
next bit of code in your thread. alloca() is simply far easier and
faster.
You've got strange ideas. alloca() has been completely removed from the
Linux kernel.
The point is that if there's a fixed upper limit you would
allocate you could allocate it always statically.
On 30.12.2023 at 06:51, Kaz Kylheku wrote:
Under four-way associativity, five isn't greater than four?
Two-way associativity would leave no space if both threads have
synchronous stack frames. With four-way associativity there's
not much likelihood of that happening.
On 12/29/2023 8:45 PM, Bonita Montero wrote:
On 29.12.2023 at 18:29, Kaz Kylheku wrote:
I explained it. The allocation is not used. When you call alloca(n),
the stack pointer moves by n bytes. If you then call a function,
its stack frame will be offset by that much (plus any alignment if
n is not aligned).
According to the paper Scott mentioned, the associativity of the
Pentium 4's L1 data cache is four. With that it's not necessary
to have such aliasing prevention.
Huh? Wow, you really need to write Intel a letter about it wrt their
older hyperthreaded processors! Although, it seems like you simply do
not actually _understand_ what is going on here...
Intel's suggestions for how to mitigate the problem in their earlier
hyperthreaded processors actually worked wrt improving performance. Keep
in mind this was a while back in 2004-2005. I am happy that the Wayback
Machine has my older code.
On 30.12.2023 at 21:35, Kaz Kylheku wrote:
My comment makes it clear that there are two thread stacks
vying for that cache line, plus a couple of other accesses
that are not thread stacks.
With four way associativity that's rather unlikely to happen.
On 31.12.2023 at 08:01, Kaz Kylheku wrote:
On 2023-12-31, Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 30.12.2023 at 21:35, Kaz Kylheku wrote:
My comment makes it clear that there are two thread stacks
vying for that cache line, plus a couple of other accesses
that are not thread stacks.
With four way associativity that's rather unlikely to happen.
What? The associativity of the cache is not the determiner
of program behavior; if the program accesses five different
areas whose addresses are the same modulo 65536 bytes, ...
The set size is smaller than 64kB and of course there can be aliasing,
but with four-way associativity it's unlikely that there's a need to
control aliasing. And if the set size were 64kB, aliasing would be
even less likely through page colouring.
On 29/12/2023 17:04, Bonita Montero wrote:
You've got strange ideas. alloca() has been completely removed from the
Linux kernel.
Citation? You are usually wrong in your claims, or at least mixed-up,
so I won't trust you without evidence. (That does not mean you are
wrong here - I don't know either way.)
Of course, avoiding alloca() within the kernel is, again, utterly
irrelevant to the point under discussion.
On 31.12.2023 at 19:44, Scott Lurndal wrote:
Why? ...
Because there must be set-conflicts on the same line index.
That's rather unlikely with four-way associativity.
David Brown <david.brown@hesbynett.no> writes:
On 29/12/2023 17:04, Bonita Montero wrote:
You've got strange ideas. alloca() has been completely removed from the
Linux kernel.
Citation? You are usually wrong in your claims, or at least mixed-up,
so I won't trust you without evidence. (That does not mean you are
wrong here - I don't know either way.)
Kernel threads generally run with a small, fixed stack. Stack-based
dynamic allocation is generally avoided. I don't know of any
hard restrictions, but I suspect that Linus would look askance
at it.
Of course, avoiding alloca() within the kernel is, again, utterly
irrelevant to the point under discussion.
Very true
On 06.01.2024 at 04:21, Chris M. Thomasson wrote:
Are you trying to tell me that the aliasing problem on those older Intel
hyperthreaded processors and the workaround (from Intel) was a myth?
lol. ;^)
Intel just made a nerd-suggestion. With four-way associativity
there's no frequent aliasing problem in the L1 data cache of the
Pentium 4.
On 1/7/2024 9:48 PM, Bonita Montero wrote:
On 07.01.2024 at 21:46, Chris M. Thomasson wrote:
I know that they had a problem and the provided workaround from Intel
really did help out. ...
Absolutely not, not with four way associativity.
Whatever you say; Sigh. I am done with this.
On 24/12/2023 08:36, Tim Rentsch wrote:
Depending on how the code is written, no modulo operations need
to be done, because they will be optimized away and done at
compile time. If we look at multiplying two numbers represented
by bits in bytes i and j, the two numbers are
i*30 + a
j*30 + b
for some a and b in { 1, 7, 11, 13, 17, 19, 23, 29 }.
The values we're interested in are the index of the product and
the residue of the product, namely
(i*30+a) * (j*30+b) / 30 (for the index)
(i*30+a) * (j*30+b) % 30 (for the residue)
Any term with a *30 in the numerator doesn't contribute to
the residue, and also can be combined with the by-30 divide
for computing the index. Thus these expressions can be
rewritten as
i*30*j + i*b + j*a + (a*b/30) (for the index)
a*b%30 (for the residue)
When a and b have values that are known at compile time,
neither the divide nor the remainder result in run-time
operations being done; all of that heavy lifting is
optimized away and done at compile time. Of course there
are some multiplies, but they are cheaper than divides, and
also can be done in parallel. (The multiplication a*b also
can be done at compile time.)
The residue needs to be turned into a bit mask to do the
logical operation on the byte of bits, but here again that
computation can be optimized away and done at compile time.
Does that all make sense?
Right now, no. But that's me. I'll flag it to read again when I've
had a better night's sleep.
When I compared the output there were differences. Specifically your
program claimed that these numbers
66049,67591,69133,69647,71189,72217,72731,75301,78899,79927,80441,81469,85067,86609,89179,
89693,90721,92263,94319,95861,97403,98431,99973,102029,103057,105113,107683,108197,110767,
111281,112823,113851,115393,117449,118477,118991,120019,123103,125159,126187,128243,129271,
130813,133897,134411,139037,140579,143149,144691,146233,146747,148289,150859,152401,153943,
154457,155999,157541,158569,159083,162167,164737,165251,166279,167821,169363,169877,172961,
173989,175531,177587,180157,182213,184783,186839,188381,189923,190951,193007,194549,195577,
197633,198661,202259,204829,207913,208427,210997,211511,212539,213053,215623,219221,220249,
220763,221791,225389,226417,226931,227959,233099,234127,236183,238753,240809,241837,243379,
244921,248519,249547,251089,252631,254687,256229,259313,260341,261883,262397,264967,265481,
267023,269593,270107,272677,273191,274733,279359,280387,280901,281929,283471,285013,287069,
288611,290153,295807,296321,298891,300947,303517,305059,306601,308657,311741,312769,314311,
315853
are all prime. They aren't. The first one is 257 squared, and the others
all have factors too.
I've been playing with this. Has it really been this long? I
ought to have more time than this... [...]
I missed Tim's Mod 30 trick, and that might well help. But I
think I'm bored with this. It would save a lot of memory, but
the extra computation might make it slower.
Maybe I'll try one day ;)
Vir Campestris <vir.campestris@invalid.invalid> writes:
I've been playing with this. Has it really been this long? I
ought to have more time than this... [...]
I missed Tim's Mod 30 trick, and that might well help. But I
think I'm bored with this. It would save a lot of memory, but
the extra computation might make it slower.
Two comments. Using the mod 30 encoding can be speed competitive
with simpler encodings. The second comment is, it isn't just
that less memory is used, but that more primes can be found.
Bonita brags about how fast some code is, but it's an apples
and oranges comparison - that code doesn't do the job because
it runs into a memory limit way sooner than using a mod 30
encoding, and trying to run past that limit would make the
code R E A L L Y S L O W.
Maybe I'll try one day ;)
I hope you do. If you have some difficulty seeing how to
avoid the seeming speed penalties, feel free to ask, I am
happy to help.
Allocating a zero filled buffer of enormous size with calloc is
surprisingly fast. I have a feeling (which I can't prove) that under the
hood Linux is allocating one zero filled page to me, and setting up the
page tables for the rest of the buffer as unwriteable. It then copies
the zero buffer whenever a write fault happens.
On 22/05/2024 03:06, Tim Rentsch wrote:
Vir Campestris <vir.campestris@invalid.invalid> writes:
I've been playing with this. Has it really been this long? I
ought to have more time than this... [...]
I missed Tim's Mod 30 trick, and that might well help. But I
think I'm bored with this. It would save a lot of memory, but
the extra computation might make it slower.
Two comments. Using the mod 30 encoding can be speed competitive
with simpler encodings. The second comment is, it isn't just
that less memory is used, but that more primes can be found.
Bonita brags about how fast some code is, but it's an apples
and oranges comparison - that code doesn't do the job because
it runs into a memory limit way sooner than using a mod 30
encoding, and trying to run past that limit would make the
code R E A L L Y S L O W.
Maybe I'll try one day ;)
I hope you do. If you have some difficulty seeing how to
avoid the seeming speed penalties, feel free to ask, I am
happy to help.
I think I've come to the end of what my code will do. [...]
Firstly the C++ bit:
std::vector<bool> is much too slow to be any use.
[various alternatives]
That's the end of the on-topic part.
I don't store even numbers. That's an optimisation, and is of course
equivalent to Tim's Mod 30 trick, but with mod 2 instead. And it
doesn't require a divide, which may be slow.
When I mask out just 3 I get a pattern like this:
2492492492492490
There is a single nybble at the end of that which is zero, but the
rest is a 3-nybble repeat. [similarly for 3 and 5, etc...]
However I can build a mask from the ones for 3 and 5, which has a
repeat of 15 (3*5)
6692cd259a4b3490,
96692cd259a4b349,496692cd259a4b34,3496692cd259a4b3,b3496692cd259a4b,
4b3496692cd259a4,a4b3496692cd259a,9a4b3496692cd259,59a4b3496692cd25,
259a4b3496692cd2,d259a4b3496692cd,cd259a4b3496692c,2cd259a4b3496692,
92cd259a4b349669,692cd259a4b34966,6692cd259a4b3496
and copy that.
In fact I can do that with any number of single prime masks I want.
[...]
Note that although this has been quite a fun thing to play with it's
had almost no C++ content. In fact I found that to get best
performance I had to drop down to C type code.
I kept a class to
manage my buffers though, it's much easier to drop an object that
manages a malloc-ed buffer than to remember exactly when I need to
call free. Correctness trumps speed.
Now Mod 30 - by storing the odd numbers only I'm halving the storage,
and I don't need to do an expensive divide. It's only shifts and ANDs.
If I use mod 30 (30 = 2*3*5) conveniently there are 8 candidate
numbers I need to store a mask for.
Sadly my computer likes 64 bit
operations better. But the required storage is now only 27% of the
original mask if all numbers are put in it.
I can carry on with this, but the storage doesn't have any more nice
sizes. [...]
Maybe I'll play with it. But the rain stopped, and I must get on
with the weeding!
On 30.05.2024 at 18:55, Paavo Helde wrote:
This might well be so (a standard COW page mapped initially to a
special zeroed page), but from what google tells me there are also
special kernel threads whose task is to zero out unused memory pages
so they can be given to new processes. ...
Any reference on that?
Vir Campestris <vir.campestris@invalid.invalid> writes:
On 22/05/2024 03:06, Tim Rentsch wrote:
Vir Campestris <vir.campestris@invalid.invalid> writes:
I've been playing with this. Has it really been this long? I
ought to have more time than this... [...]
I missed Tim's Mod 30 trick, and that might well help. But I
think I'm bored with this. It would save a lot of memory, but
the extra computation might make it slower.
Two comments. Using the mod 30 encoding can be speed competitive
with simpler encodings. The second comment is, it isn't just
that less memory is used, but that more primes can be found.
Bonita brags about how fast some code is, but it's an apples
and oranges comparison - that code doesn't do the job because
it runs into a memory limit way sooner than using a mod 30
encoding, and trying to run past that limit would make the
code R E A L L Y S L O W.
Maybe I'll try one day ;)
I hope you do. If you have some difficulty seeing how to
avoid the seeming speed penalties, feel free to ask, I am
happy to help.
I think I've come to the end of what my code will do. [...]
A few miscellaneous responses...
Firstly the C++ bit:
std::vector<bool> is much too slow to be any use.
[various alternatives]
I simply used a static array, more accurately a union of
two arrays (code here is approximate only):
#define LARGEST_N 2400000000000 // primes less than 2.4e12
union {
unsigned char bytes[ LARGEST_N / 30 ];
unsigned long long ulls[ LARGEST_N / 30 / 8 ];
} all;
That's the end of the on-topic part.
I don't store even numbers. That's an optimisation, and is of course
equivalent to Tim's Mod 30 trick, but with mod 2 instead. And it
doesn't require a divide, which may be slow.
Like I keep saying, keeping the bits in a mod 30 representation
doesn't need any divides (at least not in the main loop, where
multiples are being computed).
When I mask out just 3 I get a pattern like this:
2492492492492490
There is a single nybble at the end of that which is zero, but the
rest is a 3-nybble repeat. [similarly for 3 and 5, etc...]
In a mod 30 representation, the multiples of 7 repeat after 7
bytes. If we replicate those 7 bytes 8 times, we get 7 unsigned
long longs. Those 7 ULLs can be combined with logic operations into
the all.ulls[] array. Similarly for 11, 13, etc, for as high as is reasonable (I think I topped out at 29).
However I can build a mask from the ones for 3 and 5, which has a
repeat of 15 (3*5)
6692cd259a4b3490,
96692cd259a4b349,496692cd259a4b34,3496692cd259a4b3,b3496692cd259a4b,
4b3496692cd259a4,a4b3496692cd259a,9a4b3496692cd259,59a4b3496692cd25,
259a4b3496692cd2,d259a4b3496692cd,cd259a4b3496692c,2cd259a4b3496692,
92cd259a4b349669,692cd259a4b34966,6692cd259a4b3496
and copy that.
In fact I can do that with any number of single prime masks I want.
[...]
I thought about doing something along these lines but decided it
was too much work. So I just did the blocks of 7*8 bytes, 11*8
bytes, 13*8 bytes, etc, up to some fairly small prime, then
switched to an alternate method where individual bytes are
addressed and modified.
Note that although this has been quite a fun thing to play with it's
had almost no C++ content. In fact I found that to get best
performance I had to drop down to C type code.
Right. I used straight C, and I don't see any part of the
program that would benefit from having C++.
I kept a class to
manage my buffers though, it's much easier to drop an object that
manages a malloc-ed buffer than to remember exactly when I need to
call free. Correctness trumps speed.
I used static allocation for the big bitmap, and small local
arrays (less than 50 * 8 bytes) for the small temporaries.
No malloc()s or free()s to worry about. Initialized to zero
without any code needed.
Now Mod 30 - by storing the odd numbers only I'm halving the storage,
and I don't need to do an expensive divide. It's only shifts and ANDs.
If I use mod 30 (30 = 2*3*5) conveniently there are 8 candidate
numbers I need to store a mask for.
By special casing each of the possibilities (there are 8*8 == 64 of
them), only multiplies and adds are needed, and no shifts -- masks
are evaluated to compile-time constants.
Sadly my computer likes 64 bit
operations better. But the required storage is now only 27% of the
original mask if all numbers are put in it.
Once the primes get above a fairly small number, the density of
places to zap is low enough so that working in bytes is okay.
In particular once we are looking at primes above 240 there is
less than one bit per 64 bit ULL, which makes using bytes
rather than ULLs less painful.
I can carry on with this, but the storage doesn't have any more nice
sizes. [...]
Yeah. It's a happy coincidence that mod 30 needs 8 bits
per block-of-30, and unfortunately no other nice size in
sight for larger prime-set mods.
Maybe I'll play with it. But the rain stopped, and I must get on
with the weeding!
Now you have something to look forward to the next time
it rains. :)
On 31/05/2024 06:17, Tim Rentsch wrote:
Vir Campestris <vir.campestris@invalid.invalid> writes:
On 22/05/2024 03:06, Tim Rentsch wrote:
Vir Campestris <vir.campestris@invalid.invalid> writes:
I've been playing with this. Has it really been this long? I
ought to have more time than this... [...]
I missed Tim's Mod 30 trick, and that might well help. But I
think I'm bored with this. It would save a lot of memory, but
the extra computation might make it slower.
Two comments. Using the mod 30 encoding can be speed competitive
with simpler encodings. The second comment is, it isn't just
that less memory is used, but that more primes can be found.
Bonita brags about how fast some code is, but it's an apples
and oranges comparison - that code doesn't do the job because
it runs into a memory limit way sooner than using a mod 30
encoding, and trying to run past that limit would make the
code R E A L L YÂ S L O W.
Maybe I'll try one day ;)
I hope you do. If you have some difficulty seeing how to
avoid the seeming speed penalties, feel free to ask, I am
happy to help.
I think I've come to the end of what my code will do. [...]
A few miscellaneous responses...
Firstly the C++ bit:
std::vector<bool> is much too slow to be any use.
[various alternatives]
I simply used a static array, more accurately a union of
two arrays (code here is approximate only):
    #define LARGEST_N 2400000000000 // primes less than 2.4e12
    union {
        unsigned char      bytes[ LARGEST_N / 30 ];
        unsigned long long ulls[ LARGEST_N / 30 / 8 ];
    } all;
That's the end of the on-topic part.
I don't store even numbers. That's an optimisation, and is of course
equivalent to Tim's Mod 30 trick, but with mod 2 instead. And it
doesn't require an divide which may be slow.
Like I keep saying, keeping the bits in a mod 30 representation
doesn't need any divides (at least not in the main loop, where
multiples are being computed).
When I mask out just 3 I get a pattern like this:
2492492492492490
There is a single nybble at the end of that which is zero, but the
rest is a 3-nybble repeat. [similarly for 3 and 5, etc...]
In a mod 30 representation, the multiples of 7 repeat after 7
bytes. If we replicate those 7 bytes 8 times, we get 7 unsigned
long longs. Those 7 ULLs can be combined with logic operations into
the all.ulls[] array. Similarly for 11, 13, etc, for as high as is
reasonable (I think I topped out at 29).
However I can build a mask from the ones for 3 and 5, which has a
repeat of 15 (3*5)
6692cd259a4b3490, 96692cd259a4b349, 496692cd259a4b34, 3496692cd259a4b3,
b3496692cd259a4b, 4b3496692cd259a4, a4b3496692cd259a, 9a4b3496692cd259,
59a4b3496692cd25, 259a4b3496692cd2, d259a4b3496692cd, cd259a4b3496692c,
2cd259a4b3496692, 92cd259a4b349669, 692cd259a4b34966, 6692cd259a4b3496
and copy that.
In fact I can do that with any number of single prime masks I want.
[...]
I thought about doing something along these lines but decided it
was too much work. So I just did the blocks of 7*8 bytes, 11*8
bytes, 13*8 bytes, etc, up to some fairly small prime, then
switched to an alternate method where individual bytes are
addressed and modified.
Note that although this has been quite a fun thing to play with it's
had almost no C++ content. In fact I found that to get best
performance I had to drop down to C type code.
Right. I used straight C, and I don't see any part of the
program that would benefit from having C++.
I kept a class to
manage my buffers, though; it's much easier to drop an object that
manages a malloc-ed buffer than to remember exactly when I need to
call free. Correctness trumps speed.
I used static allocation for the big bitmap, and small local
arrays (less than 50 * 8 bytes) for the small temporaries.
No malloc()s or free()s to worry about. Initialized to zero
without any code needed.
Now Mod 30 - by storing the odd numbers only I'm halving the storage,
and I don't need to do an expensive divide. It's only shifts and ANDs.
If I use mod 30 (30 = 2*3*5) conveniently there are 8 candidate
numbers I need to store a mask for.
By special casing each of the possibilities (there are 8*8 == 64 of
them), only multiplies and adds are needed, and no shifts -- masks
are evaluated to compile-time constants.
Sadly my computer likes 64 bit
operations better. But the required storage is now only 27% of the
original mask if all numbers are put in it.
Once the primes get above a fairly small number, the density of
places to zap is low enough so that working in bytes is okay.
In particular, one 64-bit ULL covers 8 bytes * 30 = 240 numbers, so
once we are looking at primes above 240 there is less than one bit
to zap per ULL, which makes using bytes rather than ULLs less painful.
One C++ bit I did use was to template all my classes with the store
type, so I could run with 8, 32 or 64 bits. 64 was fastest. I didn't
bother with 16. The store access is of course a whole cache line
under the hood.
I can carry on with this, but the storage doesn't have any more nice
sizes. [...]
Yeah. It's a happy coincidence that mod 30 needs 8 bits
per block-of-30, and unfortunately no other nice size in
sight for larger prime-set mods.
Maybe I'll play with it. But the rain stopped, and I must get on
with the weeding!
Now you have something to look forward to the next time
it rains. :)
I'm missing something.
When setting the bits for a multiple of a reasonably large prime what I
do is:
(1) Shift right 1 to drop the last bit.
(2) The remaining top bits tell me which word needs to be accessed. I
shift the multiple right to get the index of the word.
(3) The remaining bottom bits tell me which bit in the word needs to be
set, so I shift 1 left by those bottom bits to get a mask.
My operation (2) is a divide, by a power of 2.
Take 997 as an example:
multiple   mod30   div30
997        7       33
1994       14      66
2991       21      99
3988       28      132
4985       5       166
5982       12      199
6979       19      232
7976       26      265
8973       3       299
9970       10      332
10967      17      365
11964      24      398
12961      1       432
13958      8       465
14955      15      498
15952      22      531
16949      29      564
17946      6       598
18943      13      631
19940      20      664
20937      27      697
21934      4       731
22931      11      764
23928      18      797
24925      25      830
25922      2       864
26919      9       897
27916      16      930
28913      23      963
29910      0       997
30907      7       1030
I don't see how you get that last column without a divide by 30.
Andy
On 31/05/2024 06:17, Tim Rentsch wrote:
Vir Campestris <vir.campestris@invalid.invalid> writes:
I don't store even numbers. That's an optimisation, and is of
course equivalent to Tim's Mod 30 trick, but with mod 2 instead.
And it doesn't require a divide, which may be slow.
Like I keep saying, keeping the bits in a mod 30 representation
doesn't need any divides (at least not in the main loop, where
multiples are being computed).
I'm missing something.
When setting the bits for a multiple of a reasonably large prime
what I do is:
(1) Shift right 1 to drop the last bit.
(2) The remaining top bits tell me which word needs to be accessed.
I shift the multiple right to get the index of the word.
(3) The remaining bottom bits tell me which bit in the word needs
to be set, so I shift 1 left by those bottom bits to get a mask.
My operation (2) is a divide, by a power of 2.
Take 997 as an example:
multiple mod30 div30
997 7 33
1994 14 66
2991 21 99
3988 28 132
4985 5 166
5982 12 199
6979 19 232
7976 26 265
8973 3 299
9970 10 332
[etc]
I don't see how you get that last column without a divide by 30.
Pseudocode now looks something like this:
for each index i in the table:
    for each bit in table[i]:
        if that bit has been marked composite: continue
        // we have found the next prime p
        for each index j > i in the table:
            for each bit in table[j]:
                if that bit has been marked composite: continue
                // we have found a multiplier m and
                // need to mark p*m as composite
            end
        end
    end
end
OK, I'm getting some weird results.
Firstly, whatever I do my original algorithm storing odds only is
faster than using mod30. Quite a bit faster. I think this is because
mod30 requires it to access each byte in the map individually, rather
than eight at a time.
How much faster is where I got weird results.
For punching the large primes I work in cache-sized blocks.
I go through each of the primes, adding a modulo and a word index to
get the bit that needs to be set next. When the modulo overflows the
word index gets incremented too.
I do that for a block of memory, then repeat for the next prime, and
so on until I've done all the primes. I then go on to the next block
of memory, which is sized to fit in the cache.
Rather than do some arithmetic I thought I'd just run through a loop
with different sizes for that memory block, and pick the best
one. That's a bit under a megabyte.
I then remove the loop, instead having a constexpr for the size.
The code now runs a third slower.
I can have the loop go around exactly once, and it is still way faster
than the hard coded constant!
Would you mind if I asked you to post the code so I can
take a look at it? Or if you would rather email it to
me that also works.
On 19/06/2024 01:34, Tim Rentsch wrote:
Would you mind if I asked you to post the code so I can
take a look at it? Or if you would rather email it to
me that also works.
I've hacked out all my include files, added a GPL licence, and
included build instructions. It follows in the signature block.
600 lines :(
I bet the line wrapping breaks it too.
Vir Campestris <vir.campestris@invalid.invalid> writes:
On 19/06/2024 01:34, Tim Rentsch wrote:
Would you mind if I asked you to post the code so I can
take a look at it? Or if you would rather email it to
me that also works.
I've hacked out all my include files, added a GPL licence, and
included build instructions. It follows in the signature block.
600 lines :(
I bet the line wrapping breaks it too.
I got the code. There were some line wrappings but I just went
through and fixed those by hand. After fixing the problem line
wraps I was able to compile and run the code (I used std=c++17,
which apparently works okay). I did a simple check to make sure
the primes being found are right, which they do indeed seem to
be. I haven't really experimented with changing the code; I
only just ran it with various bounds on what primes to find,
mostly to get a sense of the performance.
I admit I don't understand the scheme the code is using. It
looks to me like the code is doing divisions and remainders a
lot, which my code basically doesn't do at all (it does do one
division for each prime up to the square root of the limit of
primes being sought, but that's all). I know the big template
near the beginning is crucial to understanding what the code is
doing, but I haven't been able to figure out what abstraction
that template is supposed to represent. The core of my program
is between 50 and 60 lines, and all pretty straightforward.
Is there some suggestion you could make or a question you would
like to ask? What can I do that might be helpful?
On 02/07/2024 07:20, Tim Rentsch wrote:
I got the code. There were some line wrappings but I just went
through and fixed those by hand. After fixing the problem line
wraps I was able to compile and run the code (I used std=c++17,
which apparently works okay). I did a simple check to make sure
the primes being found are right, which they do indeed seem to
be. I haven't really experimented with changing the code; I
only just ran it with various bounds on what primes to find,
mostly to get a sense of the performance.
I admit I don't understand the scheme the code is using. It
looks to me like the code is doing divisions and remainders a
lot, which my code basically doesn't do at all (it does do one
division for each prime up to the square root of the limit of
primes being sought, but that's all). I know the big template
near the beginning is crucial to understanding what the code is
doing, but I haven't been able to figure out what abstraction
that template is supposed to represent. The core of my program
is between 50 and 60 lines, and all pretty straightforward.
Is there some suggestion you could make or a question you would
like to ask? What can I do that might be helpful?
I checked. The modulus operations (%) are easy to check for.
There are two for each prime number. Not for each multiple of it, but
for the primes themselves.
And there's one in the "get" function that checks to see if a number
is prime.
The complexity is at least partly due to an optimisation.
You'll recall perhaps that Bonita's original code preset the entire
bitmap to 0xaa - which is faster than setting all the numbers divisible
by 2. I pointed out that you don't even need to set them, and set off
to write code that stored odd numbers only.
But then I realised that if you store the mask for 3 only you get a
pattern. It has one odd word at the beginning, then a repeating
pattern of 3 words. It doesn't matter what the word size is.
This is equally true of 5, 7, 11 and the first few primes. It's also
true if you set the mask up for 3 and 5 together - except this time
the repeat length is 15.
And it's far faster to duplicate those 3, or 15, or 3*5*7, words than
to do each prime individually.
There comes a point where it isn't worth it though - the repeat length
for all the primes up to 31 is 1.5E12.
Exactly when it stops being worthwhile isn't obvious.
Then you came along with the mod30 suggestion. There is still a
repeat length though, and it's still worth merging masks for small
primes together rather than doing each one individually. But the code
is bigger. MUCH bigger.