Finding Primes Faster (Sieve of Eratosthenes)


#13 GordonBGoodness

Posted 25 June 2012 - 10:50 PM

There are other simpler solutions... For example, the Sieve of Eratosthenes is good for numbers under 10 mil.
This piece of code calculates 1 mil. primes in 1.29 sec in a for loop.

bool IsAPrime(int n)
{
	for(int i(2); i<=sqrt(n); i++)
		if(n%i == 0) return true;
	return false;
}

But this one does it in approximately 0.9 sec...

int EratosthenesSieve (int upper, int array[])
{
   /* the function takes the upper limit to which we test the numbers in the arguments, as well as the array in which we store the numbers,
   but it returns the exact number of primes */
	int size=0;
	int i;
	int sieve[18000]; // enough for 2000 primes....
	memset(sieve,0,sizeof sieve);
	int tmp;
	for(int i(2); i<=upper; i++)
	{
		if(!sieve[i])
		{
			array[size++] = i;
			tmp = i+i;
			while(tmp<=upper)
			{
				sieve[tmp] = 1;
				tmp+=i;
			}
		}
	}
return size;
}

Tested on an Intel Dual Core at 2.56 GHz with 2GB RAM
There are better ways to do it, but these are the simplest...


Yes, there are a lot of different ways of sieving for primes, but as the title says, the subject of this thread is the "Sieve of Eratosthenes", which is actually likely the best method up to a range of about 10 billion, not 10 million as you state.

Your first example is both incomplete and incredibly inefficient as to time performance, since it is actually a Prime Sieve by Trial Division, which is often mistakenly referred to as the Sieve of Eratosthenes, as per the OP's Wikipedia link which references this method: http://en.wikipedia....of_Eratosthenes. The trial division by all numbers up to the square root of the test number is implied by the '%' modulo operation. Although Prime Sieve by Trial Division is very simple in terms of lines of code required, it is very computationally complex, meaning that the CPU works very hard at computations and the number of those computations grows much faster than linearly with greater ranges. In fact, your logic doesn't match your "IsAPrime" function name, as you return 'true' when the trial division modulo is zero, showing that the test number can be factored, which should mean that the test number is not a prime even though this says IsAPrime = true. The full implementation of this would look something like the following, just counting the number of primes up to the range:
bool IsAPrime(int n)
{
	for(int i(2); i<=sqrt(n); i++)
		if(n%i == 0) return false;
	return true;
}

//must call the above over a range in order to do something

#define MAX_NUM 1000000

int count = 0;
for (int i = 2; i <= MAX_NUM; ++i)
	if (IsAPrime(i))
		++count;

Now, perhaps due to errors, it seems that 1.29 seconds with a 2.56 Gigahertz CPU is fairly slow at this, as my Intel i7-2700K at 3.5 Gigahertz takes just 0.265 seconds to get the correct count of 78498 with full optimization enabled in release mode. But perhaps the i7 performs integer divisions faster, as that is mostly how the routine is spending its time (trial division, remember), or perhaps you were running in a debug mode without full optimization turned on.
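
For anyone who wants to reproduce such timings, a minimal sketch of a timing harness using the standard clock() function might look like the following (my own fragment, not from the original posts; exact numbers will of course vary with CPU and compiler settings):

#include <cstdio>
#include <cmath>
#include <ctime>

#define MAX_NUM 1000000

bool IsAPrime(int n) //the corrected trial division version from above
{
	for (int i(2); i <= sqrt((double)n); i++)
		if (n % i == 0) return false;
	return true;
}

int main()
{
	clock_t start = clock();
	int count = 0;
	for (int i = 2; i <= MAX_NUM; ++i)
		if (IsAPrime(i))
			++count;
	double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
	printf("%d primes found in %.3f seconds\n", count, elapsed);
	return 0;
}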

Your second example is indeed the true Sieve of Eratosthenes (unlike the OP's version, which uses multiplication to determine the steps across the composite numbers to cull). However, your routine also doesn't seem to do the job of sieving for primes up to a million, as it has a fixed array size of 18000 rather than a million plus one (1,000,001); thus, I fail to see how you used this routine for timing. In fact, if you try to run this with a local array size of 1,000,001 four-byte integers, or even that many one-byte chars, it will fail due to trying to use too much stack space (four Megabytes/one Megabyte), and one will have to create the array as a global variable or malloc()/new it into heap space in order to get the routine to run. Also, I fail to see how your routine is any "simpler" than the last of my postings above for the Sieve of Eratosthenes, which I will reformat to have a similar calling signature as your routine, as follows:
int EratosthenesSieve (int upper, unsigned char sieve[], int array[])
{
   /* the function takes the upper limit to which we test the numbers in the arguments, the sieving array, and the array in which we store the found primes,
   and it returns the exact number of primes found */
  int num=0;
  memset(sieve, 1, upper + 1); //make the logic so true means prime (sized from the upper argument)
  for (int i(2); i <= upper; ++i)
  {
	if(sieve[i])
	{
	   array[num++] = i;
	   for (int tmp(i + i); tmp <= upper; tmp += i)
		 sieve[tmp] = 0; //cull composite numbers to not a prime
	}
  }
  return num;
}

I don't know how you managed to make this run as slowly as 0.9 seconds, as on my machine as above it runs in 0.00434 seconds!!!
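
Incidentally, tying back to the stack space point above, a minimal sketch of a driver (my own fragment, not from the original posts) that heap allocates the arrays and calls the reformatted routine for a range of one million would be:

#include <cstdio>

#define MAX_NUM 1000000

int EratosthenesSieve(int upper, unsigned char sieve[], int array[]); //as defined above

int main()
{
	//allocating on the heap avoids overflowing the (typically one Megabyte) stack
	unsigned char *sieve = new unsigned char[MAX_NUM + 1];
	int *primes = new int[78498]; //there are 78498 primes below one million
	int count = EratosthenesSieve(MAX_NUM, sieve, primes);
	printf("%d primes found up to %d\n", count, MAX_NUM);
	delete [] primes;
	delete [] sieve;
	return 0;
}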

It is extremely easy and takes almost no code to add the improvement so that culling composite numbers starts at the square of the prime being culled and the highest prime used for culling is the square root of the maximum number in the range. This trades one square root per run, plus one multiply per base prime less than the square root of the range, for a considerable saving in time, as follows:
int EratosthenesSieve (int upper, unsigned char sieve[], int array[])
{
   /* the function takes the upper limit to which we test the numbers in the arguments, the sieving array, and the array in which we store the found primes,
   and it returns the exact number of primes found */
  int num=0;
  memset(sieve, 1, upper + 1); //make the logic so true means prime (sized from the upper argument)

  const int SQRT_LIMIT = (int)sqrt((double)upper);
  for (int i(2); i <= SQRT_LIMIT; ++i)
  {
	if(sieve[i])
	{
	   array[num++] = i;
	   for (int tmp(i * i); tmp <= upper; tmp += i)
		 sieve[tmp] = 0; //cull composite numbers to not a prime
	}
  }
  for (int i(SQRT_LIMIT + 1); i <= upper; ++i)
	if (sieve[i])
	   array[num++] = i;
  return num;
}

This code takes about 0.00328 seconds on my machine as above.

As you say, there are other techniques, including improvements to this one, that can reduce the execution time further, with the first thing to try being to pre-eliminate the even numbers, since two is the only even prime, as I did in my second post in this thread (the first post with code), for a further reduction of almost a factor of two in composite number culling time and a possible reduction of about a factor of two in the required sieve array size; a sketch of that odds-only variation follows.
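
As a rough sketch of that odds-only idea (my own illustration, not code from the earlier posts), the sieve array can store only the odd numbers, with index i representing the number 2*i + 3, so that both the culling work and the array size are roughly halved:

#include <cstring>

int EratosthenesSieveOdds(int upper, unsigned char sieve[], int array[])
{
	//sieve[i] represents the odd number 2*i + 3; even numbers are never stored;
	//assumes upper >= 3 and that sieve[] has at least (upper - 3) / 2 + 1 elements
	int lastndx = (upper - 3) / 2;
	memset(sieve, 1, lastndx + 1); //true means potential prime
	int num = 0;
	array[num++] = 2; //two is the only even prime and is handled separately
	for (int i = 0; i <= lastndx; ++i)
	{
		if (sieve[i])
		{
			int prime = i + i + 3;
			array[num++] = prime;
			//the first composite to cull is prime squared, whose index is 2*i*i + 6*i + 3;
			//long long avoids overflow of the start calculation for larger ranges
			for (long long j = 2LL * i * i + 6LL * i + 3; j <= lastndx; j += prime)
				sieve[j] = 0; //cull composite numbers to not a prime
		}
	}
	return num;
}

Calling it as EratosthenesSieveOdds(1000000, sieve, primes) with heap allocated arrays as above should return the same count of 78498.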

Again, I fail to see how you think your routine is simpler than either of the two versions of the Sieve of Eratosthenes I had posted previously, which also are simple incarnations of the Sieve of Eratosthenes.

#14 tpPacino


Posted 26 June 2012 - 02:25 AM

Well, your machine has an i7 processor, which has 4 cores, and thus it handles it a lot faster. And yes, the i7 does integer calculation a lot faster, and I mean a lot faster... It all depends on what machine you run it on. I've got a fairly slow computer, but that isn't the point. The point is, the function does what it's asked to do.
The array:

int sieve[18000];
should have the size of 78499 instead of 18000. That is fairly big...
As I wrote, it calculated 1 mil. primes in 1.29 sec in a for loop.
You did improve the function of the Sieve to run ~0.001 sec faster. That is considerably faster.
And don't take my test results as relevant; I've got a slow computer, regardless of the full optimization.
And the second code, the code you changed, crashes at runtime. I'll try to work something out so it works for me.

#15 GordonBGoodness


Posted 26 June 2012 - 07:08 PM

Well, your machine has an i7 processor, which has 4 cores, and thus it handles it a lot faster. And yes, the i7 does integer calculation a lot faster, and I mean a lot faster... It all depends on what machine you run it on. I've got a fairly slow computer, but that isn't the point. The point is, the function does what it's asked to do.
The array:

int sieve[18000];
should have the size of 78499 instead of 18000. That is fairly big...
As I wrote, it calculated 1 mil. primes in 1.29 sec in a for loop.
You did improve the function of the Sieve to run ~0.001 sec faster. That is considerably faster.
And don't take my test results as relevant; I've got a slow computer, regardless of the full optimization.
And the second code, the code you changed, crashes at runtime. I'll try to work something out so it works for me.


You seem to be a little confused in several respects.

First, although my Intel i7 does indeed have more cores than your dual core (it has 4 full cores and can sometimes run eight separate threads using Hyper-Threading - HT), that has nothing to do with the difference in speed for this little Sieve of Eratosthenes (SoE) routine, as the current examples run as only a single thread. One can quantify exactly why it is faster for each of Trial Division and the SoE, beyond the raw CPU clock speed which is about 1.5 times faster, as follows:

a. For the Trial Division algorithm, spending almost all of its time doing 32 bit divisions, my Sandy Bridge i7 takes an average of about 20 clock cycles per division whereas, depending on your Dual Core version, yours may take an average of about 40 clock cycles per division. Combined with the ratio of clock speeds, there should be about a factor of three difference between us in running the same routine, whereas the factor is closer to five, which is why I think you may be using a different, less efficient compiler or haven't enabled full optimization. As you say, it doesn't really matter what the absolute time required is, but the relative time between this and the SoE is important, as it shows the relative efficiency of the two algorithms if one were trying to evaluate them for even larger ranges.

b. For the second function implementing the SoE algorithm, raw CPU efficiency isn't as important as memory access speed, including the availability of caches large enough to hold the sieve[] array. For this algorithm, which spends a large part of its time "cache thrashing" the huge sieve[] array, upper end processors such as the i7 have a great advantage due to their large low latency 8 Megabyte L3 cache, which is sufficiently large to hold the entire sieve[] array for this one million range even when it contains 32-bit int's as you have written it (requiring four Megabytes). This is likely the reason for the difference in timing such that the true SoE isn't that much faster on your machine than Trial Division. The reason I brought it up is that the SoE can be many times faster than Trial Division, and the difference will get larger the bigger the range of numbers sieved, which will be true even for your machine, as the memory bottleneck is more or less a constant with the SoE whereas the number of division operations keeps growing much faster than linearly with range for trial division. Better implementations of the SoE use a page segmentation scheme so that the sieve[] array can stay smaller than the cache size even for ranges into the billions or more, preferably even using only the L1 data cache as that is the fastest.

I'm starting to wonder if you are part of the Internet phenomenon of self proclaimed experts who post other people's code and don't really understand it themselves. The reason for starting to wonder is that your first posted code (the second, SoE, algorithm) doesn't work as posted, and that the second paragraph of your post as quoted doesn't make any sense.

Your original EratosthenesSieve() function still won't work with a first argument ("upper") of 1000000 even if the sieve[] array is sized at 78499 as you suggest in the quoted post. You are correct that the last argument, the result "array", does need to have a size at least this big in order to hold the found primes up to one million (and it isn't strictly necessary at all if we are only counting the number of primes). However, as written both in your original post and with this correction making sieve[] have a size of 78499, the program will "crash" in the nested loops due to indexing outside the bounds of the array up to the array index of "upper". Thus, the sieve[] array needs to be of size one greater than "upper" in order to work. This is the same reason that the second code I provided, which I tested before posting, crashes for you, since one has to provide a sieve[] array of this size of "upper" plus one as the second argument in order to not index outside the bounds of the array.

Another reason to think that you haven't actually run this second SoE code is that you wrote in your first post that you calculated one million primes in a for loop using the first Trial Division algorithm in 1.29 seconds, which I assumed you miswrote to actually mean that you calculated the primes in a range up to one million in 1.29 seconds to produce the first 78498 primes, which is possible, since that is what the second algorithm seems to be meant to do and that is what the OP did with his code. Producing the first million primes would require sieving up to a range of almost 20 million, which would be a much larger job and wouldn't be done in around a second on your machine (or even my i7 machine) using Trial Division. Then you stated in your first post that the second algorithm did the same work in about 0.9 seconds. Now you state that with a sieve[] size of 78499 elements the second of your algorithms in your first post actually works in sieving the range up to one million and that it now takes 1.29 seconds (not about 0.9 seconds as you first stated), which is impossible for this algorithm without declaring "int sieve[1000001];", which will in turn likely crash at run time due to requiring too much stack space. As well, although my Intel i7 is fast, I have old 1.6 Gigahertz single core machines here with small caches and slow memory that will run this same algorithm in less than 0.1 seconds, not about a second. What gives here???

I do believe that you have actually run the Trial Division sieve as described in 1.29 seconds, but so far there is no evidence that you have ever actually got the second SoE algorithm working, at least for a range ("upper" argument) of one million.

All right, I see from your profile that English is likely not your native language and that you are only 17 and likely still learning, but why bother posturing that you know more and have done more than you actually know or have? It would seem to me that this will prevent you from learning as much as you could from discussions such as this one.

Your mistaking the Trial Division sieve for an example of the Sieve of Eratosthenes is a mistake made by many with much more experience than you have, and it is true that it is one of the easiest algorithms to express very concisely; in fact, using recursion (tail calling the same routine) it can be expressed even more concisely, as per the Wikipedia article. However, as explained there, it isn't a very good algorithm for even a range of one million, being dog slow due to the huge number of numerical computations required, which grows much faster than linearly with range.
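
As a hedged aside, such a recursive formulation of trial division (my own sketch, not the Wikipedia code) can look like this:

bool IsAPrimeRecursive(int n, int divisor = 3)
{
	//trial division expressed by tail calling the same routine with the next odd divisor
	if (n == 2) return true; //only even prime is 2
	if (n < 2 || (n & 1) == 0) return false; //prescreen 0, 1, and even numbers > 2
	if (divisor * divisor > n) return true; //no factor found up to the square root
	if (n % divisor == 0) return false;
	return IsAPrimeRecursive(n, divisor + 2);
}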

There are many more things that can be learned from the second true SoE algorithm as to ways of best adapting it to take maximum advantage of CPU abilities, but you have to get past the point of declaring something works when it doesn't.

#16 GordonBGoodness


Posted 27 June 2012 - 11:21 PM

Your first example is both incomplete and incredibly inefficient as to time performance, since it is actually a Prime Sieve by Trial Division, which is often mistakenly referred to as the Sieve of Eratosthenes, as per the OP's Wikipedia link which references this method: http://en.wikipedia....of_Eratosthenes. The trial division by all numbers up to the square root of the test number is implied by the '%' modulo operation. Although Prime Sieve by Trial Division is very simple in terms of lines of code required, it is very computationally complex, meaning that the CPU works very hard at computations and the number of those computations grows much faster than linearly with greater ranges. In fact, your logic doesn't match your "IsAPrime" function name, as you return 'true' when the trial division modulo is zero, showing that the test number can be factored, which should mean that the test number is not a prime even though this says IsAPrime = true. The full implementation of this would look something like the following, just counting the number of primes up to the range:

bool IsAPrime(int n)
{
	for(int i(2); i<=sqrt(n); i++)
		if(n%i == 0) return false;
	return true;
}

//must call the above over a range in order to do something

#define MAX_NUM 1000000

int count = 0;
for (int i = 2; i <= MAX_NUM; ++i)
	if (IsAPrime(i))
		++count;

Now, perhaps due to errors, it seems that 1.29 seconds with a 2.56 Gigahertz CPU is fairly slow at this, as my Intel i7-2700K at 3.5 Gigahertz takes just 0.265 seconds to get the correct count of 78498 with full optimization enabled in release mode. But perhaps the i7 performs integer divisions faster, as that is mostly how the routine is spending its time (trial division, remember), or perhaps you were running in a debug mode without full optimization turned on.


The following is a bit of a tutorial to show that even a poor algorithm such as Prime Sieving by Trial Division (PSbTD) can be improved by tweaking the algorithm, although of course most improvements that can be made to PSbTD can also be made to the Sieve of Eratosthenes (SoE) in order to make it also run faster.

Although the thread subject is the Sieve of Eratosthenes (SoE), it may be instructive for anyone following this thread to have a further look at PSbTD, since it has been brought up and since many mistake PSbTD for a version of the SoE. It does not qualify as the SoE because the SoE clearly sieves a range of integer numbers, culling composite numbers by repetitively indexing to the next composite by an additive span of exactly the size of the prime being culled, whereas PSbTD does just as the name says: it eliminates composites by determining whether the trial number can be evenly divided by candidate factors. This is also why the OP's example is not really the SoE, in that he uses integer multiplication to determine the next culled composite number rather than additive spans; while CPU multiplications are faster than the integer divisions used by PSbTD, they are not as fast as the integer additions used by the SoE.

First, to show that the complexity of PSbTD isn't really suitable for sieving numbers even up to ten million, it can easily be determined that the inner loop (including an integer division) of the IsAPrime() function runs 67,740,404 times just to sieve up to one million, and that it runs 1,736,874,713 times to sieve up to ten million; in other words, the number of operations increases by a factor of about 25 for a range increase of only 10 times. By contrast, the basic SoE as discussed only runs its innermost culling loop a total of 2,525,210 times for a range of one million and a total of 26,965,738 times for a range of ten million, a ratio of just over 10 times for 10 times the range, and it is also using much faster addition operations in the inner loop rather than divisions. This is why I'm surprised that tpPacino's results aren't relatively much faster using the SoE.
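
Those operation counts are easy to tally for oneself; a minimal sketch (my own instrumentation fragment, with a hypothetical global counter) for the trial division case is:

long long divisions = 0; //hypothetical global tally of inner loop divisions

bool IsAPrimeCounted(int n)
{
	for (int i(2); i <= sqrt((double)n); i++)
	{
		++divisions; //count each trial division performed
		if (n % i == 0) return false;
	}
	return true;
}

//calling IsAPrimeCounted(i) for all i from 2 up to the range limit and then
//printing the divisions total gives figures like those quoted above; the same
//kind of counter dropped into the SoE culling loop tallies its work as well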

Now, as to programming algorithms, improvements can be made to PSbTD even though it will never match the speed of the SoE with equivalent improvements. The first one is to have the IsAPrime() function immediately return true for an argument of two and return false for any even argument greater than two, as two is the only even prime. This will increase the speed by a factor of about four by not doing anything for even arguments and only dividing about half the number of times when the argument is odd, although some compilers are smart enough to have automatically optimized away half the operations by recognizing that an odd number divided by an even number will never divide evenly. The code for the new, more efficient IsAPrime() function is as follows:
bool IsAPrime(int n)
{
  if (n == 2) return true; //only even prime is 2
  if ((n & 1) == 0) return false; //prescreen other even numbers > 2 (parentheses needed since == binds more tightly than &)
  for (int i(3); i <= sqrt((float)n); i += 2) //only check for odd multiples
    if(n % i == 0) return false;
  return true;
}

Other than using the innately slower integer divide operations, PSbTD is slow because it has to repeatedly divide by the whole range of (now only odd) numbers, but by doing this it has an advantage in that it doesn't require any significant storage, as the factors are rescanned every time it is called. If we allow the algorithm to retain some storage of the primes up to the square root of the maximum range (only 1228 odd primes up to 10,000, which is all that is required for a total range of a hundred million), then we don't have to scan across all of the (odd) numbers but can rather just check for factors which are primes, since any composite number will always be composed of a product of primes.

This can be achieved very concisely by recursively calling a new version of the IsAPrime() function from inside itself, with the logic to end the recursion already included by the square root test, as the inner loop containing the recursion will never run for an argument less than nine. Also, for any reasonable range we don't need to worry about stack overflow due to the recursion, as the range of each successive recursion is reduced to the square root of the range of the outer one; thus, a range of even 100 million only requires about five recursive levels.

Now, the base primes storage could be implemented as a linked list of values that can grow to fill all of heap memory, but to keep this simple let's just use a basic integer array of sufficient size and an integer variable to keep track of the number of base primes found, which will be one greater than the index of the last prime. This small change will make the code run many times faster yet, as now we are only dividing by primes, and there are many fewer primes than numbers. This code is as follows:
int baseprimes[2000]; //a store for found base primes
int numbaseprimes = 0; //value for number of base primes found so far
bool IsAPrime(int n)
{
  if (n == 2) return true; //only even prime is 2
  if ((n & 1) == 0) return false; //prescreen other even numbers > 2
  int factor = 1; //initial value two less than first prime
  //first check argument is a factor of found base primes
  for (int i(0); i < numbaseprimes; ++i)
    if (n % (factor = baseprimes[i]) == 0) return false;
  //then find new base primes as required and check them, also adding them to the store
  for (factor += 2; factor <= sqrt((float)n); factor += 2) //only check for odd multiples
    if (IsAPrime(factor)) //only check prime factors recursively
    {
	  baseprimes[numbaseprimes++] = factor; //increase the base primes storage for new base prime
	  if (n % factor == 0) return false;
    }
  return true;
}
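
A usage sketch for this version (my own fragment, analogous to the earlier driver and meant to live in main(); note that candidates must be tested in increasing order so that a stored base prime is never itself re-tested against the store, which would wrongly report it as composite):

//count the primes up to one million with the recursive version above
numbaseprimes = 0; //start with an empty base primes store
int count = 0;
for (int i = 2; i <= 1000000; ++i)
	if (IsAPrime(i))
		++count;
//count should end up as 78498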

The above code runs many times faster, as it only divides 12,740,206 times for a range of one million and 275,628,759 times for a range of ten million, and the advantage grows even further with higher ranges where primes are sparser; however, note that the number of divide operations still grows faster than the increase in the range size, so it still isn't suitable for large number ranges.

One further simplification can be made to the above sample in the case where one wants to produce a primes array containing all of the found primes, rather than just the base primes array of primes less than the square root of the range, as this eliminates the need for recursively calling to add and check new base primes since all found primes will be included in the primes result array, as follows (NOTE: this only works for scanning by incrementing from the bottom of the range to the top):
int primes[100000]; //a store for found primes
int numprimes = 0; //value for number of primes found so far
bool IsAPrime(int n)
{
  if (n == 2) return true; //only even prime is 2
  if ((n & 1) == 0) return false; //prescreen other even numbers > 2
  //check if argument is a factor of found primes
  for (int i(0); (i < numprimes) && (primes[i] <= sqrt((float)n)); ++i) //bound i so we never read past the stored primes
    if (n % primes[i] == 0) return false;
  if ((numprimes == 0) || (n > primes[numprimes - 1]))
    primes[numprimes++] = n; //increase the primes storage for new prime
  return true;
}

The above code doesn't really save any time, as it only eliminates the few extra divisions used in recursively finding and adding the extra base primes to the array, but it is a little simpler, although it has the limitation that the potential primes must be scanned in increasing order.

The point of this exercise is to show that small tweaks can greatly improve an algorithm's performance, even a slow algorithm such as PSbTD, whose main advantage (for small ranges of numbers to sieve) is a very small code size and small storage requirement but whose computational load increases greatly with larger ranges even with optimizations as shown here. By contrast, the computational load for the SoE is many times smaller for larger ranges even though the code is still quite simple. Optimizations as shown here can also be applied to the SoE, and other optimizations can also be applied to reduce storage requirements by bit packing the sieve[] array whilst not increasing the computational time, although doing so does again require that one keep a representation of the base primes up to the square root of the largest number in the range in memory. While such techniques increase the complexity of the code, they also mean that the code can run twenty to thirty times faster than the SoE code shown here, where that might be a requirement for very large ranges in the billions.

#17 GordonBGoodness


Posted 29 June 2012 - 02:52 AM

There are other simpler solutions... For example, the Sieve of Eratosthenes is good for numbers under 10 mil.

There are many more things that can be learned from the second true SoE algorithm as to ways of best adapting it to take maximum advantage of CPU abilities, but you have to get past the point of declaring something works when it doesn't.


What I am trying to do with the following is show that with a little knowledge of an algorithm and CPU processing bottlenecks, one can greatly increase the performance by tweaks to the algorithm.

I take exception to the statement "the Sieve of Eratosthenes is good for numbers under 10 mil.", which comes from the http://en.wikipedia....of_Eratosthenes link ("The sieve of Eratosthenes is one of the most efficient ways to find all of the smaller primes (below 10 million or so)"), which in turn gets that idea from the linked reference http://primes.utm.ed...eOfEratosthenes ("The most efficient way to find all of the small primes (say all those less than 10,000,000) is by using a sieve such as the Sieve of Eratosthenes (ca 240 BC)"). Since finding the first 10 million primes means sieving a range of about 200 million, this statement in the reference was probably made based on a) limited RAM memory for the sieving array, and b) problems in how the Sieve of Eratosthenes (SoE) algorithm was implemented, in numbers of instructions per clock cycle, due to both limitations in the implementation and the limited speed of access to main RAM memory. The real limits of the basic SoE algorithm are closer to a range of twenty billion for finding the first one billion primes, or about a hundred times larger, if one considers that "good" or "efficient" means being able to compute this range in under a minute on any run-of-the-mill two Gigahertz or faster desktop computer manufactured in the last five years or so, even running as a single thread, where further savings can be realized by running multi-threaded on a multi-core CPU.

First, the SoE doesn't take as much memory as the sample programs posted in this thread would lead one to believe: one actually only requires one bit per number in the range, not one four-byte int as currently implemented, or even one byte per number to sieve as suggested by the OP. Using this, a 20 billion number range sieve would only require an array size of 2.5 billion bytes or just over 2 Gigabytes, with many modern machines running 64-bit operating systems having four Gigabytes or more of RAM.

Further, there are implementations of the SoE that page through the range in segments so that a very small sieve[] array can be used, only requiring that a representation of the base primes up to the square root of the highest number of the range to be sieved be kept in memory. Such a representation can be kept in four bytes per base prime saved, meaning that about 64 Kilobytes are sufficient to store the base primes, and the whole program can be run in less than 128 Kilobytes of memory above the operating system requirements, even for a range up to 20 billion.

There are further efficiency reasons for keeping the sieve[] array small, by both bit packing and page segmentation of the SoE, in reducing memory access times. When a composite number culling array exceeds the size of the L1 data cache of typically 16 to 64 Kilobytes, memory access time changes from about four clock cycles to about ten clock cycles for the L2 cache; when the array exceeds the typical 256 Kilobytes to one Megabyte of the L2 cache, the access time changes to 20 to 40 clock cycles for the L3 cache if there is one; and when that is exceeded, or there is no L3 cache, to about 160 clock cycles of main memory latency. One can clearly see this if one runs one of my versions of the basic SoE, such as the one in the second post from the bottom of the first page of this thread (which uses unsigned char = byte elements rather than four-byte integers as for the OP or tpPacino), with the range MAX_NUM set to several times the largest CPU cache size. For instance, when I run this algorithm for a range of 100 million it takes 0.938 seconds even on my Intel i7 with its 8 Megabyte L3 cache, whereas with a range of only 10 million it takes just 0.047 seconds, for a factor of about 20 times. Now, the amount of time taken should be close to proportional to the ratio of the ranges of ten times, so the extra time is being taken by the extra memory access latency. Slower and older CPUs with less cache and slower memory will be even worse. Segmented pages allowing a smaller sieve[] array size are the answer to this.

But first, let's investigate the effect of bit packing on execution times. Bit packing will have an advantage beyond just the reduced time to cull composite numbers from the sieve array due to memory latencies, as it will also reduce the amount of time it takes to memset() fill the array and to count the remaining prime flag bits (using a lookup table) by a factor of eight. The following code implements the SoE using one large bit packed array for a range of ten million:
#define MAX_NUM 10000000
#define ARRAY_SIZE (MAX_NUM / 8 + 1)
#define SQRT_LIM (unsigned int)(sqrt((double)MAX_NUM)) //square root of MAX_NUM rounded down
unsigned char packed_sieve[ARRAY_SIZE];
void gen_packedsieve_primes(void)
{
  memset(packed_sieve, 0xFF, ARRAY_SIZE); // marks all elements of the bit array as potential primes
  packed_sieve[0] = 0xFC; //set the two bottom bits, meaning the indexes for 0 and 1, to 0 meaning not prime
  packed_sieve[ARRAY_SIZE - 1] = (unsigned char)((0xFE << (MAX_NUM % 8)) ^ 0xFF); //set the unused top bits to 0 meaning not prime
  unsigned char mask = 4; //initialize mask for the bit representing 2
  for (unsigned int prime = 2, word = 0; prime <= SQRT_LIM; ++prime) // for all prime candidates from two up to and including the square root
  {
    if ((packed_sieve[word] & mask) != 0) //if still marked as prime
      //starting at the square of the prime, cull its composites
      for (unsigned int j = prime * prime; j <= MAX_NUM; j += prime)
        packed_sieve[j >> 3] &= (1 << (j & 7)) ^ 0xFF; //cull all composites of this prime up to the limit of the array by bit index
    mask <<= 1;
    if (mask == 0)
    {
      mask = 1;
      ++word;
    }
  }
}

It takes about 0.033 seconds to calculate all of the primes up to ten million as compared to 0.047 seconds for the unpacked version, in spite of the extra operations dealing with bit manipulations, meaning that bit packing is actually saving time even though both arrays (10 Megabytes and 1.25 Megabytes) fit primarily in the 8 Megabyte L3 cache and aren't nearly small enough to fit in the 256 Kilobyte L2 cache. This time saving is mainly due to the reduced time to count the number of primes, as it can be done by eight-element bytes rather than by individual single elements, with the counting function (which takes the byte array as a parameter so that it can serve both this version and the paged version further below) counting all eight bits of a packed byte in one loop step using a lookup table, as follows:
const static unsigned char CLUT[] = {
  0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
  4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 };
unsigned int bytecount(const unsigned char sieve_bytes[], unsigned int lastbytendx)
{
  unsigned int count = 0;
  for (unsigned int i = 0; i <= lastbytendx; ++i)
    count += CLUT[sieve_bytes[i]]; //add the number of prime flag bits set in this byte
  return count;
}
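
As an aside, on compilers that provide a population count builtin (GCC and Clang have __builtin_popcount; this is my suggestion, not part of the original routine), the lookup table can be replaced entirely:

unsigned int bytecount_builtin(const unsigned char sieve_bytes[], unsigned int lastbytendx)
{
  unsigned int count = 0;
  for (unsigned int i = 0; i <= lastbytendx; ++i)
    count += __builtin_popcount(sieve_bytes[i]); //counts the set bits in each byte
  return count;
}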

So, having established that bit packing doesn't cost execution time, let's write a segmented paged version to count the number of primes in the range, reusing the counting-by-bytes routine and bit packing further by eliminating the storage of representations of even numbers, as follows:
#define LARGE_MAX_NUM 17179869183
#define NDX(x) (((x) - 3) / 2) //to get the array index from an odd prime candidate
#define PRIME(i) ((i) + (i) + 3) //reverse of the above for an array representing odds starting with 3 at index 0 = 3,5,7...
//the following should be >= 14, with the optimum size for speed generally 14, 15, or 16 depending on the CPU L1 data cache
#define PWROFTWO 15 //14,15,16,18,19,20,21,22,23 means 16,32,64,256,512,1024,2048,4096,8192 KBytes, etcetera
//the following page array size must be at least the (NDX(square root of MAX_NUM) + 7) / 8 for the algorithm to work,
//which means that for a MAX_NUM of 1000 billion, it needs to be 64 KBytes or PWROFTWO must be at least 16;
//PWROFTWO of 14 means a MAX_NUM no greater than 68,720,001,025.
#define PAGEARRAY_SIZE (1 << PWROFTWO) //2 raised to the PWROFTWO for the sieve array size
#define NUMPAGEBITS (PAGEARRAY_SIZE * 8)
#define SQRT_LIM (unsigned int)(sqrt((double)LARGE_MAX_NUM)) //square root of LARGE_MAX_NUM rounded down
#define SQRTSQRT_LIM (unsigned int)(sqrt((double)SQRT_LIM)) //used to sieve base primes
unsigned char paged_sieve[PAGEARRAY_SIZE];
struct baseprime { //to hold a record of the base primes
  unsigned int deltaprime;
  unsigned int index; //used to track the current index position across multiple pages
};
baseprime base_primes[78497]; //enough to be able to determine the number of primes to 1000 billion
void cullpage(unsigned long long lastbitndx)
{
  unsigned int maxprime = (unsigned int)sqrt((double)PRIME(lastbitndx));
  unsigned int topcull = lastbitndx % NUMPAGEBITS;
  for (unsigned int i = 0, prime = 3; prime <= maxprime; prime += base_primes[i++].deltaprime)
  {
    //starting at the saved page-local index for this prime, cull by spans of the prime over the indexes of the numbers
    unsigned int j = base_primes[i].index;
    for (; j <= topcull; j += prime)
      paged_sieve[j >> 3] &= (1 << (j & 7)) ^ 0xFF; //cull all composites of this prime up to the page limit by bit index
    base_primes[i].index = j - NUMPAGEBITS; //save the page-local index for the next page
  }
}
unsigned long long gen_pagedsieve_primes(void)
{
  unsigned long long llcount = 1; //for the known even prime of 2
  unsigned int count = 0;
  memset(paged_sieve, 0xFF, (NDX(SQRT_LIM) / 8 + 1)); //mark elements of the bit array as potential primes for the base primes
  //cull for the base primes
  unsigned char mask = 1; //initialize mask for the bit representing 3
  for (unsigned int prime = 3, word = 0; prime <= SQRTSQRT_LIM; prime += 2) //for all odd prime candidates starting at 3
  {
    if ((paged_sieve[word] & mask) != 0) //if still marked as prime
    {
      //starting at the square of the prime, cull composites working by the indexes of the numbers
      for (unsigned int j = NDX(prime * prime); j <= NDX(SQRT_LIM); j += prime)
        paged_sieve[j >> 3] &= (1 << (j & 7)) ^ 0xFF; //cull all composites of this prime by bit index
    }
    mask <<= 1;
    if (mask == 0)
    {
      mask = 1;
      ++word;
    }
  }
  //save a copy of the base primes culled
  mask = 1;
  for (unsigned int prime = 3, oldprime = prime, word = 0; prime <= SQRT_LIM; prime += 2)
  {
    if ((paged_sieve[word] & mask) != 0) //if still marked as prime
    {
      if (count > 0)
        base_primes[count - 1].deltaprime = prime - oldprime;
      oldprime = prime;
      base_primes[count++].index = NDX(prime * prime) % NUMPAGEBITS;
    }
    mask <<= 1;
    if (mask == 0)
    {
      mask = 1;
      ++word;
    }
  }
  base_primes[count - 1].deltaprime = 2000; //a bigger prime gap than will ever be seen, as an indexing limit
  unsigned long long topbitndx = NDX(LARGE_MAX_NUM);
  //fill, cull and count by pages
  for (unsigned long long maxpgndx = NUMPAGEBITS - 1; maxpgndx <= topbitndx; maxpgndx += NUMPAGEBITS)
  {
    memset(paged_sieve, 0xFF, PAGEARRAY_SIZE); //mark elements of the bit array as potential primes for each page
    cullpage(maxpgndx);
    llcount += bytecount(paged_sieve, PAGEARRAY_SIZE - 1);
  }
  //fill, cull and count the last partial page
  unsigned int lastpgbitndx = topbitndx % NUMPAGEBITS;
  unsigned int lastbytendx = lastpgbitndx / 8;
  unsigned int lastbitmod = lastpgbitndx % 8;
  memset(paged_sieve, 0xFF, lastbytendx + 1); //mark elements of the bit array as potential primes for the top page
  paged_sieve[lastbytendx] = (0xFE << lastbitmod) ^ 0xFF; //set the unused top bits to 0 meaning not prime
  cullpage(topbitndx);
  llcount += bytecount(paged_sieve, lastbytendx);
  return llcount;
}

This will count all the 762,939,111 primes up to the 34 bit number range of 17,179,869,183 in about 25 seconds on the Intel i7-2700K using a page array size of 16 or 32 Kilobytes, but there are further optimizations called wheel factorization that will reduce the work by about another three times, as well as further code efficiency "loop unrolling" optimizations to be used. Also, I haven't turned on multi-core threading, which is quite easy now that the work has been segmented into pages, in that we can just assign a group of pages per thread per core for an almost exact division of the work with very little threading overhead (a rough sketch of that idea follows below). Thus, I think I have proven that any CPU built in the last five years should be able to run this SoE algorithm up to a range of 20 billion, or about one billion primes, in well under a minute and likely no slower than about 30 seconds.
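
As a rough sketch of that pages-per-thread idea (my own illustration assuming a C++11 compiler with std::thread, and using a simple byte-per-number segment buffer rather than the bit packed pages above, so it models the approach rather than extending the exact code just shown):

#include <cstdio>
#include <cstring>
#include <cmath>
#include <vector>
#include <thread>
#include <atomic>

//count the primes in [lo, hi] by sieving one small segment at a time with the
//precomputed base primes, using one byte per number for simplicity
static void countrange(unsigned long long lo, unsigned long long hi,
                       const std::vector<unsigned int> &baseprimes,
                       std::atomic<unsigned long long> &total)
{
  const unsigned long long SEGBYTES = 32768; //roughly L1 data cache sized
  std::vector<unsigned char> seg(SEGBYTES);
  unsigned long long count = 0;
  for (unsigned long long base = lo; base <= hi; base += SEGBYTES)
  {
    unsigned long long top = base + SEGBYTES - 1;
    if (top > hi) top = hi;
    memset(&seg[0], 1, (size_t)(top - base + 1)); //mark the segment as all potential primes
    for (size_t k = 0; k < baseprimes.size(); ++k)
    {
      unsigned long long p = baseprimes[k];
      if (p * p > top) break; //no base prime above the square root of the segment top is needed
      unsigned long long start = ((base + p - 1) / p) * p; //first multiple of p within this segment
      for (unsigned long long j = start; j <= top; j += p)
        seg[(size_t)(j - base)] = 0; //cull composite numbers
    }
    for (unsigned long long n = base; n <= top; ++n)
      if (seg[(size_t)(n - base)]) ++count;
  }
  total += count; //single atomic addition per thread
}

int main()
{
  const unsigned long long N = 1000000000ULL; //count the primes up to one billion
  //first generate the base primes up to the square root of N with a plain small sieve
  unsigned int sqrtn = (unsigned int)sqrt((double)N);
  std::vector<unsigned char> smallsieve(sqrtn + 1, 1);
  std::vector<unsigned int> baseprimes;
  for (unsigned int i = 2; i <= sqrtn; ++i)
    if (smallsieve[i])
    {
      baseprimes.push_back(i);
      for (unsigned int j = i * i; j <= sqrtn; j += i)
        smallsieve[j] = 0;
    }
  //then split the remaining range into one contiguous block of segments per thread
  unsigned int nthreads = std::thread::hardware_concurrency();
  if (nthreads == 0) nthreads = 4; //fall back if the core count cannot be detected
  std::atomic<unsigned long long> total((unsigned long long)baseprimes.size());
  std::vector<std::thread> pool;
  unsigned long long lo = sqrtn + 1;
  unsigned long long span = (N - sqrtn) / nthreads + 1;
  for (unsigned int t = 0; t < nthreads && lo <= N; ++t)
  {
    unsigned long long hi = lo + span - 1;
    if (hi > N) hi = N;
    pool.push_back(std::thread(countrange, lo, hi, std::cref(baseprimes), std::ref(total)));
    lo = hi + 1;
  }
  for (size_t t = 0; t < pool.size(); ++t)
    pool[t].join();
  printf("%llu primes found up to %llu\n", (unsigned long long)total, N); //should print 50847534
  return 0;
}

The work divides almost evenly between the cores with no locking except the one atomic addition per thread, which is why the threading overhead stays negligible.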

If you want to play with a version of the SoE that includes all of the above optimizations and more, you can download binaries for various platforms and source code from: http://code.google.com/p/primesieve/, although the source code will likely not be easily understandable by most of the people learning C++ who seem to frequent this forum.

Now, as to why this paging algorithm is faster than the large array models: it is easy to prove just by varying the PAGEARRAY_SIZE from the 16 Kilobytes as it is now on up by factors of two. Counting the 50,847,534 primes up to one billion slows just a little as the array is increased to the L2 cache size of 256 Kilobytes, the losses due to increased memory latencies being compensated by the greater composite number culling efficiency of larger arrays, but when the array size is increased to the L3 cache size of eight Megabytes the algorithm runs half as fast as at its fastest. When the array is increased by a factor of eight more to 64 Megabytes, so that main memory accesses need to be used more, the time almost doubles yet again. The same algorithm is thus doing the same work but is slower due to increased time spent waiting for memory. My simpler and less refined code shows some of this, but not to the extent of the highly optimized "primesieve" as per the above link, which keeps the CPU running at maximum, so that for it the bottleneck of increasingly slower memory access with larger sieving array sizes becomes even more dramatic.

But isn't it amazing that, using the SoE optimizations of "primesieve", one can count all the primes up to 13 digits in well under an hour on a typical higher-end desktop computer of today, where it likely took many, many hours on the fastest "supercomputer" that existed in 1979 when this range of primes was first computed! And the algorithm that was used then? It was a version of the SoE, although likely not quite as highly optimized as "primesieve".




