Measuring Pop Count

January 9, 2020

Let us try to time the various algorithms, to count the number of 1s in an unsigned integer. Spoiler: The results might surprise you.

To take care of the logistics, we shall use a random number generator to generate numbers between 0 and 2^32 - 1. The function generate_unsigned_random_ints(N) will take a number N and will generate N numbers within the range and return them as a list. The implementation is as below:

def generate_unsigned_random_ints(N):
    numbers = []
    for n in range(N):
        numbers.append(random.randint(0, (2**32) - 1))
    return numbers

And, now that we have the generator in place, we shall now look at the various algorithms that we can use to count the number of 1s in a number.

Naive Count #

Here, we count the number of 1s in a number by — right shifting by 1 each time and then using a mask (0X1) to find if the LSB is 1 or 0. This is a very simple algorithm and can be implemented as below:

def count_bits_naive(numbers):
    result = []
    mask = 0x1
    for num in numbers:
        count = 0
        while num != 0:
            count = count + (num & mask)
            num = num >> 1
        result.append(count)
    return result

Here, numbers is a list of random numbers, generated using the generate_unsigned_random_ints(N) function seen earlier. We run our algorithm, on each number until we have seen all the bits in it. The runtime would be in the order of O(B) for each number in N, where B is the number of bits in the number. The total runtime would be O(N * B).

Faster Count #

Let us try to speed up on the naive algorithm that we saw above. We will work with the fact that the following operation: num & (num - 1), results in zeroing out the rightmost 1 in the given number. This can be seen below:

Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 15
>>> bin(x)
'0b1111'
>>> y = x & (x - 1)
>>> y
14
>>> bin(y)
'0b1110'
>>> x = 20
>>> bin(x)
'0b10100'
>>> y = x & (x - 1)
>>> y
16
>>> bin(y)
'0b10000'

We will again iterate over each randomly generated number in the list and perform num & (num -1) until it is zero. The algorithm can be implemented as:

def count_bits_better(numbers):
    result = []
    for num in numbers:
        count = 0
        while num != 0:
            count = count + 1
            num = num & (num - 1)
        result.append(count)
    return result

For the above, the runtime for each number is in the order of O(B), where B is the number of 1s in the number’s binary representation. The total runtime would be O(N * B), where N is the size of the list of numbers.

Count by Lookup #

Now, let us try to divide and conquer the counting of number of 1s in a number. The idea is to break up a number into different parts and use a lookup table to calculate the sub-part and add all the results back together. We will break up the 32bit number to four parts (8bit each). We will then use up the lookup table to find the number of 1s for that 8bit number. The final result will be the addition of all the 8bit results. The algorithm can be implemented as below:

def count_bits_lookup(numbers):
    lookup_table = [
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4,
            3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5 , 2, 3, 3, 4, 3, 4, 4, 5,
            3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4,
            3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4,
            3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5,
            3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6 , 7, 2, 3, 3, 4,
            3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6,
            7, 7, 8
    ]
    result = []
    for num in numbers:
        first_byte = (num & 0xFF) >> 0
        second_byte = (num & 0xFF00) >> 8
        third_byte = (num & 0xFF0000) >> 16
        fourth_byte = (num & 0xFF000000) >> 24
        count = lookup_table[first_byte] + lookup_table[second_byte] \
                + lookup_table[third_byte] + lookup_table[fourth_byte]
        result.append(count)
    return result

Here, we break up the 32bit number into 4 different numbers, each of size 8bits (by masking and shifting). We then lookup the table to find the number of 1s for that number and finally we add it up to get the result for the 32bit number. The size of the lookup table is 256 (2^8), since each 8bit number can be in the range 0-255. The lookup table can be generated as:

Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> lookup_table = []
>>> for num in range(256):
...   lookup_table.append(bin(num).count('1'))
...

The total time taken would be O(C * N), where C is some constant time taken for the various masking, shifting and lookup operations. And this is assuming that the operations mentioned take constant time.

Count by Transforming as String #

In the final algorithm, we shall convert the number as string and use stdlib to count the 1s in the number. The algorithm would be implemented as:

def count_bits_inbuilt(numbers):
    result = []
    for num in numbers:
        result.append(bin(num).count('1'))

This is straightforward and the total time would be O(N * M), where M is the length of the string and assuming a left to right iteration.

Profiling #

Let us profile the above algorithms to see how they perform on runs with a list of numbers. We will use the cProfile module to run the various functions and time them for us. Although, cProfile will show the profile stats of the whole program, we are interested in only the various algorithms shown above. We can run the profiler like (We are running the algorithms on a list of size 1M — passed to the program):

python -m cProfile bit_count_test.py 1000000

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
....
....
 1    0.893    0.893    0.976    0.976 bit_count_test.py:15(count_bits_lookup)
 1    4.579    4.579    4.654    4.654 bit_count_test.py:4(count_bits_naive)
 1    2.226    2.226    2.299    2.299 bit_count_test.py:40(count_bits_better)
 1    0.376    0.376    0.908    0.908 bit_count_test.py:50(count_bits_inbuilt)
....
....

Well, the results are surprising. The in-built string comparison actually outperforms all the other algorithms. Looks like the string count algorithm is heavily optimized and although the problem exhibits a tight loop, the runtime is very fast (Probably we can look at its implementation in another post :)). Next, the lookup performs much much better than the other two counting algorithms. Finally, the better counting algorithms is twice as fast as the naive counting algorithm.

Although, in this post we have optimized for runtime, the lookup algorithm is better than the other two counting algorithms. If you look at the memory consumption, the lookup actually uses additional memory, albeit 256 * (sizeof(32-bit int)). Maybe, when optimizing for space, this is the worst algorithm. Also, keep in mind that when developing in languages closer to the metal, compilers might provide you a fast implementation of pop-count that you can use instead that is optimized for speed/space.

That’s it. Thanks for reading.

←

Linux: Page Cache

Custom Deleters in Unique and Shared Pointers

→