Wednesday, May 30, 2012

Fun with Python and Bit Sets

Bit sets? What are they? And Why?

First off, what's a bit set? Well, it's a particular way to represent sets of integers. And it's a pretty cool way, too! Let's say you expect integers in the range [0, N): you can represent any such set as an N-bit string, where the a-th bit is 1 if and only if a is in the set.
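To make that concrete, here is a minimal sketch of a bit set backed by a plain Python integer, where adding and testing element a are just shifts and masks (the helper names are mine, purely for illustration):

# A tiny bit-set sketch using a plain Python int as the backing store.
# Bit a is 1 exactly when a is in the set.

def bitset_add(bits, a):
    return bits | (1 << a)        # turn on bit a

def bitset_contains(bits, a):
    return (bits >> a) & 1 == 1   # test bit a

s = 0                             # the empty set
for a in (2, 5, 7):
    s = bitset_add(s, a)

assert bitset_contains(s, 5)
assert not bitset_contains(s, 3)

Because the value of the integer is determined entirely by the set's contents, equal sets always map to the same number; that canonical-representation property is what the rest of this post leans on.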

Now for an Example

For whatever reason, I recently found myself trying to solve the following problem: given n points in the plane, find a set of lines such that, for any line that partitions the n points into two groups, there is a line in the set that produces that same partition.

How do we go about solving this? I chose the most straightforward approach: exhaustive search. I took every pair of points in the set (that's n choose 2 pairs) and, for each pair, formed two lines that cut between the two points. Now, as I check all the partitions formed by these lines, I need to know whether or not I have seen a particular partition yet. How do I represent partitions, and how do I check them quickly?

Bit sets.

Not only are bit sets fast and compact, but they have a wonderful property: a given set has exactly one representation as a sequence of bytes. So to check whether or not I have seen a set before, I can just store the previously seen sets in a tree or hash set. That makes the check very fast compared to testing it against every set that has come before.
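As a rough illustration of how this plugs into the partition problem (a sketch of my own, with hypothetical helpers side_of_line and partition_key, not the actual search code), each candidate line induces a 0/1 membership vector over the points, and a set of those vectors remembers which partitions have already been seen:

# Sketch: deduplicating point partitions via a canonical 0/1 membership key.

def side_of_line(line, point):
    # Sign of the cross product tells us which side of the line the point is on.
    (ax, ay), (bx, by) = line
    px, py = point
    return 1 if (bx - ax) * (py - ay) - (by - ay) * (px - ax) > 0 else 0

def partition_key(line, points):
    # One bit per point: 1 if the point is on the "positive" side of the line.
    return tuple(side_of_line(line, p) for p in points)

points = [(0, 0), (1, 3), (4, 1), (2, 2)]
candidate_lines = [((0, 1), (3, 2)), ((1, 0), (2, 4))]

seen = set()
for line in candidate_lines:
    key = partition_key(line, points)
    if key not in seen:   # cheap lookup, because the key is canonical
        seen.add(key)

The key point is that the tuple of bits is canonical: the same partition always produces the same key, no matter which line generated it.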

Can't You Use Hash Sets?

The problem with hash sets is that one logical set can be represented by many different in-memory structures. For example, if multiple values hash to the same bucket, you have to store them in a linked list, and that list can be ordered differently or live at different pointer values from one set to the next. For all of these reasons, checking whether two hash sets are equal requires an element-by-element comparison rather than a simple comparison of their bytes.

Using Bit Sets in Python

So how are we going to do this? First, let's see if anyone has implemented this before. A quick Google search leads us to the bitarray library. Oh wonderful, it looks like bitarray is more or less what we're looking for.
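For reference, here is a minimal sketch of the kind of usage I have in mind (treat the exact method names as illustrative; check the bitarray documentation for the version you have installed):

import bitarray

# Build a bit set from a list of 0/1 flags.
b = bitarray.bitarray([1, 0, 1, 1, 0])

assert len(b) == 5
assert b[2]          # the bit at index 2 is set

# bitarray objects are mutable, so depending on the version they may not hash
# the way you'd like.  One workaround is to key a set or dict on the raw bytes
# instead (tobytes() in recent releases; older ones called it tostring()).
seen = set()
seen.add(b.tobytes())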

Aside from some baffling initialization behavior and the lack of a good __hash__ function, this is good stuff. Let's see how it stacks up. First, I'll hack together a simple testing script...

import bitarray, random, timeit, array
import sys
set_size = 100
num_sets = set_size*set_size

def test_pytuple():
    seen_before = set()
    rng = random.Random()
    
    for i in range(0, num_sets):
        b = (rng.randint(0,1) for x in range(0, set_size))
        if b not in seen_before:
            seen_before.add(b)
        
def test_bitarray():
    seen_before = set()
    rng = random.Random()
    
    for i in range(0, num_sets):
        b = bitarray.bitarray([rng.randint(0,1) 
                               for x in range(0, set_size)])
        if b not in seen_before:
            seen_before.add(b)

def test_array():
    seen_before = set()
    rng = random.Random()
    
    for i in range(0, num_sets):
        b = array.array('b', (rng.randint(0,1) 
                              for x in range(0, set_size)))
        if b.tostring() not in seen_before:
            seen_before.add(b.tostring())
    

if __name__ == "__main__":
    if sys.argv[1] == "bitarray":
        t = timeit.Timer(test_bitarray)
        print "Bitarray average time: %f s" % (t.timeit(10)/10.0)
    elif sys.argv[1] == "tuple":
        t = timeit.Timer(test_pytuple)
        print "pytuple average time: %f s" % (t.timeit(10)/10.0)
    elif sys.argv[1] == "array":
        t = timeit.Timer(test_array)
        print "array average time: %f s" % (t.timeit(10)/10.0)

As you can see, I added code for testing a couple of alternatives as well. Let's see what happens:

$ python bitarrays.py bitarray
Bitarray average time: 2.327128 s

Well, that's not great. What if I just use Python tuples or the array module?

$ python bitarrays.py tuple
pytuple average time: 0.037775 s
$ python bitarrays.py array
array average time: 2.461786 s

Tuples are Better?!

Hammer of Thor! It looks like the tuple implementation is a lot faster. Why is that happening? Is there something wrong with the way I am testing this? Something looks fishy here, particularly this line:

b = (rng.randint(0,1) for x in range(0, set_size))

Why is this line fishy? Well, it never builds a tuple at all: it's a generator expression, so the random bits are never actually generated, and the set ends up holding cheap generator objects rather than real bit vectors. Meanwhile, the other tests pay to build a full structure from the random values. What if we force the tuple version to do the same work, like this:

b = tuple([rng.randint(0,1) for x in range(0, set_size)])

Then its run time shoots up to 2.292623 s! Okay, so we've discovered that what really dominated this silly test was constructing each bit vector, not the set lookups themselves. But what about memory usage? Surely, bitarrays must have an advantage there. After checking peak memory usage with a memusg script, I found the following:

Structure    Peak Memory Allocation (MB)
tuple        21408
bitarray     7400
array        7400

Importantly, this measures allocated memory rather than actual memory usage. Still, we can see, as one would expect, that tuples require quite a bit more memory: they represent each boolean as a full Python integer object, while arrays and bitarrays pack the values much more tightly.
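If you don't have a memusg-style wrapper handy, a rough substitute (my own sketch, Unix-only, and it reports peak resident set size rather than allocations) is the standard resource module:

import resource

# Peak resident set size of the current process so far.
# On Linux ru_maxrss is in kilobytes; on Mac OS X it is in bytes.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print "peak rss: %d" % peak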

So what have we learned? Well, you can represent compact sets using an array of booleans, and it's efficient to do this. One clever way to do this is using bitarrays, but they may be a bit too clever for what you actually want to do. In Python, you may just want to use a tuple of integers and call it a day.