Friday, December 21, 2012

Using Python array.arrays efficiently

In Python, the list is the workhorse "array-like" data structure, but the standard library does provide an alternative: the array, which is defined in a module of the same name.  Most of the time, lists are the best (and certainly the most versatile) choice, but the array.array is useful any time you need a thin Python veneer on top of a C-style array.  For instance, perhaps you need the memory efficiency of a fixed-type array structure, or you need to create or process simple, byte-packed, sequential data.

The array.array was perfect for a project I am working on, and I was curious about how to use it most efficiently.  For the sorts of situations where array.array is useful, it is quite likely that you will know exactly how much storage space you need by the time you are ready to instantiate the array object.  Consequently, I was surprised to see that the array constructor does not include a parameter for specifying an initial capacity.  Since array.array guarantees that all elements are contiguous in memory, then increasing the array's size could require copying all previous array elements, which is relatively expensive.  Specifying an initial size is an obvious way to avoid this inefficiency.

The constructor does have an optional argument, though, that lets you provide an "initializer" -- an object that provides initial values for the array.  Does an initializer improve the array's performance?

To test this, I wrote some simple code that creates an array of unsigned chars with 10,000,000 elements and assigns a value to each element (20, in this case).  The first way to do this is without the initializer.

arr = array.array('B')
for i in xrange(10000000):
    arr.append(20)

The second option is to initialize the array first, using a list.  Note that if all we wanted to do was assign a single value to all array elements, the initializer would be all we needed!  The point of this exercise, though, is to test the performance for situations where we must separately assign each array element its own value.  So just imagine that rather than assigning 20 to each element, we are assigning something unpredictable and more meaningful.

arr = array.array('B', [0]*10000000)
for i in xrange(10000000):
    arr[i] = 20

I ran each of these code snippets 100 times (in Python 2.7), timed each run (using time.clock()), and calculated the mean run time.  The first option, with no initializer, took an average of 1.73 seconds.  The second option took 1.49 seconds, on average.  Using the initializer is about 14% faster.

So using an initializer definitely makes the subsequent write operations faster.  What about using a tuple, rather than a list, as the initializer?  Tuples should have less overhead, since they are read-only.  Using a tuple as the initializer requires only a small change to the code.

arr = array.array('B', (0,)*10000000)
for i in xrange(10000000):
    arr[i] = 20

This version took 1.46 seconds to run, on average.  So using a tuple as an initializer might be slightly faster than a list, but the difference is negligible and perhaps not even statistically significant.  The important lesson here is that if you need to assign a large number of values to an array.array, you should initialize it first.

Even better would be if the Python developers would add an option to the array constructor to specify the initial capacity.  This would avoid any overhead incurred by copying the initializer values to the array.  For now, though, your best option is to use an initializer.

No comments:

Post a Comment