Showing posts tagged python

Numpy is my homeboy

I am continually amazed by how awesome numpy can be. While some parts of the API are a little crusty, it does give you a pretty nice set of vectorized operations (read: in a general purpose language - R and Octave are cool, but when you’re developing online learning algorithms, it’s nice when your statistical code is in the same language that talks to your databases, etc.)

Let’s say I want to write code to implement a simple matrix multiplication, like this:

\[ y = \theta^TX \]

Here’s one (terribly naive, doubles-only) implementation of a 2D matrix multiplication in ANSI C:

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

double **matrix(int rows, int cols)
{
    double **m;
    int i;
    m = (double **) malloc(rows * sizeof(double *));
    for (i=0; i < rows; i++) {
        m[i] = (double *) malloc(cols * sizeof(double));
        if (m[i] == NULL)
            return NULL;
    }
    return m;
}

void free_matrix(double **matrix, int rows)
{
    int i;
    for (i=0; i < rows; i++)
        free(matrix[i]);
    free(matrix);
}

double **matrix_multiply(double **a, double **b, int a_rows, int a_cols, int b_rows, int b_cols)
{
    assert (a_cols == b_rows);
    int i, j, k;
    double **result = matrix(a_rows, b_cols);
    if (result == NULL) {
        printf("Could not allocate result matrix");
        exit(1);
    }

    for (i=0; i < a_rows; i++)
        for (j=0; j < b_cols; j++)
            for (k=0; k < b_rows; k++)
                result[i][j] += a[i][k] * b[k][j];
    return result;

}

Now, while writing a simple matrix multiplication in C makes for a good opening interview question, there are far more efficient implementations than anything I could ever hope to write available in the form of Level-3 BLAS routines, which can be accessed through the excellent GNU Scientific Library (GSL). Here’s that code:

#include <stdio.h>
#include <gsl/gsl_blas.h>

int main(void)
{
    double a[] = { 1.0, 2.0, 3.0,
                   4.0, 5.0, 6.0 };
    double b[] = { 4.0, 5.0,
                   6.0, 7.0,
                   8.0, 9.0 };

    double c[] = { 0.0, 0.0,
                   0.0, 0.0 };

    gsl_matrix_view A = gsl_matrix_view_array(a, 2, 3);
    gsl_matrix_view B = gsl_matrix_view_array(b, 3, 2);
    gsl_matrix_view C = gsl_matrix_view_array(c, 2, 2);

    gsl_blas_dgemm(CblasNoTrans, CblasNoTrans,
                   1.0, &A.matrix, &B.matrix,
                   0.0, &C.matrix);

    printf("[ %g, %g\n", c[0], c[1]);
    printf("  %g, %g ]\n", c[2], c[3]);

    return 0;
}

These routines are crazy fast. Things like matrix multiplications are automatically decomposed into submatrices small enough that they can fit in the L1 cache. But they’re kind of cumbersome and low-level. Scientists are usually more interested in thinking about the algebra and the data than the primitive operations.

Luckily we have numpy/scipy, which allow you to do some pretty intense numerical computations in Python. Here’s what the above GSL code would look like in numpy:

A = numpy.array([[1, 2, 3], [4, 5, 6]], dtype=numpy.float64)
B = numpy.array([[4,5], [6,7], [8,9]], dtype=numpy.float64)
C = A.dot(B)
print C

Much simpler, more algebraic-looking code, and it’s often almost as fast as the C version (numpy.dot and a few other methods are actually making calls to BLAS routines if it’s available i.e. if you’re a cool kid with a Fortran compiler!) I haven’t seen a proper benchmark on this, but numpy is among the faster high-level math languages.

I do quite a bit of work with numpy, but many times in the past I’ve found myself having to rely on regular Python data structures like sets and such when I’d really like a nice vectorized way to work with my strongly-typed data. Here are a few pieces of numpy that I’ve found to be particularly helpful:

High-level interface for strongly typed sets (1d arrays - if you don’t know whether all elements are unique, you can make a set with numpy.unique):

import numpy
a = numpy.arange(10)
b = numpy.arange(5, 15)
numpy.setdiff1d(a,b)       # array([0, 1, 2, 3, 4]) 
numpy.union1d(a,b)         # array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
numpy.intersect1d(a,b)     # array([5, 6, 7, 8, 9])

Building arrays incrementally from a Python generator:

import numpy
# Imagine these come from a db cursor or something
coordinates = [(1,2,3), (4,5,6), (7,8,9)]
def my_gen(some_tuple):
    for x, y, z in some_tuple:
        yield x, y, z

a = numpy.fromiter(my_gen(coordinates), dtype=[('x', 'l'), ('y', 'l'), ('z', 'l')])

"""

This creates a numpy record array, which looks like:

array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], 
      dtype=[('x', '<i8'), ('y', '<i8'), ('z', '<i8')])

>>> a['x']
array([1, 4, 7])

"""

I now retract my previous comments about the array module, as numpy.fromiter beats the pants off of it. In my previous post, I found that array.append, while being slower than appending to a Python list, used quite a bit less memory for arrays of a single type.

numpy.fromiter actually appears to be more efficient than Python’s list append (maybe because it saves a Python for loop). Here’s the same benchmark I ran 

python -m timeit -n1 -r100 "a = []" "for i in xrange(1000000):" "    a.append(i)"        
1 loops, best of 100: 93.1 msec per loop

python -m timeit -n1 -r100 -s "import numpy" "a = numpy.fromiter(xrange(1000000), dtype=numpy.int64)"
1 loops, best of 100: 74.3 msec per loop

python -m timeit -n1 -r100 -s "import numpy" "def gen():" "    for i in xrange(1000000):" "        yield i"  "l = []" "for i in gen():" "    l.append(i)"
1 loops, best of 100: 247 msec per loop

python -m timeit -n1 -r100 -s "import numpy" "def gen():" "    for i in xrange(1000000):" "        yield i"  "a = numpy.fromiter(gen(), dtype=numpy.int64)"
1 loops, best of 100: 123 msec per loop

What this means is that building a sparse matrix for some machine learning or statistical problem, using data that came in from your web server or database as an iterator, can be quite efficiently marshaled into numpy arrays without having to preallocate. Pretty nifty!

Discovering this has led me to build some fun generators, especially given the fact that numpy has a record array (I used one in the previous example as well) which lets you build complex data types that are stored column-wise as contiguous arrays - perfect for sparse coordinate matrices.

Here’s another example of fromiter that comes up sometimes in distributed systems. Let’s say I have a bunch of arrays of the same length that are coming in as byte strings, and I want to stack them to build a matrix. I can do something like this:

# Even more fun
def array_gen(some_strings):
    for s in some_strings:
        yield numpy.fromstring(s, dtype=numpy.int)

# Let's say I know the length of all the arrays coming out of my generator
# and I want to build a matrix

K = 10
M = 5

stringified = numpy.arange(K).tostring()
str_gen = (stringified for i in xrange(M))    # This is just a generator object, hasn't done anything yet
m = numpy.fromiter(array_gen(str_gen), dtype=[('cool', numpy.int, K)])

"""
The array would look like this:
>>> m

array([([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],),
       ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],),
       ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],),
       ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],),
       ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],)], 
      dtype=[('cool', ''<i8', 10)])

But you can also access the single matrix through the 'cool' index without making copies

>>> m['cool']

array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
"""

Ok, that’s it for the night. There’s lots more to say about numpy but I’ll save that for another post.

This week’s underused Python module: array

I do a lot of work with sparse matrices. A common example is a term-document matrix where \( X_{i,j} = \#\ of\ times\ word\ W_j\ occurs\ in\ document\ D_i \). Representing this as a dense matrix would be silly since most of the values are 0. Hence we have sparse matrices, available to the Python math/science community in the form of scipy.sparse.

I usually use the CSR (compressed sparse row) format for term-document matrices as well as graphs (also a great candidate for this representation). When building a sparse matrix, I often end up with code that looks like this:

import numpy
from scipy.sparse import csr_matrix
from collections import Counter
row  = [0]
col = []
data = []
for doc in docs:
    features = tokenize(doc)
    features_to_ids = get_or_create_feature_ids(features)
    counts = Counter(features)
    row.append(len(counts))
    for feature, count in counts.iteritems():
        col_id.append(features_to_ids[feature])
        data.append(count)
row = numpy.array(row) col = numpy.array(col) data = numpy.array(data)
matrix = csr_matrix((data, col, row), shape=(len(row)-1, col.max()+1) )

P.S. if this doesn’t make any sense to you, it might be worthwhile to read up on sparse matrices.

Anyway, list.append works fine and has a quite efficient C implementation BUT it’s pretty memory intensive to create a bunch of PyInts (the C type which underlies a Python number) with reference counts when all I want at the end of the day is a typed array of integers.

Numpy arrays are fixed length. So in lieu of doing the allocation and reallocation myself (because I’m lazy), I’ve long settled for building incremental lists as the best option. For most cases this is ok

But today I remembered there’s a handy little module in the standard library called array which can be dynamically resized, has the same API as list (including append, extend, etc.) but requires an explicit type declaration like so:

from array import array

row = array('l', [0])
col = array('l')
data = array('l')

And that’s it! An append method with no Python object overhead.

I decided to benchmark list vs array to see if array.append was as efficient. This is very informal but I’m using CPython 2.7 on an Ubuntu 11.10 EC2 Large instance.

python -m timeit -n1 -r100 -s "from array import array" "a = array('l')" "for i in xrange(1000000):" "    a.append(i)" 
1 loops, best of 100: 249 msec per loop
python -m timeit -n1 -r100 "a = []" "for i in xrange(1000000):" "    a.append(i)"         1 loops, best of 100: 93.1 msec per loop
There’s also numpy.append(a, val), which is horribly slow as it returns a copy, so I didn’t use it in my benchmarks. It can be nice for appending in batches.
Next, let’s look at memory consumption during the tests. Here’s the output of top for the list tests:
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11010 ubuntu    20   0 68036  36m 2284 R    102 0.5   0:10.43 python   
and here’s the same output during the array tests:
   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
9474 ubuntu    20   0 36504  12m 2292 R  100  0.2   1:09.64 python

So, ~2.5x slower on append but uses ~2.5x less memory.

But that wasn’t the end of the story…

I tried the same test on larger sizes of the array and found the following relationship as the size of the array increases:

  • the ratio of memory consumed by a list vs. an array  increases with the number of elements being added to the array, although it appears to start converging a bit as the number of elements approaches the 50M range
  • the ratio of wall clock time taken for list.append vs. array.append remains constant when you increase the number of elements in the array

For example, on the 10M element test, list takes up 4.2% of RAM at its peak while array takes up 0.7%.

This is pretty awesome news! For a constant factor performance penalty, I can build typed arrays incrementally at about 1/6 the cost of a Python list.

Python ETL Black Magic (ported from old blog)

Most of my work these days is in Python and lately I’ve been building a lot of classification/prediction models (multinomial Bayesian, decision trees and latent semantic indexing models). Usually when you’re working in Python on the web, you’re using a web framework - hopefully not Django, but probably - and you don’t really need to know that much core Python beyond simple dicts and lists, maybe a few of the other built-ins, but when it comes to marshaling data in and out of training tuples and scipy matrices, extracting TF-IDF weights, or dealing with data from disparate sources on the web, it helps to know the finer points of the language and standard library.

In the database/reporting world, this kind of data cleansing is known as ETL (extract-transform-load). In my case the actual consumer of the data is an algorithm rather than a database, but the same tools apply.

Here are some lesser-known features of the Python standard library (not even delving into any third-party libraries in this post) that I find most useful.

  1. Nested defaultdict: really useful when building categorized lists, counts, sets, etc:
    from collections import defaultdict
    d = defaultdict(lambda: defaultdict(set))
    d['12-01-01']['tweets'].add(123859385)
    >>> print d['12-01-01'].keys()
    ['tweets']
    >>> print len(d['12-01-01']['tweets'])
    1
  2. namedtuple, great for building a transfer object without a transfer object
    from collections import namedtuple
    TypeId = namedtuple('TypedId', 'type, id')
    entity = TypeId('foo', 1)
    >>> print entity.type, entity.id foo 1
  3. Everything in the itertools module makes me happy, but groupby is especially useful
    from itertools import groupby
    from operator import itemgetter
    my_collection = [('al', 1), ('al', 2), ('bob', 2), ('bob', 4), ('carla', 5)]
    # Attention, this depends on keys being sorted beforehand
    for key, grp in groupby(my_collection, key=itemgetter(0)):     
    print key, len(list(grp)) # Output al 2 bob 2 carla 1
  4. The deque (double-ended queue), great for building a FIFO queue for any algorithm involving breadth-first search (BFS)
    from collections import deque
    # Directed acyclic graph where 0 -> [1,2], 1 -> [0], 2 -> [1]
    graph = [[1,2], [0], [1]]
    seen = set()
    queue = deque([0])
    while queue:
        node = queue.popleft()
        if node not in seen:
            print 'On node', node
            seen.add(node)
        unseen_neighbors = set(graph[node]) - seen
        for neighbor in unseen_neighbors:
            print 'scheduling neighbor %s of %s' % (neighbor, node)
            queue.append(neighbor)
    # Output
    On node 0
    scheduling neighbor 1 of 0
    scheduling neighbor 2 of 0
    On node 1
    On node 2
  5. dict.fromkeys: awesome for initializing a non-default dictionary
    d = dict.fromkeys(['a','b','c','d','e'], 0)
    >>>d['a']
    0
    >>>d['f']
    KeyError: 'f'
  6. Partials, argument currying in Python (often useful for working with instancemethods):
    from functools import partial
    import random
    number_between_one_and = partial(random.randint, 1)
    >>> number_between_one_and(10)
    4
  7. Use sparingly…the metaclass, nice for keeping statistics about subclasses, etc.
    registry = {}
    class MyMeta(type):
       def __init__(cls, name, bases, dict):
           registry[name] = cls
    class MyBase(object):
       __metaclass__ = MyMeta
    class Foo(MyBase):
    pass
    class Bar(MyBase):
       pass
    >>> registry
    {'Bar': <class '__main__.Bar'>, 'MyBase': <class '__main__.MyBase'>, 'Foo': <class '__main__.Foo'>}