Latent Dirichlet Allocation on Informal Corpora

I’ve been doing a lot of work with topic models over the last year or so, particularly Bayes nets based on Latent Dirichlet Allocation (LDA), so I have a couple of posts planned around the subject.

Quick refresher


LDA models each document in a corpus as a mixture of K latent topics, and associates each distinct word with those topics with various probabilities. Since the latent topics themselves are probability distributions over the words in some fixed vocabulary, the usual way to visualize a topic is to look at the highest-probability words associated with it. Here’s an example of some topics from a recent run I did on Wikipedia (in this particular case the words are stemmed on the way in):

  • Topic #0 [‘white’, ‘black’, ‘red’, ‘blue’, ‘green’, ‘color’, ‘light’, ‘colour’, ‘yellow’, ‘gold’, ‘design’, ‘us’, ‘wear’, ‘flag’, ‘fashion’, ‘cloth’, ‘silver’, ‘arm’, ‘dress’]
  • Topic #4 [‘research’, ‘scienc’, ‘field’, ‘physic’, ‘scientif’, ‘technolog’, ‘scientist’, ‘journal’, ‘electron’, ‘laboratori’, ‘engin’, ‘patent’, ‘comput’, ‘magnet’, ‘mechan’, ‘chemistri’, ‘develop’, ‘institut’, ‘interest’, ‘chemic’]
  • Topic #16 [‘function’, ‘mathemat’, ‘theori’, ‘number’, ‘set’, ‘group’, ‘gener’, ‘mathematician’, ‘algorithm’, ‘problem’, ‘represent’, ‘space’, ‘comput’, ‘defin’, ‘refer’, ‘equat’, ‘analysi’]

The word-topic matrix \( \beta \) has dimensions V×K, where V is the number of words in the fixed vocabulary and K is the number of latent topics chosen (K=100 is common and works well for many data sets, including Wikipedia). Formally, each topic in LDA is a multinomial distribution over the vocabulary (drawn from a Dirichlet prior), and \( \beta_{v,k} \) is the probability that topic k will generate word v.
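
To make the “top words” view concrete, here’s a minimal sketch of how you’d read it off the matrix. I’m assuming beta is a V x K numpy array and vocab is a list mapping row indices back to words; both names are stand-ins, not tied to any particular library:

import numpy

# Sketch only: beta is assumed to be a V x K word-topic matrix of probabilities
# and vocab a list mapping each row index back to its (possibly stemmed) word.
def top_words(beta, vocab, topic, n=20):
    # argsort is ascending, so take the last n indices and reverse them
    top_idx = numpy.argsort(beta[:, topic])[-n:][::-1]
    return [vocab[i] for i in top_idx]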

The Wikipedia topics above are represented by their highest-probability words. As a human, I might say that topic #0 is “fashion”, topic #4 is “science”, and topic #16 is “math”, but it’s difficult to teach a machine to label the topics. In practice that doesn’t always matter: the way one usually uses LDA is as a form of dimensionality reduction in order to accomplish some other task, such as prediction or discovering similarities.
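
As a concrete (if simplified) example of the dimensionality-reduction use case, here’s a sketch of finding similar documents from their topic mixtures. theta is assumed to be a D x K numpy array of document-topic proportions from a fitted model; the name is mine, purely for illustration:

import numpy

# Sketch only: theta is assumed to be a D x K document-topic matrix
# (each row is one document's mixture over the K topics and sums to 1).
def most_similar(theta, doc_id, n=5):
    # cosine similarity between one document's topic mixture and all the others
    norms = numpy.sqrt((theta ** 2).sum(axis=1))
    sims = theta.dot(theta[doc_id]) / (norms * norms[doc_id])
    sims[doc_id] = -1.0   # don't return the document itself
    return numpy.argsort(sims)[-n:][::-1]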

While slightly more complicated to implement than something like SVD, LDA can identify related words even if they do not co-occur in any document.

For the intuition behind LDA, check out Edwin Chen’s blog post (not as math-heavy) or this great lecture by David Blei (more math).

Informal Corpora


Wikipedia can be described as a formal-language corpus. Most of the non-entity text in a particular article will match an English dictionary. There are very few misspellings or slang words (unless the article is describing slang, in which case the slang word may be enclosed in quotes).

Most of the work done on LDA has used formal texts: the original paper (Blei, Ng & Jordan 2003) used the TREC AP data set, and I have seen papers which use everything from NIPS conference abstracts to bills in the US Congress.

However, when analyzing a corpus like the Twitter firehose, it’s important to account for some of the idiosyncrasies of informal language. Here are a few tricks I found helpful for pruning the vocabulary for informal text:

  • Do not use a fixed dictionary, even a comprehensive one. Build the vocabulary from the most frequently occurring non-stopwords in the time interval under consideration. In the case of Twitter, count and rank words, mentions and hashtags separately (individual words are used at a much higher rate than hashtags or mentions, so this guarantees that hashtags won’t get squeezed out by absolute rankings).
  • Rebuild said vocabulary often. There is a time-series form of LDA called Dynamic Topic Models, in which the document-topic probabilities are estimated the same way as in vanilla LDA at each time slice, but the word-topic probabilities form a Markov chain, so you’re estimating \( P(\beta_t \mid \beta_{t-1}) \). This allows the topics to evolve over time while guaranteeing some consistency across the model. It’s basically what I end up doing for Twitter, since topics there can be quite volatile.
  • Try to limit the number of languages considered (or build different models for different languages). While LDA is a statistical model and thus applicable to any bag of words, mixing multiple languages when most of the text is English will tend to, for instance, lump all Japanese-language documents into a single topic. Since I mostly work with English text, one trick I use is to include only documents where 60% or more of the non-entity words appear in an English dictionary. This eliminates most of the non-English material and keeps the vocabulary saner in general without limiting it to a fixed dictionary. For instance, a tweet like “Justin Bieber is so hot #beliebers” would match these criteria because the words “is”, “so” and “hot” all appear in the dictionary, but since the whole tweet is tokenized, “justin”, “bieber” and the hashtag “#beliebers” would also be included in the word counts for this document and thus in the vocabulary.
  • Use a regex like ([\w])\1\1 to exclude any word which repeats a character three or more times. This eliminates things like “zzzzzzzzzz”, which actually show up in enough variations to rank in the top 10k or so. Alternatively, you can collapse runs of three or more repeated characters down to exactly three (or to one or two) instead of dropping the token entirely; see the sketch after this list.
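
Here’s a rough sketch of the kind of filtering and counting described above. Everything here (the helper names, the english_words set, the assumption that tokens arrive pre-tokenized with their # and @ prefixes intact) is illustrative, not from any particular library:

import re
from collections import Counter

REPEATS = re.compile(r'(\w)\1\1')      # any character repeated 3+ times
SQUEEZE = re.compile(r'(\w)\1{2,}')    # a full run of 3 or more of the same character

def keep_token(token):
    # drop tokens like 'zzzzzzzzzz' outright...
    return REPEATS.search(token) is None

def squeeze_token(token):
    # ...or alternatively collapse 'soooooo' down to 'sooo'
    return SQUEEZE.sub(r'\1\1\1', token)

def mostly_english(tokens, english_words, threshold=0.6):
    # keep a document only if >= 60% of its non-entity tokens are in an English word list
    words = [t for t in tokens if not t.startswith(('#', '@'))]
    if not words:
        return False
    hits = sum(1 for t in words if t.lower() in english_words)
    return float(hits) / len(words) >= threshold

def rank_separately(tokens):
    # tally words, hashtags and mentions separately so hashtags aren't
    # squeezed out of the vocabulary by raw word frequency
    words, hashtags, mentions = Counter(), Counter(), Counter()
    for t in tokens:
        if t.startswith('#'):
            hashtags[t] += 1
        elif t.startswith('@'):
            mentions[t] += 1
        else:
            words[t] += 1
    return words, hashtags, mentions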

Results on Twitter


Here are a few of the topics identified for Twitter (this run was on Easter, 4/8/2012):

  • [‘one’, ‘easter’, ‘happy’, ‘think’, ‘direction’, ‘birthday’, ‘read’, ‘igo’, ‘beach’, ‘cousin’, ‘bless’, ‘america’, ‘finished’, ‘felt’, ‘oil’, ‘gbu’, ‘holiday’, ‘theres’, ‘makin’, ‘angel’]
  • [‘day’, ‘listen’, ‘following’, ‘lord’, ‘mal’, ‘album’, ‘beer’, ‘killing’, ‘calling’, ‘works’, ‘patient’, ‘cares’, ‘faith’, ‘history’, ‘risen’, ‘spanish’, ‘pasa’, ‘throat’, ‘headed’, “god’s”]
  • [‘getting’, “i’ve”, ‘next’, ‘single’, ‘lost’, “we’re”, ‘drunk’, ‘fast’, ‘voice’, ‘young’, ‘liked’, ‘point’, ‘different’, ‘ones’, ‘loved’, “wasn’t”, ‘knew’, ‘sounds’, ‘fact’, ‘drake’]
  • [‘love’, ‘somebody’, ‘relationship’, ‘usa’, ‘fighting’, ‘stomach’, ‘#Aquarius’, ‘hug’, ‘network’, ‘th’, ‘block’, ‘bf’, ‘cheating’, ‘court’, ‘noise’, ‘remind’, ‘#Rays’, ‘pleasure’, ‘fights’, ‘commitment’]
  • [‘family’, ‘else’, ‘justin’, ‘meet’, ‘bieber’, ‘beliebers’, ‘visit’, ‘shop’, ‘afraid’, ‘stress’, ‘father’, ‘harrys’, ‘london’, ‘anywhere’, ‘apart’, ‘auto’, ‘meeting’, ‘stole’, ‘grand’, ‘hospital’]
  • [‘two’, ‘lose’, ‘tryna’, ‘used’, ‘weight’, ‘energy’, ‘minute’, ‘boost’, ‘removing’, ‘brown’, ‘levels’, ‘fastest’, ‘dj’, ‘moves’, ‘screen’, ‘toxins’, ‘biggest’, ‘sports’, ‘rush’, ‘smiling’]
  • [‘time’, ‘today’, ‘white’, ‘top’, ‘least’, ‘news’, ‘daily’, ‘stories’, ‘bag’, ‘wonderful’, ‘center’, ‘commercial’, ‘capable’, ‘gun’, ‘report’, ‘cap’, ‘cancer’, ‘wash’, ‘mercy’, ‘bear’]
  • [‘friends’, ‘tell’, ‘wrong’, ‘bored’, ‘wtf’, ‘worst’, ‘women’, ‘trust’, ‘gay’, ‘#Capricorn’, ‘wear’, ‘loyal’, ‘warm’, ‘enemies’, ‘nightmare’, ‘hearted’, ‘hurts’, ‘common’, ‘beauty’, ‘vote’]

I found it interesting that a few astrological sign hashtags tend to make it into the top-probability words of different topics, and that the other high-probability words in those topics are (apparently; I know nothing about astrology and have had to ask) descriptive of the associated Zodiac signs. Horoscopes tend to use the same ten words to describe the various signs, which probably contributes to the strong topical associations.

It also appears that major events (Easter in this case) will form their own topics: the first two topics in the list above can be described as “secular Easter” and “religious Easter” respectively.

I’m still experimenting with the Twitter model. More on variational inference, scaling and use cases for predictive tasks to come…

Numpy gems

I am continually amazed by how awesome numpy can be. While some parts of the API are a little crusty, it gives you a pretty nice set of vectorized operations in a general-purpose language. R and Octave are cool, but when you’re developing online learning algorithms it’s nice when your statistical code lives in the same language that talks to your databases and the rest of your stack.

Let’s say I want to write code to implement a simple matrix multiplication, like this:

\[ y = \theta^TX \]

Here’s one (terribly naive, doubles-only) implementation of a 2D matrix multiplication in ANSI C:

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

double **matrix(int rows, int cols)
{
    double **m;
    int i;
    m = (double **) malloc(rows * sizeof(double *));
    if (m == NULL)
        return NULL;
    for (i = 0; i < rows; i++) {
        /* calloc zero-initializes each row so the accumulation in matrix_multiply starts from 0 */
        m[i] = (double *) calloc(cols, sizeof(double));
        if (m[i] == NULL)
            return NULL;
    }
    return m;
}

void free_matrix(double **matrix, int rows)
{
    int i;
    for (i=0; i < rows; i++)
        free(matrix[i]);
    free(matrix);
}

double **matrix_multiply(double **a, double **b, int a_rows, int a_cols, int b_rows, int b_cols)
{
    int i, j, k;
    double **result;
    assert (a_cols == b_rows);
    result = matrix(a_rows, b_cols);
    if (result == NULL) {
        printf("Could not allocate result matrix");
        exit(1);
    }

    for (i=0; i < a_rows; i++)
        for (j=0; j < b_cols; j++)
            for (k=0; k < b_rows; k++)
                result[i][j] += a[i][k] * b[k][j];
    return result;
}

Now, while writing a simple matrix multiplication in C makes for a good opening interview question, there are far more efficient implementations than anything I could ever hope to write, available in the form of Level-3 BLAS routines that can be accessed through the excellent GNU Scientific Library (GSL). Here’s the same multiplication using GSL:

#include <stdio.h>
#include <gsl/gsl_blas.h>

int main(void)
{
    double a[] = { 1.0, 2.0, 3.0,
                   4.0, 5.0, 6.0 };
    double b[] = { 4.0, 5.0,
                   6.0, 7.0,
                   8.0, 9.0 };

    double c[] = { 0.0, 0.0,
                   0.0, 0.0 };

    gsl_matrix_view A = gsl_matrix_view_array(a, 2, 3);
    gsl_matrix_view B = gsl_matrix_view_array(b, 3, 2);
    gsl_matrix_view C = gsl_matrix_view_array(c, 2, 2);

    gsl_blas_dgemm(CblasNoTrans, CblasNoTrans,
                   1.0, &A.matrix, &B.matrix,
                   0.0, &C.matrix);

    printf("[ %g, %g\n", c[0], c[1]);
    printf("  %g, %g ]\n", c[2], c[3]);

    return 0;
}

These routines are crazy fast. Things like matrix multiplications are automatically decomposed into submatrices small enough that they can fit in the L1 cache. But they’re kind of cumbersome and low-level. Scientists are usually more interested in thinking about the algebra and the data than the primitive operations.

Luckily we have numpy/scipy, which allow you to do some pretty intense numerical computations in Python. Here’s what the above GSL code would look like in numpy:

import numpy

A = numpy.array([[1, 2, 3], [4, 5, 6]], dtype=numpy.float64)
B = numpy.array([[4,5], [6,7], [8,9]], dtype=numpy.float64)
C = A.dot(B)
print C

Much simpler, more algebraic-looking code, and it’s often almost as fast as the C version (numpy.dot and a few other methods actually make calls to BLAS routines if a BLAS library is available, i.e. if you’re a cool kid with a Fortran compiler!). I haven’t seen a proper benchmark on this, but numpy is among the faster high-level math languages.

I do quite a bit of work with numpy, but many times in the past I’ve found myself having to rely on regular Python data structures like sets and such when I’d really like a nice vectorized way to work with my strongly-typed data. Here are a few pieces of numpy that I’ve found to be particularly helpful:

High-level interface for strongly typed sets (1d arrays - if you don’t know whether all elements are unique, you can make a set with numpy.unique):

import numpy
a = numpy.arange(10)
b = numpy.arange(5, 15)
numpy.setdiff1d(a,b)       # array([0, 1, 2, 3, 4]) 
numpy.union1d(a,b)         # array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
numpy.intersect1d(a,b)     # array([5, 6, 7, 8, 9])

Building arrays incrementally from a Python generator:

import numpy
# Imagine these come from a db cursor or something
coordinates = [(1,2,3), (4,5,6), (7,8,9)]
def my_gen(some_tuple):
    for x, y, z in some_tuple:
        yield x, y, z

a = numpy.fromiter(my_gen(coordinates), dtype=[('x', 'l'), ('y', 'l'), ('z', 'l')])

"""

This creates a numpy record array, which looks like:

array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], 
      dtype=[('x', '<i8'), ('y', '<i8'), ('z', '<i8')])

>>> a['x']
array([1, 4, 7])

"""

I now retract my previous comments about the array module, as numpy.fromiter beats the pants off of it. In my previous post, I found that array.append, while being slower than appending to a Python list, used quite a bit less memory for arrays of a single type.

numpy.fromiter actually appears to be more efficient than Python’s list append (maybe because it saves a Python-level for loop). Here’s the same benchmark I ran previously:

python -m timeit -n1 -r100 "a = []" "for i in xrange(1000000):" "    a.append(i)"        
1 loops, best of 100: 93.1 msec per loop

python -m timeit -n1 -r100 -s "import numpy" "a = numpy.fromiter(xrange(1000000), dtype=numpy.int64)"
1 loops, best of 100: 74.3 msec per loop

python -m timeit -n1 -r100 -s "import numpy" "def gen():" "    for i in xrange(1000000):" "        yield i"  "l = []" "for i in gen():" "    l.append(i)"
1 loops, best of 100: 247 msec per loop

python -m timeit -n1 -r100 -s "import numpy" "def gen():" "    for i in xrange(1000000):" "        yield i"  "a = numpy.fromiter(gen(), dtype=numpy.int64)"
1 loops, best of 100: 123 msec per loop

What this means is that building a sparse matrix for some machine learning or statistical problem, using data that came in from your web server or database as an iterator, can be quite efficiently marshaled into numpy arrays without having to preallocate. Pretty nifty!

Discovering this has led me to build some fun generators, especially given the fact that numpy has a record array (I used one in the previous example as well) which lets you build complex data types that are stored column-wise as contiguous arrays - perfect for sparse coordinate matrices.
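
Since record arrays pair so nicely with coordinate-format sparse matrices, here’s a sketch of that pattern. The triples, field names and shape below are made up for illustration:

import numpy
from scipy.sparse import coo_matrix

# Pretend these (row, col, value) triples are streaming in from a cursor or service
triples = [(0, 3, 1.0), (0, 7, 2.0), (2, 1, 5.0)]

# Marshal the stream into a single contiguous record array without preallocating
entries = numpy.fromiter(iter(triples),
                         dtype=[('i', numpy.int64), ('j', numpy.int64), ('v', numpy.float64)])

# The named columns are views, not copies, so building the sparse matrix is cheap
m = coo_matrix((entries['v'], (entries['i'], entries['j'])), shape=(3, 8))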

Here’s another example of fromiter that comes up sometimes in distributed systems. Let’s say I have a bunch of arrays of the same length that are coming in as byte strings, and I want to stack them to build a matrix. I can do something like this:

# Even more fun
def array_gen(some_strings):
    for s in some_strings:
        yield numpy.fromstring(s, dtype=numpy.int)

# Let's say I know the length of all the arrays coming out of my generator
# and I want to build a matrix

K = 10
M = 5

stringified = numpy.arange(K).tostring()
str_gen = (stringified for i in xrange(M))    # This is just a generator object, hasn't done anything yet
m = numpy.fromiter(array_gen(str_gen), dtype=[('cool', numpy.int, K)])

"""
The array would look like this:
>>> m

array([([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],),
       ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],),
       ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],),
       ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],),
       ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],)], 
      dtype=[('cool', '<i8', 10)])

But you can also access the single matrix through the 'cool' index without making copies

>>> m['cool']

array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
"""

Ok, that’s it for the night. There’s lots more to say about numpy but I’ll save that for another post.

This week’s underused Python module: array

I do a lot of work with sparse matrices. A common example is a term-document matrix where \( X_{i,j} \) is the number of times word \( j \) occurs in document \( i \). Representing this as a dense matrix would be silly since most of the values are 0. Hence we have sparse matrices, available to the Python math/science community in the form of scipy.sparse.

I usually use the CSR (compressed sparse row) format for term-document matrices as well as graphs (also a great candidate for this representation). When building a sparse matrix, I often end up with code that looks like this:

import numpy
from scipy.sparse import csr_matrix
from collections import Counter
row = [0]
col = []
data = []
for doc in docs:
    features = tokenize(doc)
    features_to_ids = get_or_create_feature_ids(features)
    counts = Counter(features)
    # the CSR indptr array holds cumulative offsets, not per-row counts
    row.append(row[-1] + len(counts))
    for feature, count in counts.iteritems():
        col.append(features_to_ids[feature])
        data.append(count)

row = numpy.array(row)
col = numpy.array(col)
data = numpy.array(data)
matrix = csr_matrix((data, col, row), shape=(len(row) - 1, col.max() + 1))

P.S. if this doesn’t make any sense to you, it might be worthwhile to read up on sparse matrices.

Anyway, list.append works fine and has a quite efficient C implementation, but it’s pretty memory intensive to create a bunch of PyInts (the C type which underlies a Python number), each with its own reference count, when all I want at the end of the day is a typed array of integers.

Numpy arrays are fixed length, so in lieu of doing the allocation and reallocation myself (because I’m lazy), I’ve long settled for building incremental lists as the best option. For most cases this is OK.
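
For reference, the manual alternative looks roughly like this sketch (the doubling strategy and the append_grow helper are mine, purely for illustration):

import numpy

# Sketch of growing a numpy buffer by hand with a doubling strategy.
def append_grow(buf, used, value):
    if used == len(buf):
        # numpy.resize returns a copy; this reallocation bookkeeping is what I'd rather avoid
        buf = numpy.resize(buf, 2 * len(buf))
    buf[used] = value
    return buf, used + 1

buf, used = numpy.zeros(4, dtype=numpy.int64), 0
for i in xrange(10):
    buf, used = append_grow(buf, used, i)
data = buf[:used]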

But today I remembered there’s a handy little module in the standard library called array, which can be dynamically resized and has the same API as list (including append, extend, etc.) but requires an explicit type declaration, like so:

from array import array

row = array('l', [0])
col = array('l')
data = array('l')

And that’s it! An append method with no Python object overhead.

I decided to benchmark list vs array to see if array.append was as efficient. This is very informal but I’m using CPython 2.7 on an Ubuntu 11.10 EC2 Large instance.

python -m timeit -n1 -r100 -s "from array import array" "a = array('l')" "for i in xrange(1000000):" "    a.append(i)" 
1 loops, best of 100: 249 msec per loop
python -m timeit -n1 -r100 "a = []" "for i in xrange(1000000):" "    a.append(i)"         1 loops, best of 100: 93.1 msec per loop
There’s also numpy.append(a, val), which is horribly slow as it returns a copy, so I didn’t use it in my benchmarks. It can be nice for appending in batches.
Next, let’s look at memory consumption during the tests. Here’s the output of top for the list tests:
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11010 ubuntu    20   0 68036  36m 2284 R    102 0.5   0:10.43 python   
and here’s the same output during the array tests:
   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
9474 ubuntu    20   0 36504  12m 2292 R  100  0.2   1:09.64 python

So, ~2.5x slower on append but uses ~2.5x less memory.

But that wasn’t the end of the story…

I tried the same test on larger sizes of the array and found the following relationship as the size of the array increases:

  • the ratio of memory consumed by a list vs. an array increases with the number of elements being added, although it appears to start converging a bit as the number of elements approaches the 50M range
  • the ratio of wall clock time taken by list.append vs. array.append remains constant as the number of elements increases

For example, on the 10M element test, list takes up 4.2% of RAM at its peak while array takes up 0.7%.

This is pretty awesome news! For a constant factor performance penalty, I can build typed arrays incrementally at about 1/6 the cost of a Python list.

Recommended Wikipedia schema changes

As a data scientist, I work with Wikipedia a lot. So I’ve been building a directed acyclic graph (modeling article links as edges) in order to try out some topic clustering ideas. To get the in-article links, I just parse the XML and regex the articles for various types of Wiki markup, but there are certain links (lists, etc., which form the ontology of Wikipedia) that can only really be found in Wikimedia’s MySQL dump, in the categorylinks table. Its schema definition looks like this:

CREATE TABLE `categorylinks` (
    `cl_from` int(8) unsigned NOT NULL DEFAULT '0',
    `cl_to` varbinary(255) NOT NULL DEFAULT '',
    `cl_sortkey` varbinary(230) NOT NULL DEFAULT '',
    `cl_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    `cl_sortkey_prefix` varbinary(255) NOT NULL DEFAULT '',
    `cl_collation` varbinary(32) NOT NULL DEFAULT '',
    `cl_type` enum('page','subcat','file') NOT NULL DEFAULT 'page',
    UNIQUE KEY `cl_from` (`cl_from`,`cl_to`),
    KEY `cl_timestamp` (`cl_to`,`cl_timestamp`),
    KEY `cl_collation` (`cl_collation`),
    KEY `cl_sortkey` (`cl_to`,`cl_type`,`cl_sortkey`,`cl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary

When using InnoDB, it’s extremely important to mind your primary key, as InnoDB stores rows on disk in primary key order (the table itself is a clustered index). Here are your options:

  1. You specify the primary key - hopefully you don’t do something dumb like use a VARCHAR. In general, if you use anything other than an AUTO_INCREMENT column you’re probably in for some heartbreak, because you won’t be writing sequentially to disk.
  2. You don’t specify any primary OR unique keys on the table - InnoDB will generate a hidden, monotonically increasing row ID for you behind the scenes. Not a bad option.
  3. You don’t specify a primary key BUT you do specify a UNIQUE constraint - InnoDB will treat the first UNIQUE key whose columns are all NOT NULL as the primary key.

If you’ve ever imported a Wikipedia dump into MySQL and wondered why it takes forever, it’s not that MySQL is horribly slow or that the data is incredibly large; it’s the way they’ve written the schema. The primary key for that particular table ends up being an integer PLUS a VARBINARY(255)… To illustrate, here’s what INSERT performance looked like with the schema above vs. simply adding an auto_increment primary key to the table:

  • Their schema: 500 inserts/s decreasing over time, disk space: over 9 GB
  • My schema: 100,000+ inserts/s, same performance over time, disk space: 700MB

Prior to the InnoDB plugin, MySQL would store the primary key with each secondary index, which means a 32-bit integer will add 4 bytes to all of your other indexes and a 64-bit integer will add 8 bytes. I’m using MySQL 5.5, so I don’t have to worry about it, but based on the dump headers, the folks at Wikipedia are on 5.1.47 (I think they added the InnoDB plugin in 5.1.49). I can’t imagine how slow this must be for them. Inno also doesn’t compress the primary key, so the disk space before I changed it to an AUTO_INCREMENT was over 9GB, now down to 700MB or so.

Conclusion: mind your PKs!

Python ETL Black Magic (ported from old blog)

Most of my work these days is in Python, and lately I’ve been building a lot of classification/prediction models (multinomial Bayesian, decision trees and latent semantic indexing models). Usually when you’re working in Python on the web, you’re using a web framework (hopefully not Django, but probably), and you don’t really need to know much core Python beyond simple dicts and lists and maybe a few of the other built-ins. But when it comes to marshaling data in and out of training tuples and scipy matrices, extracting TF-IDF weights, or dealing with data from disparate sources on the web, it helps to know the finer points of the language and standard library.

In the database/reporting world, this kind of data cleansing is known as ETL (extract-transform-load). In my case the actual consumer of the data is an algorithm rather than a database, but the same tools apply.

Here are some lesser-known features of the Python standard library (not even delving into any third-party libraries in this post) that I find most useful.

  1. Nested defaultdict: really useful when building categorized lists, counts, sets, etc:
    from collections import defaultdict
    d = defaultdict(lambda: defaultdict(set))
    d['12-01-01']['tweets'].add(123859385)
    >>> print d['12-01-01'].keys()
    ['tweets']
    >>> print len(d['12-01-01']['tweets'])
    1
  2. namedtuple, great for building a transfer object without a transfer object
    from collections import namedtuple
    TypedId = namedtuple('TypedId', 'type, id')
    entity = TypedId('foo', 1)
    >>> print entity.type, entity.id
    foo 1
  3. Everything in the itertools module makes me happy, but groupby is especially useful
    from itertools import groupby
    from operator import itemgetter
    my_collection = [('al', 1), ('al', 2), ('bob', 2), ('bob', 4), ('carla', 5)]
    # Attention: this depends on the keys being sorted beforehand
    for key, grp in groupby(my_collection, key=itemgetter(0)):
        print key, len(list(grp))
    # Output:
    # al 2
    # bob 2
    # carla 1
  4. The deque (double-ended queue), great for building a FIFO queue for any algorithm involving breadth-first search (BFS)
    from collections import deque
    # Directed graph where 0 -> [1,2], 1 -> [0], 2 -> [1]
    graph = [[1,2], [0], [1]]
    seen = set()
    queue = deque([0])
    while queue:
        node = queue.popleft()
        if node not in seen:
            print 'On node', node
            seen.add(node)
        unseen_neighbors = set(graph[node]) - seen
        for neighbor in unseen_neighbors:
            print 'scheduling neighbor %s of %s' % (neighbor, node)
            queue.append(neighbor)
    # Output
    On node 0
    scheduling neighbor 1 of 0
    scheduling neighbor 2 of 0
    On node 1
    On node 2
  5. dict.fromkeys: awesome for initializing a non-default dictionary
    d = dict.fromkeys(['a','b','c','d','e'], 0)
    >>> d['a']
    0
    >>> d['f']
    KeyError: 'f'
  6. Partials, argument currying in Python (often useful for working with instancemethods):
    from functools import partial
    import random
    number_between_one_and = partial(random.randint, 1)
    >>> number_between_one_and(10)
    4
  7. Use sparingly…the metaclass, nice for keeping statistics about subclasses, etc.
    registry = {}
    class MyMeta(type):
        def __init__(cls, name, bases, dict):
            registry[name] = cls
    class MyBase(object):
        __metaclass__ = MyMeta
    class Foo(MyBase):
        pass
    class Bar(MyBase):
        pass
    >>> registry
    {'Bar': <class '__main__.Bar'>, 'MyBase': <class '__main__.MyBase'>, 'Foo': <class '__main__.Foo'>}