Latent Dirichlet Allocation on Informal Corpora

I’ve been doing a lot of work with topic models over the last year or so, particularly Bayes nets based on Latent Dirichlet Allocation (LDA), so I have a couple of posts planned around the subject.

Quick refresher


LDA models each document in a corpus as a mixture of K latent topics, and each topic as a probability distribution over the words in some fixed vocabulary. The usual way to visualize a topic is therefore to look at its highest-probability words. Here’s an example of some topics from a recent run I did on Wikipedia (in this particular case the words are stemmed on the way in):

  • Topic #0 [‘white’, ‘black’, ‘red’, ‘blue’, ‘green’, ‘color’, ‘light’, ‘colour’, ‘yellow’, ‘gold’, ‘design’, ‘us’, ‘wear’, ‘flag’, ‘fashion’, ‘cloth’, ‘silver’, ‘arm’, ‘dress’]
  • Topic #4 [‘research’, ‘scienc’, ‘field’, ‘physic’, ‘scientif’, ‘technolog’, ‘scientist’, ‘journal’, ‘electron’, ‘laboratori’, ‘engin’, ‘patent’, ‘comput’, ‘magnet’, ‘mechan’, ‘chemistri’, ‘develop’, ‘institut’, ‘interest’, ‘chemic’]
  • Topic #16 [‘function’, ‘mathemat’, ‘theori’, ‘number’, ‘set’, ‘group’, ‘gener’, ‘mathematician’, ‘algorithm’, ‘problem’, ‘represent’, ‘space’, ‘comput’, ‘defin’, ‘refer’, ‘equat’, ‘analysi’]

The word-topic matrix \( \beta \) has dimensions \( V \times K \), where V is the number of words in some fixed vocabulary and K is the number of latent topics chosen (K=100 is common and works well for many data sets, including Wikipedia). Formally, each topic in LDA is a categorical (multinomial) distribution over the vocabulary, itself drawn from a Dirichlet prior, and \( \beta_{v,k} \) is the probability that topic k will generate word v.
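
For reference, the generative story behind these quantities is usually written roughly as follows. This is a sketch of the smoothed variant of LDA; the symbols \( \alpha \), \( \eta \), \( \theta_d \), \( z_{d,n} \) and \( w_{d,n} \) are not used elsewhere in this post and are introduced here only for illustration.

\[
\begin{aligned}
\beta_{\cdot,k} &\sim \mathrm{Dirichlet}(\eta) && \text{for each topic } k = 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{for each document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) && \text{topic assignment of the } n\text{-th word in } d \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Categorical}(\beta_{\cdot,\, z_{d,n}}) && \text{the observed word}
\end{aligned}
\]

Under this notation each column of \( \beta \) sums to 1, and \( P(w_{d,n} = v \mid z_{d,n} = k) = \beta_{v,k} \).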

The Wikipedia topics above are represented by their highest-probability words (up to 20 each). As a human, I might say that topic #0 is “fashion”, topic #4 is “science” and topic #16 is “math”, but it’s difficult to teach a machine to label the topics. In practice that doesn’t always matter. The way one usually uses LDA is as a form of dimensionality reduction in order to accomplish some other task, such as prediction or discovering similarities.
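
To make those two uses concrete, here is a minimal sketch in plain Python. It assumes you already have a fitted word-topic matrix and per-document topic proportions; the tiny `beta`, `vocab` and `theta` values below are made up for illustration, and `top_words` and `cosine` are hypothetical helper names, not part of any library.

```python
import math

def top_words(beta, vocab, n=20):
    """Return the n highest-probability words per topic.

    beta[v][k] is the probability that topic k generates word v (V x K).
    """
    num_topics = len(beta[0])
    tops = []
    for k in range(num_topics):
        ranked = sorted(range(len(vocab)), key=lambda v: beta[v][k], reverse=True)
        tops.append([vocab[v] for v in ranked[:n]])
    return tops

def cosine(a, b):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy word-topic matrix (made up): 5 words, 2 topics, each column sums to 1.
vocab = ["red", "blue", "theori", "equat", "color"]
beta = [
    [0.40, 0.02],
    [0.30, 0.03],
    [0.05, 0.50],
    [0.05, 0.40],
    [0.20, 0.05],
]
print(top_words(beta, vocab, n=3))
# [['red', 'blue', 'color'], ['theori', 'equat', 'color']]

# Toy document-topic proportions (made up): 3 documents in K = 2 dimensions.
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(cosine(theta[0], theta[1]) > cosine(theta[0], theta[2]))  # True
```

Each document is now a K-dimensional vector instead of a V-dimensional bag of words, so similarity (or a downstream classifier) works in topic space. Cosine similarity is just one choice here; a divergence between the two distributions would be another.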

While slightly more complicated to implement than something like SVD, LDA can identify related words even if they never co-occur in the same document, as long as they tend to appear alongside the same other words.

For the intuition behind LDA, check out Edwin Chen’s blog post (not as math-heavy) or this great lecture by David Blei (more math).

Informal Corpora


Wikipedia can be described as a formal-language corpus. Most of the non-entity text in a particular article will match an English dictionary. There are very few misspellings or slang words (unless the article is describing slang in which case the slang word may be enclosed in quotes).

Most of the work done on LDA has used formal texts: in the original paper (Blei, Ng & Jordan 2003), the TREC AP data set was used and I have seen papers which use everything from NIPS conference abstracts to bills in the US Congress.

However, when analyzing a corpus like the Twitter firehose, it’s important to account for some of the idiosyncrasies of informal language. Here are a few tricks I found helpful for pruning the vocabulary for informal text:

  • Do not use a fixed dictionary, even a comprehensive one. Build the vocabulary from the most frequently occurring non-stopwords in the time interval considered. In the case of Twitter, count and rank words, mentions and hashtags separately (individual words are used at a much higher rate than hashtags or mentions, so this guarantees that hashtags won’t get squeezed out by absolute rankings).
  • Rebuild the vocabulary often. There is a time-series form of LDA called Dynamic Topic Models. In this model the document-topic probabilities are estimated the same way as in vanilla LDA at each time slice, but the word-topic probabilities form a Markov chain, so that you’re estimating \( P(\beta_t \mid \beta_{t-1}) \). This allows the topics to evolve over time while guaranteeing some consistency across the model. This is basically what I end up doing for Twitter, since topics can be quite volatile.
  • Try to limit the number of languages considered (or build different models for different languages). While LDA is a statistical model and thus applicable to any bag of words, considering multiple languages when most of the text is English will tend to, for instance, lump all Japanese-language documents into a single topic. Since I mostly work with English text, one trick I use is to include only documents where 60% or more of the non-entity words are in an English dictionary. This eliminates most of the non-English stuff and makes the vocab more sane in general, without limiting the vocab to a fixed dictionary. For instance, a tweet like “Justin Bieber is so hot #beliebers” would match these criteria because the words “is”, “so” and “hot” all appear in the dictionary, but since the whole tweet is tokenized, “justin”, “bieber” and the hashtag “#beliebers” would also be included in the word counts for this document and thus in the vocabulary.
  • Use a regex like ([\w])\1\1 to exclude any word which repeats a character 3 or more times in a row. This eliminates things like “zzzzzzzzzz”, which will actually appear in various combinations enough times to rank in the top 10k or so. Alternatively, instead of dropping such words, you can collapse runs of more than 3 repeated characters to exactly 3, or to 1 or 2.
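
The pruning tricks above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact pipeline used for the results below: tokens are assumed to be lowercased already, hashtags and mentions are recognised by their leading "#" or "@", and the `entities` set stands in for whatever entity detection you have. All function and parameter names are hypothetical.

```python
import re
from collections import Counter

# Same idea as ([\w])\1\1: one word character repeated three times in a row.
TRIPLE_CHAR = re.compile(r"(\w)\1\1")

def squeeze_repeats(token, keep=3):
    """Collapse any run of 3+ identical characters down to `keep` copies."""
    return re.sub(r"(\w)\1{2,}", lambda m: m.group(1) * keep, token)

def mostly_english(tokens, dictionary, entities=frozenset(), threshold=0.6):
    """True if at least `threshold` of the non-entity words are in `dictionary`.

    Hashtags, mentions and tokens listed in `entities` are ignored (assumption).
    """
    words = [t for t in tokens
             if not t.startswith(("#", "@")) and t not in entities]
    if not words:
        return False
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words) >= threshold

def build_vocab(docs, stopwords, n_words, n_hashtags, n_mentions):
    """Rank words, hashtags and mentions separately, then merge the top of each."""
    words, hashtags, mentions = Counter(), Counter(), Counter()
    for tokens in docs:
        for t in tokens:
            if t.startswith("#"):
                hashtags[t] += 1
            elif t.startswith("@"):
                mentions[t] += 1
            elif t not in stopwords and not TRIPLE_CHAR.search(t):
                words[t] += 1
    vocab = [w for w, _ in words.most_common(n_words)]
    vocab += [h for h, _ in hashtags.most_common(n_hashtags)]
    vocab += [m for m, _ in mentions.most_common(n_mentions)]
    return vocab

# Toy inputs (made up for illustration).
dictionary = {"is", "so", "hot", "happy", "easter"}
stopwords = {"is", "so"}
tweet = "justin bieber is so hot #beliebers".split()
docs = [tweet, "happy easter zzzzzzzz".split()]

print(mostly_english(tweet, dictionary, entities={"justin", "bieber"}))  # True
print(build_vocab(docs, stopwords, n_words=10, n_hashtags=5, n_mentions=5))
print(squeeze_repeats("sooooo"))  # sooo
```

Note that the whole tweet still feeds the counts once it passes the language check, so “justin”, “bieber” and “#beliebers” end up in the vocabulary even though none of them is a dictionary word. Also be aware that \w matches digits and underscores, so a token like “1000” would be dropped by the triple-character filter as written.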

Results on Twitter


Here are a few of the topics identified for Twitter (this run was on Easter 4/18/2012):

  • [‘one’, ‘easter’, ‘happy’, ‘think’, ‘direction’, ‘birthday’, ‘read’, ‘igo’, ‘beach’, ‘cousin’, ‘bless’, ‘america’, ‘finished’, ‘felt’, ‘oil’, ‘gbu’, ‘holiday’, ‘theres’, ‘makin’, ‘angel’]
  • [‘day’, ‘listen’, ‘following’, ‘lord’, ‘mal’, ‘album’, ‘beer’, ‘killing’, ‘calling’, ‘works’, ‘patient’, ‘cares’, ‘faith’, ‘history’, ‘risen’, ‘spanish’, ‘pasa’, ‘throat’, ‘headed’, “god’s”]
  • [‘getting’, “i’ve”, ‘next’, ‘single’, ‘lost’, “we’re”, ‘drunk’, ‘fast’, ‘voice’, ‘young’, ‘liked’, ‘point’, ‘different’, ‘ones’, ‘loved’, “wasn’t”, ‘knew’, ‘sounds’, ‘fact’, ‘drake’]
  • [‘love’, ‘somebody’, ‘relationship’, ‘usa’, ‘fighting’, ‘stomach’, ‘#Aquarius’, ‘hug’, ‘network’, ‘th’, ‘block’, ‘bf’, ‘cheating’, ‘court’, ‘noise’, ‘remind’, ‘#Rays’, ‘pleasure’, ‘fights’, ‘commitment’]
  • [‘family’, ‘else’, ‘justin’, ‘meet’, ‘bieber’, ‘beliebers’, ‘visit’, ‘shop’, ‘afraid’, ‘stress’, ‘father’, ‘harrys’, ‘london’, ‘anywhere’, ‘apart’, ‘auto’, ‘meeting’, ‘stole’, ‘grand’, ‘hospital’]
  • [‘two’, ‘lose’, ‘tryna’, ‘used’, ‘weight’, ‘energy’, ‘minute’, ‘boost’, ‘removing’, ‘brown’, ‘levels’, ‘fastest’, ‘dj’, ‘moves’, ‘screen’, ‘toxins’, ‘biggest’, ‘sports’, ‘rush’, ‘smiling’]
  • [‘time’, ‘today’, ‘white’, ‘top’, ‘least’, ‘news’, ‘daily’, ‘stories’, ‘bag’, ‘wonderful’, ‘center’, ‘commercial’, ‘capable’, ‘gun’, ‘report’, ‘cap’, ‘cancer’, ‘wash’, ‘mercy’, ‘bear’]
  • [‘friends’, ‘tell’, ‘wrong’, ‘bored’, ‘wtf’, ‘worst’, ‘women’, ‘trust’, ‘gay’, ‘#Capricorn’, ‘wear’, ‘loyal’, ‘warm’, ‘enemies’, ‘nightmare’, ‘hearted’, ‘hurts’, ‘common’, ‘beauty’, ‘vote’]

I found it interesting that a few astrological sign hashtags make it into the top-probability words of different topics, and that the other high-probability words in those topics are apparently descriptive of the associated Zodiac signs (I know nothing about astrology and had to ask). Horoscopes tend to use the same ten words to describe the various signs, which probably contributes to the strong topical associations.

It also appears that major events (Easter in this case) will form their own topics: the first two topics in the above list can be described as “secular Easter” and “religious Easter” respectively.

I’m still experimenting with the Twitter model. More on variational inference, scaling and use cases for predictive tasks to come…

Posted by thatdatabaseguy