Recommended Wikipedia schema changes

As a data scientist, I work with Wikipedia a lot. So I’ve been building a directed acyclic graph (modeling article links as edges) in order to try out some topic clustering ideas. To get the in-article links, I just parse the XML and regex the articles for various types of Wiki markup, but there are certain links (lists, etc. which form the ontology of Wikipedia) which can only really be found in Wikimedia’s MySQL dump in the categorylink table. Its schema definition looks like this:

CREATE TABLE `categorylinks`(
`cl_from` int(8) unsigned NOT NULL DEFAULT '0',
    `cl_to` varbinary(255) NOT NULL DEFAULT '',
   `cl_sortkey` varbinary(230) NOT NULL DEFAULT '',
   `cl_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    `cl_sortkey_prefix` varbinary(255) NOT NULL DEFAULT '',
    `cl_collation` varbinary(32) NOT NULL DEFAULT '',
   `cl_type` enum('page','subcat','file') NOT NULL DEFAULT 'page',
UNIQUE KEY `cl_from` (`cl_from`,`cl_to`),
KEY `cl_timestamp` (`cl_to`,`cl_timestamp`),
KEY `cl_collation` (`cl_collation`),
KEY `cl_sortkey` (`cl_to`,`cl_type`,`cl_sortkey`,`cl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary

When using InnoDB, it’s extremely important to mind your primary key, as InnoDB stores things on disk in PK order. Here are your options:

  1. You specify the primary key - hopefully you don’t do something dumb like use a VARCHAR. In general, if you use anything other than an AUTO_INCREMENT you’re probably in for some heartbreak because you won’t be writing sequentially to disk.
  2. You don’t specify any primary OR unique keys on the table - InnoDB will provide an AUTO_INCREMENT key for you behind the scenes. Not a bad option.
  3. You don’t specify a primary key BUT you do specify a UNIQUE constraint - InnoDB will treat the first UNIQUE key that you specify as the primary key.

If you’ve ever imported a Wikipedia dump into MySQL, it’s not that MySQL is horribly slow or that the data is incredibly large, it’s the way they’ve written the schema. The primary key for that particular table is an integer PLUS a VARBINARY(255)… To illustrate, here’s what INSERT queries looked like with the schema above vs. simply adding an auto_inc primary key to the table:

  • Their schema: 500 inserts/s decreasing over time, disk space: over 9 GB
  • My schema: 100,000+ inserts/s, same performance over time, disk space: 700MB

Prior to the InnoDB plugin, MySQL would store the primary key with each secondary index, which means a 32-bit integer will add 4 bytes to all of your other indexes and a 64-bit integer will add 8 bytes. I’m using MySQL 5.5, so I don’t have to worry about it, but based on the dump headers, the folks at Wikipedia are on 5.1.47 (I think they added the InnoDB plugin in 5.1.49). I can’t imagine how slow this must be for them. Inno also doesn’t compress the primary key, so the disk space before I changed it to an AUTO_INCREMENT was over 9GB, now down to 700MB or so.

Conclusion: mind your PKs!