September 21, 2010
We spend quite a lot of time at 10gen supporting MongoDB users. The questions we receive are truly legion but, as you might guess, they tend to overlap. We get frequent queries on sharding, replica sets, and the idiosyncrasies of JavaScript, but the one subject that never fails to appear each day on our mailing list is indexing.
Now, to be clear, I’m not talking about how to create an index. That’s easy. The trouble runs much deeper. It’s knowing how indexes work and having the intuition to create the best indexes for your queries and your data set. Lacking this intuition, your production database will eventually slow to a crawl, you’ll upgrade your hardware in vain, and when all else fails, you’ll blame both gods and men.
This need not be your fate. You can understand indexing! All that’s required is the right mental model, and over the course of this series, that’s just what I hope to provide.
But caveat emptor: what follows is a thought experiment. To get the most out of this post, you can’t skim it. Read every word. Use your imagination. Think through the quizzes. Do this, and your indexing struggles may soon be no more.
To understand indexing, you need a picture in your head. So imagine a cookbook. And not just any cookbook. A massive cookbook, five thousand pages long with the most delicious recipes for every occasion, cuisine, and season with all the good ingredients you might find at home. This is the cookbook to end them all. Let’s call it The Cookbook Omega.
Now, although this might be the best of all possible cookbooks, there are two tiny problems with the Cookbook Omega. The first is that the recipes are in random order. On page 3,475 you have Australian Braised Duck, and on page 2 there’s Zacatecan Tacos.
That would be manageable were it not for the second problem: the Cookbook Omega has no index.
The solution is to build an index.
There are a few ways you can imagine searching for a recipe, but the recipe’s name is probably a good place to start. If we create an alphabetical listing at the end of the book of each recipe’s name followed by its page number, then we’ll have indexed the book by recipe name. A few entries might look like this:
As long as we know the name of the recipe, or even the first few letters of that name, we can use this index to quickly find that recipe in the book. If that’s the only way we expect to search for recipes, then our work is done.
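To make the lookup concrete, here is a minimal Python sketch of a name index: a sorted list of (name, page) pairs searched with binary search. It is only an illustration of the idea, not how any particular database stores its indexes. The duck and taco entries come from the text above; the other two recipes, their page numbers, and the helper names `find_exact` and `find_prefix` are made up for this example.

```python
import bisect

# Name index: (recipe name, page) pairs kept in alphabetical order.
# Only the duck and taco entries come from the text; the rest are hypothetical.
name_index = sorted([
    ("Australian Braised Duck", 3475),
    ("Cashew Chicken", 1215),
    ("Cauliflower Gratin", 88),
    ("Zacatecan Tacos", 2),
])

def find_exact(index, name):
    """Return the page for an exact recipe name, or None if it is not listed."""
    i = bisect.bisect_left(index, (name,))
    if i < len(index) and index[i][0] == name:
        return index[i][1]
    return None

def find_prefix(index, prefix):
    """Return every (name, page) entry whose name starts with the prefix."""
    i = bisect.bisect_left(index, (prefix,))
    matches = []
    # Matching entries sit next to each other because the list is sorted.
    while i < len(index) and index[i][0].startswith(prefix):
        matches.append(index[i])
        i += 1
    return matches
```

Because the entries are sorted, both an exact name and the first few letters of a name let us jump close to the right spot instead of reading the whole listing.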
But, of course, this is unrealistic. Because we can also imagine wanting to find recipes based on the ingredients we have in our pantry. Or perhaps we want to search by cuisine. For those cases, we need more indexes.
So, we need to build another index, this time on ingredients. In this index, we have an alphabetical listing of ingredients, each pointing to the page numbers of the recipes containing that ingredient. The most basic index on ingredients would thus look like this:
Is this the index you thought you were going to get? Is it even helpful?
This index is good if all you need is a list of recipes for a given ingredient. But if you have any other information about the recipe that you want to include in your search, you still have some scanning to do. Once you know the page numbers where Cauliflower is referenced, you then need to go to each of those pages to get the name of the recipe and what cuisine it’s part of. This is, of course, better than paging through the whole book, but there are still improvements to be made.
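A small Python sketch may help show where the extra scanning comes from. Everything in it is hypothetical: the three recipes, their page numbers and cuisines, and the helper `recipes_with`. The point is only that a single-key ingredient index stores page numbers and nothing else, so any further filtering means opening each page.

```python
# Hypothetical pages of the book: page number -> recipe details.
pages = {
    88: {"name": "Cauliflower Gratin", "cuisine": "French",
         "ingredients": ["Cauliflower", "Cheese"]},
    541: {"name": "Aloo Gobi", "cuisine": "Indian",
          "ingredients": ["Cauliflower", "Potato"]},
    1215: {"name": "Cashew Chicken", "cuisine": "Chinese",
           "ingredients": ["Cashew", "Chicken"]},
}

# Single-key ingredient index: ingredient -> sorted list of page numbers.
ingredient_index = {}
for page, recipe in pages.items():
    for ingredient in recipe["ingredients"]:
        ingredient_index.setdefault(ingredient, []).append(page)
for page_list in ingredient_index.values():
    page_list.sort()

def recipes_with(ingredient, cuisine):
    """Find recipe names by ingredient and cuisine, counting pages opened."""
    visited = 0
    names = []
    for page in ingredient_index.get(ingredient, []):
        visited += 1  # the index holds only page numbers, so we must open the page
        if pages[page]["cuisine"] == cuisine:
            names.append(pages[page]["name"])
    return names, visited
```

Asking for Indian recipes with Cauliflower opens both Cauliflower pages even though only one of them matches, which is the leftover scanning described above.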
Happily, there is a solution to the problem of the long-lost chicken recipe, and it lies in the use of compound indexes.
The two indexes we’ve created so far are single-key indexes: they both order just one data item from each recipe. We’re now going to build out yet another index for the Cookbook Omega, but this time, instead of using just one data item, we’ll use two. Indexes that use more than one key like this are called compound indexes.
This compound index uses both ingredients and recipe name, in that order. We’ll notate the index like this: ingredient→name. Here’s what part of this index would look like:
The value of this index for a human is obvious. You can now search by ingredient and probably very quickly find the recipe you want. For a machine, it’s still valuable for this use case, but only if we can provide the first letters of the recipe name, which will keep the database from having to scan every recipe name listed for that ingredient. This compound index would be especially useful if, as with the Cookbook Omega, we had several hundred (or thousand) chicken recipes.
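As a rough illustration, the compound index can be modeled in Python as a sorted list of (ingredient, name, page) tuples. The entries and the helper `find` are hypothetical, and a real database would not store its index this way, but the sketch shows why supplying the first letters of the recipe name saves work.

```python
import bisect

# Compound index on ingredient -> name: tuples sorted by ingredient, then name.
# All entries are hypothetical.
compound_index = sorted([
    ("Cashew", "Cashew Chicken", 1215),
    ("Cauliflower", "Aloo Gobi", 541),
    ("Chicken", "Cashew Chicken", 1215),
    ("Chicken", "Chicken Cacciatore", 422),
    ("Chicken", "Chicken Tikka Masala", 3001),
    ("Chicken", "Coq au Vin", 77),
])

def find(index, ingredient, name_prefix=""):
    """Return (name, page) pairs for an ingredient, optionally narrowed by a name prefix."""
    # Binary search lands on the first entry for this ingredient and prefix.
    i = bisect.bisect_left(index, (ingredient, name_prefix))
    results = []
    while (i < len(index)
           and index[i][0] == ingredient
           and index[i][1].startswith(name_prefix)):
        results.append((index[i][1], index[i][2]))
        i += 1
    return results
```

With only the ingredient, `find` walks every Chicken entry. With the ingredient plus a name prefix, it starts at the first matching name and stops as soon as the names no longer match.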
name→ingredient. Would this index be interchangeable with the inverse compound index we just explored?
ingredient→name. With the compound index in place, is it possible to eliminate either of the first two indexes we created?
ingredient→name. If we know an ingredient, we can, using that compound index, easily get a list of all page numbers containing said ingredient. Look again at the sample entries for this index to see why this is so.
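The asymmetry between the two keys can be sketched in Python. The entries and both helper functions are hypothetical, and counting "entries examined" is just a stand-in for the work a lookup has to do.

```python
import bisect

# Hypothetical compound index on ingredient -> name, as sorted tuples.
compound_index = sorted([
    ("Cashew", "Cashew Chicken", 1215),
    ("Cauliflower", "Aloo Gobi", 541),
    ("Chicken", "Cashew Chicken", 1215),
    ("Chicken", "Coq au Vin", 77),
    ("Potato", "Aloo Gobi", 541),
])

def pages_for_ingredient(index, ingredient):
    """First key known: jump to the ingredient's block and read only that block."""
    start = bisect.bisect_left(index, (ingredient,))
    examined = 0
    pages = []
    for entry in index[start:]:
        if entry[0] != ingredient:
            break
        examined += 1
        pages.append(entry[2])
    return pages, examined

def pages_for_name(index, name):
    """Only the second key known: matches are scattered, so every entry is examined."""
    pages = sorted({page for _, entry_name, page in index if entry_name == name})
    return pages, len(index)
```

Looking up Chicken touches only the two Chicken entries, while looking up a recipe by name alone has to read the entire listing, because the names are sorted only within each ingredient.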
ingredient→name to find that recipe’s page number?
We’re going to cover one more concept that you can use to design better indexes. It’s the idea that some keys for your data will be more selective than others. Imagine that the recipes in the Cookbook Omega consist of a total of 200 different ingredients but only represent 12 different cuisines. If the cookbook contains 5,000 recipes, which key — ingredients or cuisine — is more selective?
Intuitively, it should be easy to see that ingredient narrows the number of recipes much more than cuisine does. On average, there will be 417 (5,000 / 12) recipes per cuisine but only 125 recipes per ingredient (5,000 / 200 * 5). This assumes an average of five ingredients per recipe, but clearly, the actual selectivity of any given ingredient will always be hard to estimate. Some ingredients (chicken) will be less selective than others (anise). But ingredient is generally the more selective of the two fields.
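The arithmetic is simple enough to check in a few lines of Python. The figures are the ones given in the text, including its assumed average of five ingredients per recipe; the variable names are just for this sketch.

```python
# Figures from the text; five ingredients per recipe is the text's own assumption.
recipes = 5000
cuisines = 12
ingredients = 200
ingredients_per_recipe = 5

# Average number of recipes sharing one value of each key.
recipes_per_cuisine = recipes / cuisines
recipes_per_ingredient = recipes * ingredients_per_recipe / ingredients

print(round(recipes_per_cuisine))     # 417
print(round(recipes_per_ingredient))  # 125
```

The smaller the average group, the more a single value of that key narrows the search, which is why ingredient counts as the more selective key here.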
ingredient→cuisine. In addition to providing a more selective first key, this compound index will actually be useful for queries on ingredient only. It’s easy to see how the inverse compound index on cuisine→ingredient wouldn’t be all that useful for the opposite case, where the search is on cuisine alone. (And of course, we all know now that queries on a compound index aren’t possible if all we have is the second key.)
ingredient→cuisine and cuisine→ingredient as they would appear as indexes in the Cookbook Omega.
The goal of this post was to lay a groundwork for readers needing a better mental model of indexes. Having a solid mental model is always better than memorizing rules of thumb, but just to help out, here are a few concrete lessons that can be derived from this thought experiment:
An index on ingredient can and should be eliminated if you have a second index on ingredient→cuisine. More generally, if you have an index on a→b, then a second index on a alone will be redundant.
One final note is to bear in mind that the cookbook analogy can be taken only so far. It’s a model for understanding indexes, but it doesn’t fully correspond to the way that B-tree indexes actually work (B-trees are the data structure used to represent indexes in most databases, including MongoDB). The ideas presented here generally hold true, but if you’d like more nuance, stay tuned for the next post, where I’ll introduce B-trees and build on the mental model begun here.