3/12/2012

Why it's impossible to reliably count all records/documents/keys in a distributed data store

When you are not familiar with the theoretical aspects of distributed systems, but are using or going to use a distributed data store of whatever kind (sharded MongoDB, Riak etc.), you still need to understand some aspects. Then you won't wonder if some of the things you're used to when building on a single-instance or central data store without any replicas suddenly don't work reliably or are being told to be unreliable in the distributed world.

One of those aspects is the global record/document/key (however your store calls it) count. Let's agree on item count as a term. In a distributed data store with active replication it's simply impossible to always have a reliable information how many items your store keeps.

In order to better understand why it is so, imagine a global population census. Let's start in one single country. They delegate the counting process to single towns, then districs and so on. This is nothing that can be done within a millisecond. It takes time. They collect the local information during the counting days and send it upwards and so on. After some weeks, they are done.

Now consider the world wide scenario. It will even take much longer to collect all the local country counters at one central place. But anyway, one day this information is there.

Is the count of people on the planet reliable then? Imagine that while they are counting, people get born, move from one country to another and, I'm sorry, die. That means that what was possible is to have a global snapshot of the world population, each part of this snapshot is snapshot itself and isn't worth the paper it's written on after it has been done - so much has changed. People have been double counted or just didn't pop up for whatever reason.

But for the global scenario it's still ok to have a guess, with some tolerance. That's how your distributed data store would count the items it stores.

But it's still a stupid idea to count them. Remember the population census - counting people world wide is a very expensive and time consuming process. Exactly the same problem would face your distributed data store when you start counting its overall items.

And even when you decide to do so, it's still a guess - you only can do this asynchronously and unreliably, as a bunch of snapshots of different reliability and quality. Some nodes in the distributed system can go down or become unavailable (what normally doesn't happen to countries or cities) at the moment you try to count the items.

So, the only way to get reliable snapshots (only from available nodes) is to do it synchronously. But this is a much more stupid idea than to count the items at all. And when they want to synchronously count people on Earth, they need to plug GPS transmitters into every single body. And still, it doesn't solve the problem of having no GPS connection or just people getting born but still being unregistered. As well as, I'm sorry again, people dying.

Anyway, I hope I could help to understand the problem. Any feedback is welcome.

No comments: