Saturday, April 18, 2009

Consistent hashing

From Wikipedia:

Consistent hashing is a scheme that provides hash table functionality in such a way that the addition or removal of one slot does not significantly change the mapping of keys to slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped. By using consistent hashing, only K/n keys need to be remapped on average, where K is the number of keys and n is the number of slots.

Consistent hashing was introduced in 1997 as a way of distributing requests among a changing population of web servers. Each slot is then represented by a node in a distributed system. The addition (joins) and removal (leaves/failures) of nodes only requires K/n items to be re-shuffled when the number of slots/nodes changes. More recently it has been used to reduce the impact of partial system failures in large web applications, allowing for robust caches without incurring the system-wide fallout of a failure [1] [2].

However, the most significant application of consistent hashing has been to form the foundation of distributed hash tables (DHTs). DHTs use consistent hashing to partition a keyspace among a distributed set of nodes, and additionally provide an overlay network which connects nodes such that the node responsible for any key can be efficiently located.

From another article:

Let’s say you’re a hot startup and your database is starting to slow down. You decide to cache some results so that you can render web pages more quickly. If you want your cache to use multiple servers (scale horizontally, in the biz), you’ll need some way of picking the right server for a particular key. If you only have 5 to 10 minutes allocated for this problem on your development schedule, you’ll end up using what is known as the naïve solution: put your N server IPs in an array and pick one using key % N.

I kid, I kid — I know you don’t have a development schedule. That’s OK. You’re a startup.

Anyway, this ultra simple solution has some nice characteristics and may be the right thing to do. But your first major problem with it is that as soon as you add a server and change N, most of your cache will become invalid. Your databases will wail and gnash their teeth as practically everything has to be pulled out of the DB and stuck back into the cache. If you’ve got a popular site, what this really means is that someone is going to have to wait until 3am to add servers because that is the only time you can handle having a busted cache. Poor Asia and Europe — always getting screwed by late night server administration.
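To see how bad the damage is, here is a rough Python sketch of the naive `key % N` scheme (the key names and server counts are invented for illustration); it counts how many keys land on a different server when you grow from 5 to 6 servers:

```python
import hashlib

def bucket(key: str, n: int) -> int:
    """Naive scheme: hash the key and pick server hash % n."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

# Count how many of 10,000 keys move when we grow from 5 to 6 servers.
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(bucket(k, 5) != bucket(k, 6) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # roughly 5/6 of them
```

With uniform hashing, a key keeps its server only when the two modulo results happen to agree, so about 5/6 of the cache goes stale — exactly the "wail and gnash their teeth" scenario above.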

You’ll have a second problem if your cache is read-through or you have some sort of processing occurring alongside your cached data. What happens if one of your cache servers fails? Do you just fail the requests that should have used that server? Do you dynamically change N? In either case, I recommend you save the angriest twitters about your site being down. One day you’ll look back and laugh. One day.

As I said, though, that might be OK. You may be trying to crank this whole project out over the weekend and simply not have time for a better solution. That is how I wrote the caching layer for Audiogalaxy searches, and that turned out OK. The caching part, at least. But if I had known about consistent hashing at the time, I would have started with a simple version of it. It isn’t that much more complicated to implement and it gives you a lot of flexibility down the road.

The technical aspects of consistent hashing have been well explained in other places, and you’re crazy and negligent if you use this as your only reference. But, I’ll try to do my best. Consistent hashing is a technique that lets you smoothly handle these problems:

  • Given a resource key and a list of servers, how do you find a primary, secondary, tertiary (and on down the line) server for the resource?
  • If you have different size servers, how do you assign each of them an amount of work that corresponds to their capacity?
  • How do you smoothly add capacity to the system without downtime? Specifically, this means solving two problems:
    • How do you avoid dumping 1/N of the total load on a new server as soon as you turn it on?
    • How do you avoid rehashing more existing keys than necessary?

In a nutshell, here is how it works. Imagine a 64-bit space. For bonus points, visualize it as a ring, or a clock face. Sure, this will make it more complicated when you try to explain it to your boss, but bear with me:


That part isn’t very complicated.

Now imagine hashing resources into points on the circle. They could be URLs, GUIDs, integer IDs, or any arbitrary sequence of bytes. Just run them through MD5 or SHA and shave off everything but 8 bytes (and if anyone tells you that you shouldn’t use MD5 for this because it isn’t secure, just nod and back away slowly. You have identified someone not worth arguing with). Now, take those freshly minted 64-bit numbers and stick them onto the circle:
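The hash-and-truncate step looks like this in Python (a sketch; any fast, well-spread hash would do — security really doesn't matter here, only uniformity):

```python
import hashlib

def ring_point(data: str) -> int:
    """Hash an arbitrary string to a 64-bit point on the ring.

    Keeps only the first 8 bytes of the MD5 digest, per the
    "shave off everything but 8 bytes" advice above.
    """
    digest = hashlib.md5(data.encode()).digest()
    return int.from_bytes(digest[:8], "big")

print(hex(ring_point("")))
```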


Finally, imagine your servers. Imagine that you take your first server and create a string by appending the number 1 to its IP. Let’s call that string IP1-1. Next, imagine you have a second server that has twice as much memory as server 1. Start with server #2’s IP, and create 2 strings from it by appending 1 for the first one and 2 for the second one. Call those strings IP2-1 and IP2-2. Finally, imagine you have a third server that is exactly the same as your first server, and create the string IP3-1. Now, take all those strings, hash them into 64-bit numbers, and stick them on the circle with your resources:
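Building that weighted ring is a few lines of Python (the IPs and point counts below are made up, matching the story: server 2 has twice the memory, so it gets twice the points):

```python
import hashlib

def ring_point(data: str) -> int:
    """First 8 bytes of the MD5 digest, as a 64-bit ring position."""
    digest = hashlib.md5(data.encode()).digest()
    return int.from_bytes(digest[:8], "big")

# (IP, points): more points means more of the ring, hence more load.
servers = [("", 1), ("", 2), ("", 1)]

ring = []  # (point, ip) pairs
for ip, points in servers:
    for i in range(1, points + 1):
        ring.append((ring_point(f"{ip}-{i}"), ip))  # e.g. "IP2-2"
ring.sort()  # a sorted list makes the clockwise search cheap later
```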


Can you see where this is headed? You have just solved the problem of which server to use for resource A. You start where resource A is and head clockwise on the ring until you hit a server. If that server is down, you go to the next one, and so on and so forth. In practice, you’ll want to use more than 1 or 2 points for each server, but I’ll leave those details as an exercise for you, dear reader.
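The clockwise walk maps neatly onto a binary search over the sorted ring. Here is one way to sketch it in Python (same invented IPs as before; `bisect` finds the first server point at or past the key's point, and the modulo handles wrapping past the top of the 64-bit space back around the ring):

```python
import bisect
import hashlib

def ring_point(data: str) -> int:
    digest = hashlib.md5(data.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_ring(servers):
    """servers: list of (ip, points). Returns a sorted list of (point, ip)."""
    ring = [(ring_point(f"{ip}-{i}"), ip)
            for ip, points in servers
            for i in range(1, points + 1)]
    ring.sort()
    return ring

def lookup(ring, key):
    """Start at the key's point and head clockwise to the first server."""
    point = ring_point(key)
    idx = bisect.bisect(ring, (point,)) % len(ring)  # wrap around the ring
    return ring[idx][1]

ring = build_ring([("", 1), ("", 2), ("", 1)])
print(lookup(ring, "resource-A"))
```

If the chosen server is down, you just keep walking: take the next index around the ring, skipping points that belong to the dead server.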

Now, allow me to use bullet points to explain how cool this is:

  • Assuming you’ve used a lot more than 1 point per server, when one server goes down, every other server will get a share of the new load. In the case above, imagine what happens when server #2 goes down. Resource A shifts to server #1, and resource B shifts to server #3 (Note that this won’t help if all of your servers are already at 100% capacity. Call your VC and ask for more funding).
  • You can tune the amount of load you send to each server based on that server’s capacity. Imagine this spatially – more points for a server means it covers more of the ring and is more likely to get more resources assigned to it.

    You could have a process try to tune this load dynamically, but be aware that you’ll be stepping close to problems that control theory was built to solve. Control theory is more complicated than consistent hashing.

  • If you store your server list in a database (2 columns: IP address and number of points), you can bring servers online slowly by gradually increasing the number of points they use. This is particularly important for services that are disk bound and need time for the kernel to fill up its caches. This is one way to deal with the datacenter variant of the Thundering Herd Problem.

    Here I go again with the control theory — you could do this automatically. But adding capacity usually happens so rarely that just having somebody sitting there watching top and running SQL updates is probably fine. Of course, EC2 changes everything, so maybe you’ll be hitting the books after all.

  • If you are really clever, when everything is running smoothly you can go ahead and pay the cost of storing items on both their primary and secondary cache servers. That way, when one server goes down, you’ve probably got a backup cache ready to go.

Pretty cool, eh?

I want to hammer on that last point for a minute. If you are building a big system, you really need to consider what happens when machines fail. If the answer is “we crush the databases,” congratulations: you will get to observe a cascading failure. I love this stuff, so hearing about cascading failures makes me smile. But it won’t have the same effect on your users.

Finally, you may not know this, but you use consistent hashing every time you put something in your cart at Amazon. Their massively scalable data store, Dynamo, uses this technique. Or if you use, you’ve used a great combination: consistent hashing + memcached. They were kind enough to release their changes, so if you are using memcached, you can just use their code without dealing with these messy details. But keep in mind that there are more applications to this idea than just simple caching. Consistent hashing is a powerful idea for anyone building services that have to scale across a group of computers.
