The hash() function is Python’s own built-in hash routine, which is designed to be blazingly fast because it is
used internally to implement Python dictionary lookup. The MD5 algorithm is much more sophisticated because it
was actually designed as a cryptographic hash. Although it is now considered too weak for security use, using it to
distribute load across servers is fine (though slower than Python’s built-in hash).
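The hashing.py script itself is not reproduced here, but a rough sketch of the kind of comparison it runs might look like the following; the word-list path and the exact letter boundaries are assumptions for illustration.

import hashlib
from collections import Counter

def alpha_shard(word):
    """Assign a word to a server by its first letter (a poor strategy)."""
    first = word[0].lower()
    if first < 'g':          # a b c d e f
        return 'server0'
    elif first < 'n':        # g h i j k l m
        return 'server1'
    elif first < 't':        # n o p q r s
        return 'server2'
    else:                    # t u v w x y z
        return 'server3'

def hash_shard(word):
    """Assign a word to a server with Python's built-in hash()."""
    return 'server%d' % (hash(word) % 4)

def md5_shard(word):
    """Assign a word to a server with an MD5 digest of the word."""
    digest = hashlib.md5(word.encode('utf-8')).digest()
    return 'server%d' % (digest[-1] % 4)

if __name__ == '__main__':
    words = [line.strip() for line in open('/usr/share/dict/words')]
    for shard in (alpha_shard, hash_shard, md5_shard):
        counts = Counter(shard(word) for word in words)
        print(shard.__name__)
        for server in sorted(counts):
            print(' ', server, counts[server],
                  round(counts[server] / len(words), 2))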
The results show quite plainly the danger of trying to distribute load using any method that could directly expose
the patterns in your data.
$ python hashing.py
alpha
server0 35285 0.36
server1 22674 0.23
server2 29097 0.29
server3 12115 0.12
hash
server0 24768 0.25
server1 25004 0.25
server2 24713 0.25
server3 24686 0.25
md5
server0 24777 0.25
server1 24820 0.25
server2 24717 0.25
server3 24857 0.25
You can see that distributing load by first letters, where each of the four bins has a roughly equal number of
letters assigned to it, results in server 0 getting nearly three times the load of server 3, even though it was assigned
only six letters instead of seven! The hash routines, however, both performed like champions. Despite all of the
strong patterns that characterize not only the first letters but also the entire structure and endings of English words,
the hash functions scattered the words evenly across these four fictional servers.
Though many data sets are not as skewed as the letter distributions of English words, sharded databases like
Memcached always have to contend with the appearance of patterns in their input data.
Listing 8-1, for example, was not unusual in its use of keys that always began with a common prefix and that
were followed by characters from a restricted alphabet: the decimal digits. These kinds of obvious patterns are why
sharding should always be performed through a hash function.
Of course, this is an implementation detail that you can often ignore when you use a database system like
Memcached whose client libraries support sharding internally. But if you ever need to design a service of your own
that automatically assigns work or data to nodes in a cluster in a way that needs to be reproducible between several
clients of the same data store, then you will find the same technique useful in your own code.
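As a rough sketch, a reproducible key-to-node assignment might look like the following; the node names and the example key are invented for illustration, and MD5 is chosen here simply because it was already discussed above. Every client that holds the same list of nodes computes the same answer for the same key.

import hashlib

NODES = ['node0.example.com', 'node1.example.com',
         'node2.example.com', 'node3.example.com']

def node_for_key(key):
    """Return the node responsible for this key; identical on every client."""
    digest = hashlib.md5(key.encode('utf-8')).digest()
    return NODES[digest[-1] % len(NODES)]

print(node_for_key('user:4150'))

Note that adding or removing a node changes len(NODES) and therefore remaps most keys, which is worth keeping in mind before relying on a scheme this simple.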
Message Queues
Message queue protocols let you send reliable chunks of data that the protocols call messages rather than
datagrams since, as you saw in Chapter 2, the idea of a datagram is specific to unreliable services where data can
be lost, duplicated, or reordered by the underlying network. Typically, a message queue promises to transmit
messages reliably and to deliver them atomically: a message either arrives whole and intact, or it does not arrive at
all. Framing is handled by the message queue protocol itself, so clients using a message queue never have to loop,
calling something like recv() repeatedly until a whole message has arrived.
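As a quick illustration, here is a rough sketch using the pyzmq library, which is merely one possible message queue implementation and an assumption on my part rather than something prescribed by the paragraph above. A single recv() call returns one complete message.

import zmq

context = zmq.Context()

# A PUSH socket sends whole messages; a PULL socket receives them.
push = context.socket(zmq.PUSH)
push.bind('tcp://127.0.0.1:6789')

pull = context.socket(zmq.PULL)
pull.connect('tcp://127.0.0.1:6789')

push.send(b'This message arrives whole and intact, or not at all')
print(pull.recv())   # one call, one complete message: no framing loop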