The Power of Epidemics

This post is part of a series of posts relating to distributed system design that I'm completing as part of my Computer Science MS program at BYU.

As systems grow, the task of communication between nodes in the system can become a herculean task. Frequently, the limitations of the communication protocol serve as the upper bound of system size despite plentiful system resources. Many methods exist for allowing communication to scale with system size, but most succeed only in raising the threshold for system size.

Epidemic communication protocols, also known as gossip protocols, approach the communication problem from a completely different angle. I read about these concepts in a 2002 paper entitled The Power of Epidemics: Robust Communication for Large-Scale Distributed Systems [pdf] by Werner Vogels, Tobbert van Renesse, and Ken Birman. Whereas many other protocols follow rigid and organized communication patterns, Epidemic protocols rely on random communication between nodes to achieve rapid and robust information spread.

Naming these protocols after patterns of disease spread and the gossip patterns of human social groups is not an accident. These protocols have the advantage of being very simple because of our familiarity with these situations in society. Just as disease spreads through a population through random contact between two society members, information spreads between two nodes as each node chooses another at random with which to compare state. Gossip protocols rely on frequent communication between nodes to spread new information. Putting these two concepts produces the system described by Vogels et. al.

Each node selects another node from it's stored membership list once per second. The two nodes transfer a representation of state quickly, and upon discovery of a mismatch send updates to the other party. When a single node becomes aware of new information, it will share that information within one second to another node. The two copies become 4 copies, then 8, 16, 32, and so forth for each second passed.  There are three important properties of these systems: speed and reliability, and constant load

Speed

In my previous example of spread, I assumed a doubling of information for each time period. This is in fact low, as on average each node will be in contact with two other nodes per time period, causing the number of informed nodes to grow by a factor of 4. One copy becomes 4 copies becomes 16, 64, 256, and so on. This growth can spread information even throughout very large systems in a very short time. Assuming our average of two gossip connections per second, the theoretic message spread is 1 million nodes after 10 seconds. For a system of 250 thousand nodes, the chance of a node NOT receiving the information update within 10 seconds becomes vanishingly small

Reliability

The paper introduces the concept of probabilistic guarantees. While it sounds counter-intuitive, considering the probability that a node will NOT have received an update within a period of time isn't insane. The probability is so low, given even only a few seconds, that it exceeds the probability of system and network failures.

Constant Load

The really cool thing about these types of protocols is the constant load required of each machine at any scale. Used in a cluster of 3 nodes, each node talks to (on average) 2 other nodes per time period. Expand that cluster to 250,000 nodes, and each node talks to ...... 2 nodes per time period. Few communication protocols operate under such constant load. Because of the randomness required for node selection, each node may talk to only one node or quite a few during a single period. Measured over a few time periods, the load will appear to be quite constant.

So there you have it. Next time you need to design a scalable communication system, consider giving your system the flu.

Comments (Comment Moderation is enabled. Your comment will not appear until approved.)
Thomas Williams's Gravatar Hi Sam - a very interesting article (and a cool idea). Do you know of any code examples demonstrating the "epidemics" principle?
# Posted By Thomas Williams | 12/4/08 6:58 PM
Sam Curren's Gravatar I haven't used this concept myself, so I don't have any code examples. I do know that these concepts are used as part of an internal system at Amazon called Dynamo (http://www.allthingsdistributed.com/2007/10/amazon...).

The gossip nature of the system isn't complicated at all. The complication is in the process of exchanging information between two systems as part of a round of gossip. There really isn't anything hidden here, although large systems will certainly need some fine engineering to operate well.

If you haven't read the linked paper, that will also be worth your time.
# Posted By Sam Curren | 12/4/08 8:41 PM
Thomas Williams's Gravatar Thanks for the link Sam (reading it now). I asked the question because I was interested to see what approaches there are when you need to add nodes to an existing network:
* do you have an "overseer" that adds nodes, to make sure the new node(s) are part of the system e.g. get a link from another node? This "overseer" would then have to have knowledge of all nodes, right?
* do you only allow new nodes to be added from existing nodes (which means that they'll be at least linked) - if so, how do you tell a node to add a node?

Cheers, Thomas
# Posted By Thomas Williams | 12/4/08 9:23 PM
Sam Curren's Gravatar Ahh, Bootstrapping. You need a way for a new node to discover existing nodes to begin the initial membership list swap. The paper only talked briefly about this, and I only have a few ideas off the cuff.

1. With a system deployed on the public internet, you can use an external resource to register a few known servers, such as DNS. DNS is usually to slow to use to manage the complete list, but it will allow discovery into an existing node. Registering multiple nodes for the same domain name would allow round-robin assignment of this first gossip partner.

2. With a system deployed on a private network, you could send the new node trick-or-treating through the existing address space of possible nodes, looking on a known port for system type. You could also have nodes randomly send a broadcast packet saying 'hey! I'm a node of cluster type x!' These broadcast messages would need to be sparse enough to keep network traffic down, but frequent enough to maintain an acceptable bootstrap delay.

Hope that gets the gears started. The solution you pick will have to be specific to your particular system, of course.
# Posted By Sam Curren | 12/4/08 9:39 PM
mbt shoes's Gravatar Homepage of MBT physiological footwear.Huge MBT shoes styles to choose from. Shipped Free for all mbts. http://www.mbts-mbtshoes.com
# Posted By mbt shoes | 6/5/10 1:57 AM