This post is part of a series of posts relating to distributed system design that I'm completing as part of my Computer Science MS program at BYU.
As systems grow, the task of communication between nodes in the system can become a herculean task. Frequently, the limitations of the communication protocol serve as the upper bound of system size despite plentiful system resources. Many methods exist for allowing communication to scale with system size, but most succeed only in raising the threshold for system size.
Epidemic communication protocols, also known as gossip protocols, approach the communication problem from a completely different angle. I read about these concepts in a 2002 paper entitled The Power of Epidemics: Robust Communication for Large-Scale Distributed Systems [pdf] by Werner Vogels, Tobbert van Renesse, and Ken Birman. Whereas many other protocols follow rigid and organized communication patterns, Epidemic protocols rely on random communication between nodes to achieve rapid and robust information spread.
Naming these protocols after patterns of disease spread and the gossip patterns of human social groups is not an accident. These protocols have the advantage of being very simple because of our familiarity with these situations in society. Just as disease spreads through a population through random contact between two society members, information spreads between two nodes as each node chooses another at random with which to compare state. Gossip protocols rely on frequent communication between nodes to spread new information. Putting these two concepts produces the system described by Vogels et. al.
Each node selects another node from it's stored membership list once per second. The two nodes transfer a representation of state quickly, and upon discovery of a mismatch send updates to the other party. When a single node becomes aware of new information, it will share that information within one second to another node. The two copies become 4 copies, then 8, 16, 32, and so forth for each second passed. There are three important properties of these systems: speed and reliability, and constant load
Speed
In my previous example of spread, I assumed a doubling of information for each time period. This is in fact low, as on average each node will be in contact with two other nodes per time period, causing the number of informed nodes to grow by a factor of 4. One copy becomes 4 copies becomes 16, 64, 256, and so on. This growth can spread information even throughout very large systems in a very short time. Assuming our average of two gossip connections per second, the theoretic message spread is 1 million nodes after 10 seconds. For a system of 250 thousand nodes, the chance of a node NOT receiving the information update within 10 seconds becomes vanishingly small
Reliability
The paper introduces the concept of probabilistic guarantees. While it sounds counter-intuitive, considering the probability that a node will NOT have received an update within a period of time isn't insane. The probability is so low, given even only a few seconds, that it exceeds the probability of system and network failures.
Constant Load
The really cool thing about these types of protocols is the constant load required of each machine at any scale. Used in a cluster of 3 nodes, each node talks to (on average) 2 other nodes per time period. Expand that cluster to 250,000 nodes, and each node talks to ...... 2 nodes per time period. Few communication protocols operate under such constant load. Because of the randomness required for node selection, each node may talk to only one node or quite a few during a single period. Measured over a few time periods, the load will appear to be quite constant.
So there you have it. Next time you need to design a scalable communication system, consider giving your system the flu.
The gossip nature of the system isn't complicated at all. The complication is in the process of exchanging information between two systems as part of a round of gossip. There really isn't anything hidden here, although large systems will certainly need some fine engineering to operate well.
If you haven't read the linked paper, that will also be worth your time.
* do you have an "overseer" that adds nodes, to make sure the new node(s) are part of the system e.g. get a link from another node? This "overseer" would then have to have knowledge of all nodes, right?
* do you only allow new nodes to be added from existing nodes (which means that they'll be at least linked) - if so, how do you tell a node to add a node?
Cheers, Thomas
1. With a system deployed on the public internet, you can use an external resource to register a few known servers, such as DNS. DNS is usually to slow to use to manage the complete list, but it will allow discovery into an existing node. Registering multiple nodes for the same domain name would allow round-robin assignment of this first gossip partner.
2. With a system deployed on a private network, you could send the new node trick-or-treating through the existing address space of possible nodes, looking on a known port for system type. You could also have nodes randomly send a broadcast packet saying 'hey! I'm a node of cluster type x!' These broadcast messages would need to be sparse enough to keep network traffic down, but frequent enough to maintain an acceptable bootstrap delay.
Hope that gets the gears started. The solution you pick will have to be specific to your particular system, of course.