Planning Ahead: Resilient Load Variations in Distributed Stream Processing

This post is part of a series on distributed system design that I'm writing as part of my Computer Science MS program at BYU.

Background

Stream processing systems derive low-latency results from continuous source streams. Good examples include stock market updates, highway traffic data, and network traffic analysis. A stream processing system takes a stream of data and runs it through a series of operations, usually called operators. Most operators either convert the source data or calculate a reduction of it. The final result is either a new data stream or an updated data set.
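To make the idea of chained operators concrete, here is a minimal sketch in Python. The operator names (`to_cents`, `running_max`) and the sample prices are my own illustration and do not come from any particular stream processing system. One operator performs a conversion, the next performs a reduction, and the output is itself a new stream.

```python
from typing import Iterable, Iterator


def to_cents(prices: Iterable[float]) -> Iterator[int]:
    """Conversion operator: map each dollar price to integer cents."""
    for price in prices:
        yield round(price * 100)


def running_max(values: Iterable[int]) -> Iterator[int]:
    """Reduction operator: emit the largest value seen so far."""
    best = None
    for value in values:
        best = value if best is None else max(best, value)
        yield best


def pipeline(source: Iterable[float]) -> Iterator[int]:
    """Chain the operators; each tuple flows through as soon as it arrives."""
    return running_max(to_cents(source))


# Hypothetical stock ticks, in dollars.
ticks = [10.25, 10.10, 10.40, 10.30]
highs = list(pipeline(ticks))
print(highs)
```

Because the operators are generators, each tuple passes through the whole chain before the next one is read, which is roughly how a streaming system keeps latency low.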

Small stream processing deployments usually run on a single machine. When the load grows beyond the power of the host machine, the operators must be distributed among several machines. Adding processing power enables the system to maintain low latency, where latency is the time that data takes to travel from system input to system output. Dividing the operators among multiple machines is difficult, particularly when the load is variable.

Early distributed systems were managed manually; changes to the operator placement were difficult and often required downtime. Later research into dynamic operator distribution allowed systems to move operators between machines at runtime, rebalancing the processing load as it changed.

Research

In a 2006 paper titled "Providing Resiliency to Load Variations in Distributed Stream Processing", Ying Xing et al. describe a new approach to handling variable load.

They observe that the cost of moving operators in a dynamic distribution system is very high. They outline a process for choosing an initial operator distribution that maximizes the range of load scenarios the system can manage. By maximizing that range, they reduce, and possibly eliminate, the need to migrate operators between machines.

They begin by modeling the operators present in the distributed processing system, and then evaluate different methods of choosing the optimal distribution. In addition to reporting the optimal distribution, their method reports the range of load that the system can handle without operator migration.
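The following sketch illustrates the general idea of picking a placement that tolerates the widest range of loads. It is not the algorithm from the paper. I assume a hypothetical model in which each operator's CPU cost is linear in the input stream rates, try every placement by brute force, and estimate the size of the manageable load range by sampling random rate combinations. All operator names, coefficients, and capacities are invented for the example.

```python
import itertools
import random

# Assumed model: LOAD_COEFFS[op][k] is the CPU fraction that operator `op`
# consumes per tuple/second arriving on input stream k. Values are made up.
LOAD_COEFFS = {
    "filter_a": (0.004, 0.0),
    "agg_a": (0.006, 0.0),
    "filter_b": (0.0, 0.004),
    "agg_b": (0.0, 0.006),
}
NODES = 2          # number of machines
CAPACITY = 1.0     # a node is overloaded above 100% CPU
MAX_RATE = 250.0   # upper bound on each sampled input rate


def node_loads(placement, rates):
    """Total CPU load per node for a given placement and input rates."""
    loads = [0.0] * NODES
    for op, node in placement.items():
        loads[node] += sum(c * r for c, r in zip(LOAD_COEFFS[op], rates))
    return loads


def feasible_fraction(placement, samples):
    """Fraction of sampled rate points where no node is overloaded."""
    ok = sum(1 for rates in samples
             if max(node_loads(placement, rates)) <= CAPACITY)
    return ok / len(samples)


def best_placement(samples):
    """Brute-force search over every assignment of operators to nodes."""
    ops = sorted(LOAD_COEFFS)
    candidates = (dict(zip(ops, assignment))
                  for assignment in itertools.product(range(NODES),
                                                      repeat=len(ops)))
    return max(candidates, key=lambda p: feasible_fraction(p, samples))


rng = random.Random(0)
samples = [(rng.uniform(0, MAX_RATE), rng.uniform(0, MAX_RATE))
           for _ in range(2000)]

# Two hand-picked placements: one node per stream, or each stream split.
by_stream = {"filter_a": 0, "agg_a": 0, "filter_b": 1, "agg_b": 1}
mixed = {"filter_a": 0, "agg_b": 0, "filter_b": 1, "agg_a": 1}

print("by_stream:", feasible_fraction(by_stream, samples))
print("mixed:    ", feasible_fraction(mixed, samples))
print("best:     ", best_placement(samples))
```

In this toy model, splitting each stream's operators across both nodes tolerates a wider range of rate combinations than dedicating one node to each stream, because a burst on one stream is shared by both machines. Brute force only works for a handful of operators; a real system would need a smarter search.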

While their work is designed to reduce the need to migrate operators, they explain that their algorithms cooperate well with dynamic systems. By choosing an initial distribution resilient to load variation, they reduce the need to perform expensive operator migrations. When a migration is needed, these algorithms can be used to recommend the distribution most resilient to the new expected loads.

Response

This paper, along with the previous work described, operates under the assumption that adding hardware to the cluster is a time-consuming operation. This assumption leads to a static view of the cluster, in which the only available action is to move operators from one existing node to another.

Cloud computing makes it possible to spin up hardware in just a few minutes to handle variations in system load, and to shut it down again when load decreases, which reduces cluster costs. With the concepts in this paper, the system can decide when additional hardware is needed and choose the best set of operators to migrate to the new hardware to maintain the desired latency.
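As a rough sketch of what such a decision could look like, the function below adds one node when any existing node is overloaded and moves the heaviest operators off the hottest node until it fits again. This is my own simplified greedy heuristic, not something taken from the paper, and the operator names and load numbers are hypothetical.

```python
def scale_out(placement, op_load, capacity):
    """Return a new placement, adding one node if any node is overloaded.

    placement: dict mapping operator name -> node index
    op_load:   dict mapping operator name -> current load of that operator
    capacity:  maximum load a single node can sustain
    """
    nodes = max(placement.values()) + 1
    loads = [0.0] * nodes
    for op, node in placement.items():
        loads[node] += op_load[op]

    hot = max(range(nodes), key=loads.__getitem__)
    if loads[hot] <= capacity:
        return dict(placement)  # no new hardware needed

    new_node = nodes   # index of the freshly started machine
    new_load = 0.0
    result = dict(placement)
    # Move the heaviest operators first to keep the migration count low.
    on_hot = sorted((op for op, node in placement.items() if node == hot),
                    key=op_load.__getitem__, reverse=True)
    for op in on_hot:
        if loads[hot] <= capacity:
            break
        if new_load + op_load[op] > capacity:
            continue  # this operator would overload the new node
        result[op] = new_node
        loads[hot] -= op_load[op]
        new_load += op_load[op]
    return result


# Hypothetical cluster: node 0 carries 1.2 units of load and is overloaded.
placement = {"parse": 0, "join": 0, "agg": 1}
op_load = {"parse": 0.5, "join": 0.7, "agg": 0.4}
scaled = scale_out(placement, op_load, capacity=1.0)
print(scaled)
```

A fuller version would also pick the operators whose move leaves the new placement most resilient to future load, and would shut nodes down again when load drops.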

As cloud computing is a fairly recent development, I expect a few years to pass before solid work is published describing cloud-friendly methods for load distribution.
