Really Bad reasons not to auto-scale cloud based systems

O'Reilly writer George Reese posted today what I consider to be a poor evaluation of the perils of auto-scaling in the cloud.

He does mention the concept of using a governor to limit the power of the auto-scale agent to spin up servers (and spend money), but his insight ends there. Anyone following cloudy issues will have read Don MacAskill's excellent post this past June, where he explains SmugMug's auto-scale operation and the need to set limits.

George also makes a few arguments against auto-scaling, which I'll address briefly:

1. Amazon and other clouds cannot respond fast enough to increased capacity needs.

George claims that with a 10-minute instance spin-up time, auto-scaling cannot respond fast enough to help. This is only true if you wait to spin up new instances until the existing capacity is already (or nearly) toast. Common strategies involve keeping some extra capacity running, so as not to fold immediately under an increase. Solving this problem is just a matter of tuning the thresholds.
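To make the threshold idea concrete, here is a rough sketch of picking a scale-up trigger that leaves enough headroom to cover the spin-up time. It assumes load grows roughly linearly while the new instance boots; the function names and the numbers in the usage note are purely illustrative, not taken from George's post or from any cloud provider's API.

```python
def scale_up_threshold(capacity, growth_per_min, spin_up_min):
    """Load level at which a new instance should be requested so that it
    is ready before current capacity runs out.

    Assumption: load keeps growing linearly at growth_per_min while the
    new instance is booting.
    """
    # Headroom is the load we expect to gain during the boot time.
    headroom = growth_per_min * spin_up_min
    return max(capacity - headroom, 0.0)


def should_scale_up(current_load, capacity, growth_per_min, spin_up_min):
    """True once the load has eaten into the headroom we need."""
    return current_load >= scale_up_threshold(capacity, growth_per_min, spin_up_min)
```

With a capacity of 1000 requests per minute, growth of 20 requests per minute each minute, and a 10-minute spin-up, the trigger lands at 800 rather than at 1000. Faster growth or slower boots simply push the trigger lower.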

2. Got any disgruntled employees, unhappy customers, or malicious competitors?

George claims that auto-scaling will waste your money in the event of a denial-of-service attack. What he doesn't mention is that a DoS on a non-auto-scaled system will likely take it down. At the very least, it will artificially inflate your usage anyway, and you will still have to spin up more resources to handle the load. I'd rather spend a few extra bucks and STAY UP.

3. So you think you'll stick some governors in place...

George's main claim here is that your governor is likely to be set at the wrong value. Although he doesn't explicitly say so, he seems to be implying that a governor can only be used to limit the total number of machines. SmugMug (in the aforelinked post) indicates that their governor limits the rate at which new machines can be started. Using this strategy, only the rate of growth is limited, not the total number of machines.
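A rate-limiting governor of that kind might look roughly like the sketch below. This is my own illustration of the idea, not SmugMug's actual implementation: the class name, the sliding-window approach, and the parameters are all assumptions. The caller passes in the current time in seconds so the behavior is easy to reason about.

```python
from collections import deque


class LaunchRateGovernor:
    """Caps how many instances may be launched per sliding time window.

    Hypothetical helper for illustration; it limits the rate of growth,
    not the total number of machines.
    """

    def __init__(self, max_launches, window_seconds):
        self.max_launches = max_launches
        self.window_seconds = window_seconds
        self._launch_times = deque()

    def allow_launch(self, now):
        """Return True and record the launch if the rate limit permits it."""
        # Forget launches that have aged out of the window.
        while self._launch_times and now - self._launch_times[0] >= self.window_seconds:
            self._launch_times.popleft()
        if len(self._launch_times) < self.max_launches:
            self._launch_times.append(now)
            return True
        return False
```

A governor built as LaunchRateGovernor(2, 600) allows at most two launches in any 10-minute window. Sustained real growth can still add machines every window, while a runaway agent can't start fifty servers in one minute.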

4. So what about getting slashdotted?

The main complaint here is that an auto-scale agent cannot tell the difference between true traffic growth and a random spike. Clearly, George has never worked with noise filters, which smooth data to reveal real trends. Evaluating load data from the past few minutes will allow agents to ignore spikes easily. Again, this reduces to tuning thresholds.
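The simplest noise filter is a moving average over the last few samples. The sketch below is only an illustration under assumed names and numbers (load as a percentage, one sample per minute, a window of five samples); a real agent might prefer an exponential average or a median.

```python
from collections import deque


class SmoothedLoad:
    """Moving average over the most recent load samples."""

    def __init__(self, window):
        # deque with maxlen drops the oldest sample automatically.
        self._samples = deque(maxlen=window)

    def add(self, sample):
        """Record a sample and return the current moving average."""
        self._samples.append(sample)
        return sum(self._samples) / len(self._samples)


def sustained_overload(samples, window, threshold):
    """True only if the smoothed load ends up above the threshold."""
    smoothed = SmoothedLoad(window)
    average = 0.0
    for sample in samples:
        average = smoothed.add(sample)
    return average > threshold
```

A single one-minute spike to 95% barely moves a five-minute average, so the agent does nothing. Five minutes of load in the 85 to 95% range pushes the average over an 80% threshold and triggers a scale-up. The window length and the threshold are the knobs to tune.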

5. Don't you lose a key value of the cloud without auto-scaling?

Despite George's claims that no value is lost, there are clear cases where auto-scaling can save your bacon. He claims that 'capacity planning' is the clear answer. I agree with him on the importance of capacity planning, but disagree that proper capacity planning eliminates the need to auto-scale. A good auto-scaling system can save quite a lot of money in cloud processing expenses, which will do wonders for the bottom line.

Summary

I'm not bashing capacity planning here. I believe that capacity planning concepts work very well with auto-scaling, and that proper use of governors and properly set thresholds is the right way to go.

I rarely respond to lousily written posts and dumb opinions, but this one irked me for some reason. At this point, I have nothing but logic and the experiences of others to rely upon. Over the next few years, I plan on gaining some extensive experience in auto-scaling cloud-based systems, and perhaps then I'll be in a better position to dish out a proper smack-down.

Comments
Sam,
2. Got any disgruntled employees, unhappy customers, or malicious competitors?
Your point is a little confusing here. If somebody does a continuous DoS attack for days, my credit card will be charged as if it were real usage. Is there a DoS attack stopper or detector in any cloud offerings?
# Posted By tom | 12/6/08 10:48 PM
Thanks for asking for a clarification.

What I was trying to point out was that a DoS will cause problems on both auto-scaled and non-auto-scaled systems. A long-term level of false traffic will require a greater number of servers on a non-auto-scaled service, the same way as on an auto-scaled one. A short-term level of false traffic (a typical DoS) will bring down a non-auto-scaled service, while an auto-scaled service will remain up, even though it costs more.

As I said, I'd rather pay the extra to keep the site up. So, if you have disgruntled employees, unhappy customers, or malicious competitors, auto-scaling is EXACTLY what you need.

As an aside, how much traffic would they throw at you? If you can handle normal load with 1 server, and they throw 10 times the normal traffic at you, then your EC2 charges will be (*gasp*) $1 an hour till you can block the DoS attack. If you normally run 10 servers, and they cause a scale to 100 servers (which is quite a lot of load), then you are only paying $10 an hour to stay up under the attack.

In either situation, you are likely to have a monitoring system alerting you to the increased traffic, which will allow you to monitor and adjust. Auto-scaling just gives you a faster response to the situation, which is likely to minimize any downtime experienced.
# Posted By Sam Curren | 12/7/08 7:28 AM
Hi.
Isn't tuning the thresholds basically... capacity planning?
Peter
# Posted By PeterKraus | 12/7/08 8:25 AM
Yes, Peter, I believe that tuning thresholds can be considered to be part of capacity planning. I believe (and state in the summary above) that capacity planning and auto-scaling are two strategies that work well together.
I'm not trying to argue that capacity planning is unnecessary, but that proper capacity planning does not eliminate the need or use of auto-scaling.
# Posted By Sam Curren | 12/7/08 12:27 PM
Well said, Sam. While reading the initial article, I was thinking how much I disagree with his points. True, while using cloud computing can be a very bad idea at times, it is the right tool for many applications. I think part of his backlash might have been seeing too many systems designed with auto-scaling used as a "new hammer". One could write an article about how objects should not be used if all the reference you had was bad examples.

One other massive benefit to using auto-scaling systems is development/setup costs. Let's say, hypothetically, you were trying to build a system to transcode videos for a small startup video sharing website. Transcoding is a simple, yet time-consuming process. Trying to buy all the servers necessary to keep the system running while still living under venture capital would be a nearly impossible task. The cost of the servers plus hosting could make it cost prohibitive. However, with a cheap cloud computing system, such as EC2, you can complete the project and keep costs to a minimum while still allowing the system to grow until you can 1) afford the servers to do the entire system in house and 2) gather enough info on trends to do proper capacity planning.

The second benefit mentioned above is development costs. Auto-scaling allows you to keep each individual node type in the system simple to develop and easy to maintain. Additionally, it helps solve some of the initial unforeseen design problems that the system may have, so that the transition from an auto-scaling cloud-based system to a true dedicated system is much easier. And at this point, you will (hopefully) have the project making money so that you can afford the capital costs associated with a true dedicated system.
# Posted By Michael Benson | 12/8/08 10:09 PM
Great rebuttal. I think it's ridiculous the way he completely wrote off auto scaling, but then goes on to say "oh well it's okay here". I think the main point here is both automatic scaling and capacity planning have their places. Using one and not the other could mean you're either spending too much money or aren't providing an appropriate level of service for your customers. It behooves anyone to use both in balance. Capacity planning for manually scaling the level of service and auto scaling for unpredictable or (hopefully not) missed capacity problems.
# Posted By Brad | 1/25/09 8:00 PM