Thursday, March 29, 2012

Instance Thrashing with Amazon EC2 Autoscaling

This post describes a situation our team ran into when we tried to use EC2 Auto Scaling for a particular application. We ended up not using auto scaling; instead, we allocated enough instances in advance to handle the anticipated load.

In a nutshell: automatically added instances could not handle the high load and were immediately removed from the pool, rendering auto scaling unusable for us.

The graph depicts CPU utilization for 30 EC2 large instances.
We started with 4 instances and allowed the group to grow to up to 30. The load test was designed to gradually increase the load to the expected maximum and then sustain it at that level for a period of time.
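For reference, a group with these bounds could be set up along the following lines with the AWS CLI; the group name, launch configuration, and availability zone here are placeholders, not our actual setup:

```shell
# Hypothetical names; the real group and launch configuration differed.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name load-test-group \
  --launch-configuration-name web-app-lc \
  --min-size 4 \
  --max-size 30 \
  --desired-capacity 4 \
  --availability-zones us-east-1a
```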
The 4 instances behaved as expected until we started experiencing request timeouts. New instances were added to the group as expected, but we observed no improvement in response time and no reduction in dropped requests. This continued until the load was eventually reduced.
During the high-load period, we kept querying for the number of healthy instances and consistently found it too low; the average was about 7.
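The kind of check we kept running can be sketched as below, here against a stubbed response shaped like the `DescribeAutoScalingGroups` API output; the group name, instance IDs, and statuses are made up for illustration, and in practice the data would come from the API via a client library:

```python
# Stubbed Auto Scaling group description; real data would come from
# the DescribeAutoScalingGroups API call. All values are invented.
group = {
    "AutoScalingGroupName": "load-test-group",  # hypothetical name
    "Instances": [
        {"InstanceId": "i-00000001", "HealthStatus": "Healthy"},
        {"InstanceId": "i-00000002", "HealthStatus": "Healthy"},
        {"InstanceId": "i-00000003", "HealthStatus": "Unhealthy"},
        {"InstanceId": "i-00000004", "HealthStatus": "Healthy"},
    ],
}

def healthy_count(group):
    """Count the instances the group currently considers healthy."""
    return sum(1 for inst in group["Instances"]
               if inst["HealthStatus"] == "Healthy")

print(healthy_count(group))  # 3 of the 4 stub instances are healthy
```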
It's not immediately apparent in the graph above, but it shows instances coming into the pool, their CPU utilization peaking, and then, soon after, the same instances' CPU utilization dropping off.
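The cycle we believe we were seeing can be sketched as a toy simulation; the warm-up time, health-check policy, and tick counts below are invented for illustration, not measured from our system:

```python
# Toy model of instance thrashing: a new instance needs WARMUP_TICKS
# before it can serve traffic, but under full load its health checks
# fail from the start, so it is removed before it becomes useful.
WARMUP_TICKS = 5          # invented: ticks before an instance can serve
UNHEALTHY_THRESHOLD = 2   # invented: failed checks before removal

def simulate(ticks, overloaded=True):
    pool = []          # each instance is [age, failed_checks]
    removed = 0
    for _ in range(ticks):
        pool.append([0, 0])              # autoscaling adds an instance
        survivors = []
        for inst in pool:
            inst[0] += 1
            # While still warming up under heavy load, health checks fail.
            if overloaded and inst[0] <= WARMUP_TICKS:
                inst[1] += 1
            if inst[1] >= UNHEALTHY_THRESHOLD:
                removed += 1             # marked unhealthy and terminated
            else:
                survivors.append(inst)
        pool = survivors
    return len(pool), removed

# Under sustained overload the pool never grows: new instances churn out
# almost as fast as they come in. Without overload the pool grows freely.
print(simulate(20, overloaded=True))
print(simulate(20, overloaded=False))
```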
We expected that as new instances were added to the pool, CPU utilization would drop across all instances. Since all instances were configured the same way, we expected more or less the same CPU utilization across the group. We also expected response times to fall in a similar pattern.
The average response time was stuck at 60 seconds, which is a signal that instances were dropping requests: the request timeout was set to 60 seconds, so timed-out requests skewed the average toward that number.
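To see how a hard timeout skews the mean, consider some made-up numbers: if most requests hit the 60-second cap and only a few complete quickly, the average lands near 60 regardless of how fast the successful requests were:

```python
# Invented latencies (seconds): 2 fast responses plus 38 requests
# that hit the 60-second timeout cap.
latencies = [0.5, 1.0] + [60.0] * 38

average = sum(latencies) / len(latencies)
print(round(average, 2))  # dominated by the 60 s timeouts
```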
This behavior led us to conclude that our instances could not start up normally under heavy load. Further investigation was certainly due, but we never got around to it. It was deemed safer to keep enough instances running to handle the anticipated load, which we confirmed in a subsequent load test.