Tuesday, December 11, 2012

We Won't Fail, and It Won't Be Fast

"Let's try it, and fail fast" may not always be a wise approach.

We had an issue: our project was being asked to carry out a piece of work in a predetermined way. We had fundamental disagreements with the proposed approach: we thought it was inefficient, added unnecessary complexity, and created a lot of waste. However, we faced considerable pressure to go along, given certain organizational and political constraints.

An idea was floated: "Let's just try it, and fail fast, when it becomes clear it's really not adequate for the task at hand." As appealing as this might seem, it carries considerable risk.

Let's say we did that. We proceed to develop the application the way we were asked. We waste our time building this interface, integrating with that system, persisting transient, in-process data elements, and so on. Now, at what point exactly do we fail fast?

Consider the following two observations:
  • We always want to make things work, and
  • Waste is not always acknowledged as such.
Even when we work in a less-than-ideal environment, we want to make the best of it. Even when our team is asked to accomplish a task while unfairly constrained, we will work hard to reach the best outcome. We will make it work, whatever the cost might be. What we end up with is still far from ideal, but we will go the distance, because we don't like to fail.

When we then try to show the powers that be how much extra complexity or waste there is, it's unlikely to be acknowledged as such. For example, if your team's productivity was hindered by a requirement to keep reams of documentation up to date at every step along the way, with no consumer of this information whatsoever, someone, likely from the powers-that-be circle, will rise to assert how valuable it will eventually be.

The world of large enterprises today is full of hidden inefficiencies, originating from siloed teams and competing divisions. While we should always challenge this state, we should also recognize when failing fast is not a realistic option.

Saturday, June 2, 2012

Profiling Lazy Evaluations

The point: Lazy sequence evaluation renders the results of simple code profiling useless. Alternative techniques must be devised to correctly find code hot spots.

Recently, I coded a function in Clojure that wasn't performing fast enough. Admittedly, I'm new to the language, so I tried the techniques I was familiar with to find where the function was spending most of its execution time. I used the time function, as well as the profile library from clojure.contrib, wrapping pieces of code from the outside in, trying to close in on the slow parts. After some time, I was not getting any useful information. At some point, I seemed to lose any indication that the wrapped code was spending any meaningful time executing.
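
To illustrate the effect, here is a contrived sketch (expensive-step is a made-up stand-in, not the actual project code). Timing a lazy operation directly reports almost nothing, because building a lazy sequence does no work up front:

    (defn expensive-step [x]
      (Thread/sleep 1) ; simulate real per-element work
      (* x x))

    ;; Timing the lazy map measures only the construction of the lazy seq:
    (def results (time (map expensive-step (range 1000))))
    ;; prints a tiny elapsed time -- no element has been computed yet

    ;; The cost shows up wherever the sequence is finally consumed:
    (time (reduce + results))
    ;; roughly a second of elapsed time, charged to the reduce, not the map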

That is, until I finally noticed something very telling.

In an attempt to speed up a piece of code, I was checking to see if a collection was empty before processing further. The code looked like this: (empty? coll), and it was taking one second to execute!

Obviously, something was wrong, and that something was my understanding of how to effectively profile this code. Since most of the underlying code used lazy functions and sequences by default, the empty check caused a chain of delayed execution functions to activate. Well, mystery solved, then; but how do I effectively profile this code to find where the time is being spent?
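
Here is a contrived reproduction of that surprise (slow-inc is a stand-in; my real pipeline was more involved). empty? only needs to know whether a first element exists, but producing that first element forces the pending lazy steps, and for chunked sequences (such as those built on range) a whole chunk of 32 elements at a time:

    (defn slow-inc [x]
      (Thread/sleep 30) ; simulate an expensive transformation
      (inc x))

    (def coll (map slow-inc (range 100)))

    (time (empty? coll))
    ;; ~960 msecs: the emptiness check realized the first chunk of
    ;; 32 elements, at roughly 30 ms each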

The most reliable measurements I got were from functions that are self-contained: those that process all of their input collections to return a number, for example, or those that don't process collections at all. reduce also exhausts collections, and calling count has a similar effect. I was able to use these techniques to improve the reported times because I knew that my code would process all the items in the underlying collections anyway, so forcing the realization of the full collection wouldn't change the overall time. But what if that weren't the case?
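
In practice, that meant forcing realization inside the timed expression, along these lines (a sketch reusing the slow-inc stand-in from above):

    (time (count (map slow-inc (range 100))))    ; count walks the whole seq
    (time (doall (map slow-inc (range 100))))    ; doall realizes every element
    (time (reduce + (map slow-inc (range 100)))) ; reduce consumes everything
    ;; each reports ~3000 msecs: 100 elements at ~30 ms apiece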

So far, I don't have a great answer. For the time being, I'll resort to measuring the most elementary of operations in the code, as they provide the most reliable information.

Thursday, March 29, 2012

Instance Thrashing with Amazon EC2 Autoscaling

This post explains a situation our team encountered when we tried to use EC2 Auto Scaling for a particular application. We didn't end up using auto scaling. Instead, we allocated enough instances in advance to respond to the anticipated load.

In a nutshell: Automatically added instances could not handle the high load, and were immediately removed from the pool, rendering auto scaling unusable.

The graph depicts CPU utilization for 30 EC2 large instances.
We started with 4 instances, allowing the group to grow to up to 30 instances. The load test was designed to gradually increase the load to the maximum expected, then sustain it at that level for a period of time.
The 4 instances behaved as expected, until the point where we started experiencing request timeouts. New instances were added to the group as expected, but we didn't observe an improvement in response time or a reduction in dropped requests. This continued until the load was eventually reduced.
During the high-load period, we kept querying for the number of healthy instances, and consistently found it too low; the average was about 7.
It's not immediately apparent in the graph above, but it does show instances coming into the pool, at which point their CPU utilization peaks, and then, soon after, the same instances' CPU utilization drops.
We expected that, with new instances added to the pool, CPU utilization would be reduced across all instances. Since all instances were configured the same way, we expected more or less the same CPU utilization across the group. We also expected the response time to go down in a similar pattern.
The average response time was stuck at 60 seconds, which is a signal that instances were dropping requests: the request timeout was set at 60 seconds, skewing the average toward that number.
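A rough back-of-the-envelope calculation shows why (made-up numbers, sketched in Clojure, not our measured data): once a sizable fraction of requests hit the timeout, the mean is pinned near 60 seconds no matter how fast the rest complete.

    ;; If a fraction p of requests hit the 60-second timeout and the
    ;; rest finish in fast-ms milliseconds, the mean is dragged toward
    ;; 60,000 ms:
    (defn mean-response-ms [p fast-ms]
      (+ (* p 60000) (* (- 1 p) fast-ms)))

    (mean-response-ms 0.95 500) ; ~57025 ms, effectively "stuck" at 60s
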
This behavior led us to conclude that our instances could not start up normally under heavy load. Further investigation was certainly due, but we never got around to it. It was deemed safer to keep enough instances up to respond to the anticipated load, which we confirmed in a subsequent load test.