Tuesday, December 11, 2012

We Won't Fail, and It Won't Be Fast

"Let's try it, and fail fast" may not always be a wise approach.

We had an issue: our project was being asked to carry out a piece of work in a predetermined way. We had fundamental disagreements with the proposed approach. We thought it was inefficient, and that it added unnecessary complexity and a lot of waste. However, we faced considerable pressure to go along, given certain organizational and political constraints.

The idea was floated: "Let's just try it, and fail fast, when it becomes clear it's really not adequate for the task at hand." As appealing as this might seem, it carries considerable risk.

Let's say we did that. We proceed to develop the application the way we are asked. We waste our time building this interface, integrating with that system, persisting transient, in-process data elements, etc. Now, at what point exactly do we fail fast?

Consider the following two observations:
  • We always want to make things work, and
  • Waste is not always acknowledged as such.
Even when we work in a less than ideal environment, we always want to make the best of it. Even when our team is asked to accomplish a task while unfairly constrained, we will work hard to reach the best outcome. We will make it work, whatever the cost might be. What we end up with is, still, far from ideal, but we will go the distance, because we don't like to fail.

When we then try to show the powers that be how much more complexity or waste there is, it's unlikely that it will be acknowledged as such. For example, if your team's productivity was hindered by a requirement to keep reams of documentation up to date at every step along the way, with no consumer of this information whatsoever, someone, likely from the circle of the powers that be, will rise to assert how valuable that documentation will eventually be.

The world of large enterprises today has lots of hidden inefficiencies, originating from siloed teams and competing divisions. While we should always challenge this state, we should also understand when failing fast is not a realistic option.

Saturday, June 2, 2012

Profiling Lazy Evaluations

The point: Lazy sequence evaluations render the results of simple code profiling useless. Alternate techniques must be devised to correctly find code hot spots.

Recently, I coded a function in Clojure that wasn't performing fast enough. Admittedly, I'm new to the language, so I tried the techniques I was familiar with to find where the function was spending most of its execution time. I used the time function, as well as the profile library from clojure.contrib, wrapping pieces of code from the outside in, trying to close in on the slow parts. After some time, I still was not getting any good information. It seemed that at some point I lost any indication that meaningful time was being spent in the code being wrapped.

That is, until I finally noticed something very telling.

In an attempt to speed up a piece of code, I was checking to see if a collection was empty before processing further. The code looked like this: (empty? coll), and it was taking one second to execute!

Obviously, something was wrong, and that was my understanding of how I can effectively profile this code. Since most of the underlying code was using lazy functions and sequences by default, the empty check caused a chain of delayed execution functions to activate. Well, mystery solved then, but how do I effectively profile this code to find where the time is being spent?
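
The effect is easy to reproduce outside Clojure. The sketch below is a Python analogue, not the code from this post: Python's built-in map is lazy in a comparable way, slow_step is a hypothetical stand-in for an expensive per-item function, and peeking at the first item plays the role of the (empty? coll) check. Clojure may realize several items at a time, so the cost landing on the check there can be larger than a single item.

```python
import time

def slow_step(x):
    # Hypothetical expensive per-item work.
    time.sleep(0.02)
    return x + 1

def timed(thunk):
    """Return (result, elapsed_seconds) for calling thunk()."""
    start = time.perf_counter()
    result = thunk()
    return result, time.perf_counter() - start

# Two chained lazy stages; building them performs no per-item work.
pipeline, build_time = timed(lambda: map(slow_step, map(slow_step, range(10))))

# Peeking at the first item (the analogue of an emptiness check) runs
# both deferred stages for that item, so the cost shows up here.
first, peek_time = timed(lambda: next(pipeline, None))

print(f"build: {build_time:.4f}s  peek: {peek_time:.4f}s  first item: {first}")
```

Timing the line that builds the pipeline reports almost nothing, while the innocent-looking peek pays for the deferred work.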

The most reliable measurements I got were from functions that are self-contained: those that process all of their input collections to return a number, for example, or those that don't process collections at all. reduce also exhausts collections, and calling count has a similar effect. I was able to use these techniques to improve the reported times because I knew that my code would process all the items in the underlying collections anyway, so forcing the realization of the full collection wouldn't change the overall time. But what if that weren't the case?
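
One way to picture the forcing trick, again as a Python analogue rather than Clojure: realize each stage's output fully inside the timed region, which is roughly what doall, count, or an exhausting reduce achieve in Clojure. The helper timed_realized and the two stage functions are hypothetical names made up for this sketch. It only gives honest numbers when the program would have consumed every item anyway.

```python
import time

def parse(x):
    # Hypothetical cheap stage.
    return x * 2

def enrich(x):
    # Hypothetical slow stage.
    time.sleep(0.005)
    return x + 1

def timed_realized(label, lazy_iterable, report):
    """Realize lazy_iterable fully and record the elapsed time under label."""
    start = time.perf_counter()
    items = list(lazy_iterable)  # forces every deferred computation now
    report[label] = time.perf_counter() - start
    return items

report = {}
parsed = timed_realized("parse", map(parse, range(20)), report)
enriched = timed_realized("enrich", map(enrich, parsed), report)

for label, seconds in report.items():
    print(f"{label}: {seconds:.4f}s")
```

Because each stage receives an already realized list, the time recorded under a label belongs to that stage alone instead of leaking into whichever later call happens to touch the sequence first.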

So far, I don't have a great answer. For the time being, I'll resort to measuring the most elementary of operations in the code, as they provide the most reliable information.

Thursday, March 29, 2012

Instance Thrashing with Amazon EC2 Autoscaling

This post explains a situation our team encountered when we tried to use EC2 auto scaling for a particular application. We didn't end up using auto scaling. Instead, we allocated enough instances in advance to respond to the anticipated load.

In a nutshell: Automatically added instances could not handle the high load, and were immediately removed from the pool, rendering auto scaling unusable.

The graph depicts CPU utilization for 30 EC2 large instances.
We started with 4 instances, allowing the group to grow to up to 30 instances. The load test was designed to gradually increase the load to the maximum expected, and then sustain it at that level for a period of time.
The 4 instances behaved as expected, until the point where we experienced request timeouts. New instances were added to the group as expected, but we didn't observe an improvement in response time, or a reduction in dropped requests. This continued until the load was eventually reduced.
During the high load period, we kept querying for the number of healthy instances, and always found the number too low. The average was about 7.
It's not immediately apparent in the graph above, but it does show instances coming into the pool, at which point their CPU utilization peaks, and then, soon after, the same instances' CPU utilization going down.
We expected that, with new instances added to the pool, CPU utilization would be reduced across all instances. Since all instances are configured the same way, we expected more or less the same CPU utilization across the group. We also expected the response time to go down in a similar pattern.
The average response time was stuck at 60 seconds, which is a signal that instances were dropping requests. The request timeout was set at 60 seconds, so timed-out requests skewed the average toward this number.
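
To see why the timeout pulls the average toward 60 seconds, consider this small Python calculation. The request mixes and the 2-second healthy response time are made-up illustration values, not measurements from our test; only the 60-second timeout comes from our setup.

```python
TIMEOUT_SECONDS = 60.0  # the request timeout used in the load test

def mean_response_time(timed_out, served, served_seconds):
    """Average response time when timed-out requests are recorded at the timeout."""
    total = timed_out * TIMEOUT_SECONDS + served * served_seconds
    return total / (timed_out + served)

# Hypothetical mixes: the larger the share of timeouts, the closer the mean gets to 60.
for timed_out, served in [(50, 50), (90, 10), (99, 1)]:
    mean = mean_response_time(timed_out, served, served_seconds=2.0)
    print(f"{timed_out}% timed out -> mean {mean:.1f}s")
```
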
This behavior led us to conclude that our instances could not start up normally under a heavy load. Further investigation was certainly due, but we never got around to it. It was deemed safer to keep enough instances up to respond to the anticipated load, which we confirmed in a subsequent load test.

Thursday, July 22, 2010

A Story of a Software Project

Iteration 1: The team picks up the first stories, and makes good progress. The result is showcased to the customers. The mood is encouraging.
Iteration 2: The team churns a bit, trying to get the first stories closed and iron out some technology choices.
Iteration 3: Not enough stories are being closed. The team's velocity is lower than needed. The PM gets worried, and starts calling meetings and raising red flags. The PM declares that the team needs to catch up, and look for ways to increase velocity.
Iteration 4: The team makes good strides, and appears to be back on track. Nerves settle down a bit.
Iteration 5: The team's velocity is soaring. The PM says that given the current velocity, the team will meet its target date. The team starts taking care of technical debt.
Iteration 6: The business functionality is taking shape. The customers start to get a feel of how the system works. They start asking for modifications.
Iteration 7: The customers become more demanding. They notice some gaps between the functionality of the application, and what is needed to run the business. Defects start to creep in.
Iteration 8: The PM talks the business out of some of their demands, and the team devises workarounds for some outstanding issues.
Iteration 9: Faced with approaching deadlines, the PM asks developers to stay late and work over the weekends to finish the remaining tasks. The code quality suffers and technical debt increases.
Iteration 10: The team manages to finish all the remaining tasks. The application is put in production, with minor hiccups. Time to celebrate.


Does this sound familiar?
The team delivered on time. Is there a problem here?
The above pattern causes hardship for the team. The resulting code quality is rarely satisfactory. But we shouldn't be surprised or get overly worked up because of things that are really to be expected:
- Estimates are not always met, because they are estimates.
- The team takes more time in the first iterations because it's the first time this team tackles this problem.
- The customers don't like what they see the first time, because it's the first time they see it.
- The project is taking more time than expected because our expectations are just now being reality checked.
- The developers are being asked to work extra time because the team's management over-promised. However, the developers had no clue initially whether these promises could be met. Everything looked good on paper.

What's the way out of this?
This is not an easy problem. What makes it even more difficult is the fact that the team delivered after all, reducing the incentive for change. There are ways, however, to make things better:
- Educate all parties on all aspects of the project.
- Get the customers involved as early as possible.
- Manage all parties' expectations.
- Communicate regularly, and facilitate information sharing.
- Make it clear that the process of adaptation also includes dates and scope.
- Learn from the past. If you've seen this before, it's likely that you'll see it again unless you change your approach.

Thursday, November 5, 2009

Shouldn't We Local-Optimize at Bottlenecks?

The short answer is no. Once we start thinking local, we are heading down the wrong path.

Consider what we should do at a bottleneck:
  • Increase the resource's throughput, by increasing its efficiencies.
  • Manage the flow in the system to reduce idle time at the resource.
  • Add more capacity, by introducing other resources capable of the same function.
  • Outsource a portion of the work to resources outside the system.
  • Rethink the need for some work to go through the bottleneck.
Notice that only the first of these points is local in nature, and we should consider it as just one option. It may not be the best one.

Friday, October 23, 2009

What is Wrong with Local Optimization Anyway?

How could it be wrong to optimize anything, local or not?
Well, if by local optimization we mean having a resource in our system utilize an optimum amount of its inputs, to produce timely, sufficient, but not excessive, output to subsequent steps in the process, then there is nothing wrong, as long as this optimization contributes positively to the system's goal.
Note that timely, sufficient, and not excessive, output is defined by subsequent steps in the process. As such, this output might, at times, be zero.
Note also that optimizing the whole system may call for one step or process to be removed altogether.
If this is how we are approaching the problem of efficiency, then we are not actually doing local optimization.

Consider, however, the following approaches to optimization:
  • Increasing the resource utilization to 100%.
  • Getting the maximum possible throughput out of every resource.
  • Keeping everyone busy all the time.
  • Removing all idle time.
If this is our focus, then we are heading for trouble, and we are introducing significant waste into our system.

To see why this is the case, consider the following consequences of increasing a resource's throughput in our system to the maximum:
  • More inventory to manage in subsequent processes, if these subsequent processes are not ready for, or capable of, consuming all the output.
  • More load on subsequent processes, since now they will have more input to process.
  • Delays in getting urgent work done, since there is no slack in the system to handle occasional spikes, resulting from natural statistical variations.
  • More work being stuck at bottlenecks.
  • Increasing demand artificially on up-stream processes, since this demand is not driven by the needs of the market or the ultimate customers.
  • Increasing demand on resources required to maintain the high efficiency.
  • The process of optimization itself will consume resources. The overall gain may not exceed the cost.
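
The first few consequences can be illustrated with a toy simulation in Python. Everything here is hypothetical: a two-step line where the upstream step can make 10 units per tick and the downstream step can consume only 6.

```python
def simulate(upstream_rate, downstream_capacity, ticks):
    """Return (units_delivered, inventory_waiting) after the given number of ticks."""
    inventory = 0
    delivered = 0
    for _ in range(ticks):
        inventory += upstream_rate                      # upstream pushes its output
        consumed = min(inventory, downstream_capacity)  # downstream takes what it can
        inventory -= consumed
        delivered += consumed
    return delivered, inventory

# Upstream at maximum throughput versus paced to what downstream can absorb.
maxed = simulate(upstream_rate=10, downstream_capacity=6, ticks=100)
paced = simulate(upstream_rate=6, downstream_capacity=6, ticks=100)

print("maxed out:", maxed)
print("paced:    ", paced)
```

Both runs deliver the same 600 units. The maxed-out run also leaves 400 units of work in progress sitting between the two steps, which someone has to store and manage.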

Local efficiency, then, is a waste. One has to look for alignment with the system's goals to define what, where, how, and how much to optimize, weighing costs against benefits.

"But wait," you may remark, "how about local optimization at bottlenecks?", which is, granted, a nice try. But this will have to wait for another post.

Wednesday, October 7, 2009

Does Waterfall Make More Sense?

I came in contact with a few people who were actually content with waterfall.
A senior dev explained to me that waterfall is simple, everyone gets it, and it's easy to implement. There are well defined, easy, consecutive steps to be followed.
An upper manager was very keen to find ways to convince his company's leadership that waterfall fit his department really well, thus avoiding the drive to adopt agile. From his perspective, waterfall provided predictability. He knew at the beginning of the year what his budget was, what projects he would be working on, and what the duration of each project would be.

It also makes sense to design something before building it. If you don't design it beforehand, how do you know what you will be building? How do you know how much it will cost? How do you decide if it's worth it? How do you compare it to other options?

In our day-to-day life, we demand predictability. Before we offer a job to a carpenter to install new kitchen cabinets, or ask a mechanic to service our car, for example, we want to know beforehand how much it will cost, and how long it will take. We are really disturbed when either of these estimates is not met, although we know they are just estimates.

So what is the problem in expecting the same from software projects?
We can always give the example of an apparently simple job gone badly, as when an air conditioning engineer starts asking you when was the last time you cleaned your air vents, or changed the air filters, only to discover that you'd have to pay more and wait longer to have your air system fixed. Let's put this example aside for now.
Instead, consider how likely change is in your project, from inception to end, in the following areas:
  • The business needs from your application.
  • The specified requirements from your application.
  • Your understanding of the requirements.
  • Technology.
  • The project team's mastery of the technology.
  • The people who are doing the work.
If your meter reads anything other than low for all the above, you should rethink waterfall.
Waterfall does make sense for certain types of projects. Software development projects, however, are a breed of their own, with a lot of sub-varieties within.
And change is inevitable in software projects, because you'll never build the same application twice, for the same business need, with the same people, who have the same experience, using the same technology, will you?