Friday, April 4, 2014

Word Request: Email is Failing to Advance our Understanding

So, here is the situation: we are having a discussion over email, and the discussion is now exhibiting the following characteristics:
  • It is getting nowhere.
  • Everyone seems determined in their ways.
  • We are pretty much talking over each other.
  • It’s as if we are communicating on different wavelengths.
We need to recognize the situation, and put an end to this.

And therein lies my humble request: a single word that someone can reply with to identify the situation and call for continuing the discussion over another medium.

I’ve seen the pattern above so many times that I believe labeling it will help us all communicate better. Here are some wordy descriptions that may be used:
  • Continuing this discussion over email is harmful.
  • Continuing this discussion over email will cause more harm than good.
  • This discussion thread has reached a point of negative return.
  • We have reached a point where we should continue this discussion using a different medium.
  • Email is not the best way to continue this discussion.
  • Email is failing to communicate the point.
  • Email is failing to advance our understanding.
  • Email fails to advance the subject.
Yes, we can make an acronym, but down with acronyms. The world doesn’t need another EFAU, or EFAS. Let’s keep that as a last resort.

Any ideas?

Tuesday, March 11, 2014

How to Make Feature Teams Work for You

Let's say that I'm a CTO, and my IT program management is a mess. I can't allocate budget based on business priorities and company-wide initiatives. So I decide to unify my entire IT workforce into a single pool of resources, and divide it into short-lived feature teams, formed around my current priorities. As my priorities change, I dissolve some teams and form others. And it is like a dream come true: now I can allocate my budget easily, predict my spending, and avoid wasting money on projects that are not aligned with my goals. Or is it?

Let’s look at this in a bit more detail. Feature teams will either work on existing assets, or develop new ones, and perhaps decommission old ones in the process. Now, what will be the state of these assets while feature teams work on delivering to their respective initiatives?

Consider the following analogy. Let's say that every one of my IT assets, be it an application or a backend service, is an airplane. Some teams are building new airplanes. Other teams are adding capabilities to existing airplanes. Some airplanes are in-flight, delivering business value. Some airplanes require more maintenance than others. But each one has its specialized flying instructions and maintenance procedures. Consider my feature teams to be the flying and maintenance crews. Based on my business priorities, I assign crews to airplanes; building new ones, adding capabilities to existing ones, or decommissioning and replacing some, or part thereof, with others, or parts thereof. All the while keeping all airplanes flying.

No, wait, who is keeping the airplanes flying? That is not a business imperative of any of the feature teams. An implicit assumption, maybe? Is there a separate maintenance crew to keep all airplanes flying? How are we going to ensure that feature teams will respect the flying procedure of each airplane?

And how are we going to ensure that feature teams will not step on each other's toes? How do we ensure that the flying integrity of each airplane is not compromised?

If someone would tell me that this is similar to the model popularized by Spotify, then I would respectfully beg to differ. And in short, we can't simply wish for all the benefits of feature teams, while ignoring all the challenges above.

Let's agree on one thing first: I should be able to shift some of my budget and resources to respond to my current priorities. And at the same time, I should keep all my assets healthy, and all my airplanes in a perfect flying condition.

How can I achieve both of these goals?

It is imperative that I keep a focused, long-lived team around every one of my assets. The size of this team will depend on the complexity of each asset. We can even have a team be responsible for multiple assets.
In addition to these asset teams, we can still have our feature teams. But if a feature team would like to modify an asset to deliver a business need, they will have to embed into, or extend, the core asset team. This extended team will have to work in harmony to preserve the integrity of the asset, ensure the compatibility of the new implementation with the existing architecture, and stay aligned with the long-term goals for that asset. Any changes will need to be approved by the core team. If a feature team needs to change multiple assets, then they'll have to divide up, or work on one at a time. There could be multiple feature teams extending the same core team for one asset. If this is the case, or if there is an asset that undergoes continuous changes by multiple feature teams, consider expanding this asset's core team.

This way, you'll ensure that all your airplanes are in sound flying condition, and that none will go crashing on you because of a lack of maintenance, or because different crews are trying to pull them in different directions.

Friday, January 17, 2014

Successful Distributed Development: Discipline, Awareness, and Initiative

For many of today's IT organizations, distributed development is not optional. Rather, it is a fact of life. We should be mindful of the limitations introduced by this mode of operation, while working constantly to mitigate them. It is a challenge that we have to actively tackle. Focusing solely on the tools misses the opportunity for true collaboration. Tools should be considered an enabler; we won't be able to work remotely without tools. But having the tools, by itself, doesn't guarantee success.

It's important for us to understand what the ideal is, so that we can strive to come as close as possible to achieving it, given whatever constraints our work environment imposes on us. There is no substitute for face-to-face interaction. There is nothing better than a co-located, cross-functional team. While we design our work activities, we should strive to converge as much as possible to these ideals.

I'm guessing all of this is not new to you. Yet time and again, distributed projects suffer from communication breakdowns, misunderstandings, and unmet expectations, among many other dysfunctions, often resorting to heavy processes that only make things worse. There are successful distributed teams, however. Below are a few of the traits exhibited by those teams. Adopting these traits in your distributed environment can help you converge more to the ideal.

Discipline

I often hear project leaders complain that their teams are not working very well together despite having state-of-the-art telecommunication tools. It is valid to question the tools' adequacy, ease of use, and so on. It is more important to observe when, how, and even if, the team is using them. Compare your current situation to that of a co-located team. Play a what-if scenario: what if everyone had been working in the same room?
You will need a dedicated facilitator, a catalyst of sorts, in every one of your locations, to keep nudging people to reach out to each other. It is not fair to simply expect everyone to remember to get out of their immediate challenge to seek help, or to solicit a different perspective. Software development is an intellectually demanding profession, and it is not uncommon for people to get consumed by their immediate task, and forget to reach out. There is a set of dynamics that can only happen if the whole team is co-located. You could hear a couple of coworkers arguing about a problem you solved yesterday, or perhaps solving a problem you'll face tomorrow. You could have just come out of a planning session, with a new understanding of the product vision, and you may just share that with your colleagues. And this is where the facilitator's role comes in: to play the what-if scenario above. How about we tell the other offices about this? How about we consult with other locations to see if they encountered something similar? But it doesn't stop there.

A disciplined distributed team will embrace certain values and adhere to a set of practices that ensure that the whole team operates as a single, cohesive unit. It's easier for us to fall back to our comfort zones, or be consumed by tasks, than to always remember to reach out. The facilitator's role is to make sure that the distributed team never misses an opportunity to act as a co-located one, whenever possible.

Awareness

Let's face it, we don't naturally know how to effectively work in a distributed environment. We may know we have teammates in other locations. We may be curious how they spend their days, wanting to learn the challenges they face, and looking forward to meeting them in person. This doesn't mean that this knowledge will be translated into changing how we perform our day to day work to adapt and address the limitations of this setting.

Well, our designated facilitator is there to change this reality. By actively seeking opportunities for cross-site collaboration, events, and feedback sessions, they continuously raise the whole team's awareness that a distributed setting requires a different mode of operation.

It's only when all the team members are fully aware that location barriers start coming down. There are certain signs you can watch for to assess whether the team has reached such a level. For example, statements like "let's wait until you are here next week to discuss this", "I couldn't really explain myself to the other team over the video conference", or "I couldn't really tell if they were happy or mad with this change" are clear red flags. In contrast, when the team has achieved full awareness, they won't let location be a barrier to effectively communicating ideas, or be a factor in whether or not they collaborate on a task. They become adept at explaining themselves and actively seeking feedback, thus consciously and proactively overcoming the limitations of the tools and the remote setting.

Initiative

While all of the above is good by itself, and while an active facilitator can help tremendously towards this end, we will need to have a team of self-motivated individuals to really conquer location barriers. Individuals are the ones that carry out the necessary tasks to accomplish the team goals, and they are the ones who experience firsthand the pains and the joys of getting things done. Unless we all are willing to step out of our comfort zones, seek new ways to make things better and overcome the daily challenges, and push each other to continuously improve, we won't have a chance of achieving our collective full potential.

I'll leave you with a quick tip: when choosing a mode of communication with a colleague, consider upgrading to a higher-touch one. If you are about to send an email, how about an instant message instead? While you are at it, wouldn't a phone call be even better? But then, what's preventing you from having the discussion face to face? Oftentimes, we overestimate the risk of being disruptive, while underestimating the limitations of more passive forms of communication, like email. If you've experienced long-winded email threads that seemingly go on forever, consider getting together in a room, virtual as it may be, to talk things over. The dynamic will be vastly different, in a good way.

If we can’t all co-locate, we should try to come as close to it as we possibly can. We must be deliberate and disciplined in how we adapt our work to this new environment, maintain constant awareness of the situation, and demonstrate the initiative to challenge its limitations.

Tuesday, December 11, 2012

We Won't Fail, and It Won't Be Fast

"Let's try it, and fail fast" may not always be a wise approach.

We had an issue: our project was being asked to carry out a piece of work in a predetermined way. We had fundamental disagreements with the proposed approach. We thought it was inefficient, added unnecessary complexity, and created a lot of waste. However, we faced considerable pressure to go along, given certain organizational and political constraints.

The idea was coined: "Let's just try it, and fail fast, when it becomes clear it's really not adequate for the task at hand." As appealing as this might seem, it embodies considerable risk.

Let's say we did that. We proceed to develop the application the way we are asked. We waste our time building this interface, integrating with this system, persisting transient, in-process data elements, etc. Now, at what point exactly do we fail fast?

Consider the following two observations:
  • We always want to make things work, and
  • Waste is not always acknowledged as such.
Even when we work in a less than ideal environment, we always want to make the best of it. Even when our team is asked to accomplish a task while unfairly constrained, we will work hard to reach the best outcome. We will make it work, whatever the cost might be. What we end up with is, still, far from ideal, but we will go the distance, because we don't like to fail.

When we then try to show the powers that be how much more complexity or waste there is, it's unlikely that it will be acknowledged as such. For example, if your team's productivity was hindered by a requirement to keep reams of documentation up to date at every step along the way, with no consumer of this information whatsoever, someone, likely from the powers-that-be circle, will rise to assert how valuable that will eventually be.

The world of large enterprises today has lots of hidden inefficiencies, originating from silo-ed teams, and competing divisions. While we should always challenge this state, we should also understand when failing fast is not a realistic option.

Saturday, June 2, 2012

Profiling Lazy Evaluations

The point: Lazy sequence evaluations render the results of simple code profiling useless. Alternate techniques must be devised to correctly find code hot spots.

Recently, I coded a function in Clojure that wasn't performing fast enough. Admittedly, I'm new to the language, so I tried the techniques I'm familiar with to find where the function was spending most of its execution time. I used the time macro, as well as the profile library from clojure.contrib, wrapping pieces of code from the outside in, trying to close in on the slow parts. After some time, I still was not getting any good information. At some point in the narrowing, the measurements stopped showing any meaningful time being spent in the code I had wrapped.

That is, until I finally noticed something very telling.

In an attempt to speed up a piece of code, I was checking to see if a collection is empty before processing further. The code looked like this: (empty? coll), and it was taking one second to execute!

Obviously, something was wrong, and that was my understanding of how I can effectively profile this code. Since most of the underlying code was using lazy functions and sequences by default, the empty check caused a chain of delayed execution functions to activate. Well, mystery solved then, but how do I effectively profile this code to find where the time is being spent?
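
To make the effect concrete, here is a minimal sketch, not the original code. slow-inc is a hypothetical stand-in for an expensive per-item step, and the input is a plain list so that elements are realized one at a time (chunked sources such as vectors and ranges realize up to 32 elements per step, which would blur the picture).

```clojure
;; Hypothetical stand-in for an expensive per-item step (not from the original code).
(defn slow-inc [x]
  (Thread/sleep 100)
  (inc x))

;; map returns a lazy sequence, so wrapping it in time measures almost nothing:
;; no element has been computed yet.
(def pending (time (map slow-inc (list 1 2 3 4 5))))

;; The first consumer pays the deferred cost. empty? calls seq, which has to
;; realize the first element, so the sleep is charged to this innocent-looking check.
(time (empty? pending))
```

The first time call should report close to zero, while the second should report roughly the cost of one slow-inc call. In a longer chain of lazy operations, that second number can grow to cover everything that was deferred upstream.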

The most reliable measurements I got were from functions that are self-contained: those that process all of their input collections to return a number, for example, or those that don't process collections at all. reduce also exhausts collections, and calling count has a similar effect. I was able to use these techniques to improve the reported timings because I knew that my code would process all the items in the underlying collections anyway, so forcing the realization of the full collection wouldn't change the overall time. But what if that wasn't the case?
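
A minimal sketch of forcing realization inside the timed expression follows. slow-inc is again a hypothetical stand-in for real work. doall and dorun are not mentioned above, but they are the clojure.core functions meant for exactly this purpose; reduce and count are the two approaches I actually used.

```clojure
;; Hypothetical stand-in for an expensive per-item step.
(defn slow-inc [x]
  (Thread/sleep 100)
  (inc x))

;; doall walks the whole lazy sequence and returns it, so the elapsed time
;; covers every element.
(time (doall (map slow-inc (list 1 2 3 4 5))))

;; dorun does the same walk but discards the results, which is enough when
;; only the timing matters.
(time (dorun (map slow-inc (list 1 2 3 4 5))))

;; reduce and count also consume the entire sequence.
(time (reduce + (map slow-inc (list 1 2 3 4 5))))
(time (count (map slow-inc (list 1 2 3 4 5))))
```

Each of these should report roughly five times the cost of a single slow-inc call. The caveat from above still applies: this is only a fair measurement when the real code would have consumed the whole sequence anyway.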

So far, I don't have a great answer. For the time being, I'll resort to measuring the most elementary of operations in the code, as they provide the most reliable information.

Thursday, March 29, 2012

Instance Thrashing with Amazon EC2 Autoscaling

This post explains a situation our team encountered when we tried to use EC2 auto scaling for a particular application. We didn't end up using auto scaling. Instead, we allocated enough instances in advance to respond to anticipated load.

In a nutshell: Automatically added instances could not handle the high load, and were immediately removed from the pool, rendering auto scaling unusable.

The graph depicts CPU utilization for 30 EC2 large instances.
We started with 4 instances, allowing the group to grow to up to 30 instances. The load test was designed to gradually increase the load to the maximum expected, and then sustain it at that level for a period of time.
The 4 instances behaved as expected, until the point where we experienced request timeouts. New instances were added to the group as expected, but we didn't observe an improvement in response time, or a reduction in dropped requests. This continued until the load was eventually reduced.
During the high load period, we kept querying for the number of healthy instances, and always found the number too low. The average was about 7.
It's not immediately apparent in the graph above, but it does show instances coming into the pool, at which point their CPU utilization peaks, and then, soon after, the same instances' CPU utilization going down.
We expected that, with new instances added to the pool, CPU utilization would be reduced across all instances. Since all instances are configured the same way, we expected more or less the same CPU utilization across the group. We also expected the response time to go down in a similar pattern.
The average response time stayed at 60 seconds, which is a signal that instances were dropping requests. The request timeout was set at 60 seconds, thus skewing the average toward this number.
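
As a rough illustration of how a timeout drags the average, consider the following sketch. The request counts and the 2-second latency are made-up numbers, not measurements from our test; only the 60-second timeout comes from the setup above.

```clojure
;; The request timeout used in the load test.
(def timeout-secs 60)

;; Arithmetic mean of a collection of numbers.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

;; Hypothetical mix: 90 requests hit the timeout, 10 complete in 2 seconds.
(def latencies (concat (repeat 90 timeout-secs) (repeat 10 2)))

(mean latencies) ; => 54.2
```

With 90 of 100 requests timing out, the mean is already 54.2 seconds. As the timed-out share approaches 100%, the mean converges on the timeout itself, which matches an average pinned at 60 seconds.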
This behavior led us to conclude that our instances could not start up normally under a heavy load. Further investigation was certainly due, but we never got around to it. It was deemed safer to keep enough instances up to respond to the anticipated load, which we confirmed in a subsequent load test.

Thursday, July 22, 2010

A Story of a Software Project

Iteration 1: The team picks up the first stories, and makes good progress. The result is showcased to the customers. The mood is encouraging.
Iteration 2: The team churns a bit, trying to get the first stories closed and iron out some technology choices.
Iteration 3: Not enough stories are being closed. The team's velocity is lower than needed. The PM gets worried, and starts calling meetings and raising red flags. The PM declares that the team needs to catch up, and look for ways to increase velocity.
Iteration 4: The team makes good strides, and appears to be back on track. Nerves settle down a bit.
Iteration 5: The team's velocity is soaring. The PM says that given the current velocity, the team will meet its target date. The team starts taking care of technical debt.
Iteration 6: The business functionality is taking shape. The customers start to get a feel of how the system works. They start asking for modifications.
Iteration 7: The customers become more demanding. They notice some gaps between the functionality of the application, and what is needed to run the business. Defects start to creep in.
Iteration 8: The PM talks the business out of some of their demands, and the team devises workarounds for some outstanding issues.
Iteration 9: Faced with approaching deadlines, the PM asks developers to stay late and work over the weekends to finish the remaining tasks. The code quality suffers and technical debt increases.
Iteration 10: The team manages to finish all the remaining tasks. The application is put in production, with minor hiccups. Time to celebrate.


Does this sound familiar?
The team delivered on time. Is there a problem here?
The above pattern causes hardship for the team. The resulting code quality is rarely satisfactory. But we shouldn't be surprised or get overly worked up because of things that are really to be expected:
- Estimates are not always met, because they are estimates.
- The team takes more time in the first iterations because it's the first time this team tackles this problem.
- The customers don't like what they see the first time, because it's the first time they see it.
- The project is taking more time than expected because our expectations are just now being reality checked.
- The developers are being asked to work extra time because the team's management over-promised. However, the developers had no clue initially whether these promises could be met. Everything looked good on paper.

What's the way out of this?
This is not an easy problem. What makes it even more difficult is the fact that the team delivered after all, reducing the incentive for change. There are ways, however, to make things better:
- Educate all parties on all aspects of the project.
- Get the customers involved as early as possible.
- Manage all parties' expectations.
- Communicate regularly, and facilitate information sharing.
- Make it clear that the process of adaptation also includes dates and scope.
- Learn from the past. If you've seen this before, it's likely that you'll see it again unless you change your approach.