Friday, April 4, 2014

Word Request: Email is Failing to Advance our Understanding

So, here is the situation: we are having a discussion over email, and the discussion is now exhibiting the following characteristics:
  • It is getting nowhere.
  • Everyone seems set in their ways.
  • We are pretty much talking past each other.
  • It’s as if we are communicating on different wavelengths.
We need to recognize the situation, and put an end to this.

And therein lies my humble request: a word someone can respond with that names the situation and calls for continuing the discussion over another medium.

I’ve seen the pattern above so many times that I believe labeling it will help us all communicate better. Here are some wordy descriptions that may be used:
  • Continuing this discussion over email is harmful.
  • Continuing this discussion over email will cause more harm than good.
  • This discussion thread has reached a point of negative return.
  • We have reached a point where we should continue this discussion using a different medium.
  • Email is not the best way to continue this discussion.
  • Email is failing to communicate the point.
  • Email is failing to advance our understanding.
  • Email fails to advance the subject.
Yes, we can make an acronym, but down with acronyms. The world doesn’t need another EFAU, or EFAS. Let’s keep that as a last resort.

Any ideas?

Tuesday, March 11, 2014

How to Make Feature Teams Work for You

Let's say that I'm a CTO, and my IT program management is a mess. I can't allocate budget based on business priorities and company-wide initiatives. So I decide to unify my entire IT force into a single pool of resources, and divide it into short-lived feature teams, formed around my current priorities. As my priorities change, I dissolve some teams and form others. And it is like a dream come true: now I can allocate my budget easily, have predictability in my spend, and avoid wasting money on projects that are not aligned with my goals. Or is it?

Let’s look at this in a bit more detail. Feature teams will either work on existing assets or develop new ones, and perhaps decommission old ones in the process. Now, what will be the state of these assets while the feature teams work on delivering their respective initiatives?

Consider the following analogy. Let's say that every one of my IT assets, be it an application or a backend service, is an airplane. Some teams are building new airplanes. Other teams are adding capabilities to existing airplanes. Some airplanes are in flight, delivering business value. Some airplanes require more maintenance than others. But each one has its own specialized flying instructions and maintenance procedures. Consider my feature teams to be the flying and maintenance crews. Based on my business priorities, I assign crews to airplanes: building new ones, adding capabilities to existing ones, or decommissioning some, in whole or in part, and replacing them with others. All the while keeping all airplanes flying.

No, wait: who is keeping the airplanes flying? That is not a business imperative of any of the feature teams. An implicit assumption, maybe? Is there a separate maintenance crew to keep all airplanes flying? How are we going to ensure that feature teams will respect the flying procedures of each airplane?

And how are we going to ensure that feature teams will not step on each other's toes? How do we ensure that the flying integrity of each airplane is not compromised?

If someone were to tell me that this is similar to the model popularized by Spotify, I would respectfully beg to differ. In short, we can't simply wish for all the benefits of feature teams while ignoring all the challenges above.

Let's agree on one thing first: I should be able to shift some of my budget and resources to respond to my current priorities. And at the same time, I should keep all my assets healthy, and all my airplanes in perfect flying condition.

How can we achieve both of these goals?

It is imperative that I keep a focused, long-lived team around every one of my assets. The size of this team will depend on the complexity of the asset. We can even have one team be responsible for multiple assets.
In addition to these asset teams, we can still have our feature teams. But if a feature team would like to modify an asset to deliver a business need, it will have to embed into, or extend, the core asset team. This extended team will have to work in harmony to preserve the integrity of the asset, ensure the compatibility of the new implementation with the existing architecture, and maintain alignment with the long-term goals for that asset. Any changes will need to be approved by the core team. If a feature team needs to change multiple assets, it will have to divide up, or work on one asset at a time. There could be multiple feature teams extending the same core team for one asset. If this is the case, or if there is an asset that undergoes continuous changes by multiple feature teams, consider expanding that asset's core team.

This way, you'll ensure that all your airplanes are in sound flying condition, and that none will come crashing down on you for lack of maintenance, or because different crews are trying to pull them in different directions.

Friday, January 17, 2014

Successful Distributed Development: Discipline, Awareness, and Initiative

For many of today's IT organizations, distributed development is not optional. Rather, it is a fact of life. We should be mindful of the limitations introduced by this mode of operation, while working constantly to mitigate them. It is a challenge that we have to actively tackle. Focusing solely on the tools misses the opportunity for true collaboration. Tools should be considered an enabler; we won't be able to work remotely without them. But having the tools, by itself, doesn't guarantee success.

It's important for us to understand what the ideal is, so that we can strive to come as close as possible to achieving it, given whatever constraints our work environment imposes on us. There is no substitute for face-to-face interaction. There is nothing better than a co-located, cross-functional team. As we design our work activities, we should strive to converge as much as possible toward these ideals.

I'm guessing all of this is not new to you. Yet time and again, distributed projects suffer from communication breakdowns, misunderstandings, and unmet expectations, among many other dysfunctions, often resorting to heavy processes that only make things worse. There are successful distributed teams, however. Below are a few of the traits exhibited by those teams. Adopting these traits in your distributed environment can help you converge closer to the ideal.


Discipline

I often hear project leaders complain that their teams are not working very well together despite having state-of-the-art telecommunication tools. It is valid to question the tools' adequacy, ease of use, and usability. It is more important to observe when, how, and even if, the team is using them. Compare your current situation to that of a co-located team. Play a what-if scenario: what if everyone had been working in the same room?
You will need a dedicated facilitator, a catalyst of sorts, in every one of your locations, to keep nudging people to reach out to each other. It is not fair to simply expect everyone to remember to step out of their immediate challenge to seek help, or to solicit a different perspective. Software development is an intellectually demanding profession, and it is not uncommon for people to get consumed by their immediate task and forget to reach out. There is a set of dynamics that can only happen if the whole team is co-located. You could overhear a couple of coworkers arguing about a problem you solved yesterday, or perhaps solving a problem you'll face tomorrow. You could have just come out of a planning session with a new understanding of the product vision, and you may just share that with your colleagues. And that is where the team facilitator's role comes in: to play the above what-if scenario. How about we tell the other offices about this? How about we consult with the other locations to see if they have encountered something similar? But it doesn't stop there.

A disciplined distributed team will embrace certain values and adhere to a set of practices that ensure the whole team operates as a single, cohesive unit. It's easier for us to fall back into our comfort zones, or be consumed by our tasks, than to always remember to reach out. The facilitator's role is to make sure that the distributed team never misses an opportunity to act as a co-located one, whenever possible.


Awareness

Let's face it: we don't naturally know how to work effectively in a distributed environment. We may know we have teammates in other locations. We may be curious about how they spend their days, want to learn about the challenges they face, and look forward to meeting them in person. But none of that means this knowledge will be translated into changing how we perform our day-to-day work to adapt to and address the limitations of this setting.

Well, our designated facilitator is there to change this reality. By actively seeking opportunities for cross-site collaboration, events, and feedback sessions, they continuously raise the whole team's awareness that a distributed setting requires a different mode of operation.

It's only when all the team members are fully aware that location barriers start coming down. There are certain signs you can watch for to assess whether the team has reached such a level. Statements like "let's wait until you are here next week to discuss this", "I couldn't really explain myself to the other team over the video conference", or "I couldn't really tell if they were happy or mad about this change" are clear red flags. By contrast, when the team has achieved full awareness, they won't let location be a barrier to effectively communicating ideas, or a factor in whether or not they collaborate on a task. They become adept at explaining themselves and actively seeking feedback, thus consciously and proactively overcoming the limitations of the tools and the remote setting.


Initiative

While all of the above is good by itself, and while an active facilitator can help tremendously toward this end, we will need a team of self-motivated individuals to really conquer location barriers. Individuals are the ones who carry out the necessary tasks to accomplish the team's goals, and they are the ones who experience firsthand the pains and the joys of getting things done. Unless we are all willing to step out of our comfort zones, seek new ways to make things better and overcome the daily challenges, and push each other to continuously improve, we won't have a chance of achieving our collective full potential.

I'll leave you with a quick tip: when choosing a mode of communication with a colleague, consider upgrading to a higher-touch one. If you are about to send an email, how about an instant message instead? While you are at it, wouldn't a phone call be even better? But then, what's preventing you from having the discussion face to face? Oftentimes, we overestimate the risk of being disruptive, while underestimating the limitations of more passive forms of communication, like email. If you've experienced long-winded email threads, going on seemingly forever, consider getting together in a room, virtual as it may be, to talk things over. The dynamic will be vastly different, in a good way.

If we can’t all co-locate, we should try to come as close to it as we possibly can. We must be deliberate and disciplined in how we adapt our work to this new environment, maintain constant awareness of the situation, and demonstrate the initiative to challenge its limitations.

Tuesday, December 11, 2012

We Won't Fail, and It Won't Be Fast

"Let's try it, and fail fast" may not always be a wise approach.

We had an issue: our project was being asked to carry out a piece of work in a predetermined way. We had fundamental disagreements with the proposed approach. We thought it was inefficient, added unnecessary complexity, and created a lot of waste. However, we faced considerable pressure to go along, given certain organizational and political constraints.

The idea was floated: "Let's just try it, and fail fast, when it becomes clear it's really not adequate for the task at hand." As appealing as this might seem, it embodies considerable risk.

Let's say we did that. We proceed to develop the application the way we are asked: we waste our time building this interface, integrating with that system, persisting transient, in-process data elements, and so on. Now, at what point exactly do we fail fast?

Consider the following two observations:
  • We always want to make things work, and
  • Waste is not always acknowledged as such.
Even when we work in a less than ideal environment, we always want to make the best of it. Even when our team is asked to accomplish a task while unfairly constrained, we will work hard to reach the best outcome. We will make it work, whatever the cost might be. What we end up with is, still, far from ideal, but we will go the distance, because we don't like to fail.

When we then try to show the powers that be how much more complexity or waste there is, it's unlikely that it will be acknowledged as such. For example, if your team's productivity was hindered by a requirement to keep reams of documentation up to date at every step along the way, with no consumer of this information whatsoever, someone, likely from the powers-that-be circle, will rise to assert how valuable it will eventually be.

The world of large enterprises today is full of hidden inefficiencies, originating from siloed teams and competing divisions. While we should always challenge this state, we should also understand when failing fast is not a realistic option.

Saturday, June 2, 2012

Profiling Lazy Evaluations

The point: lazy sequence evaluation renders the results of simple code profiling useless. Alternative techniques must be devised to correctly find code hot spots.

Recently, I wrote a function in Clojure that wasn't performing fast enough. Admittedly, I'm new to the language, so I tried the techniques I'm familiar with to find where the function was spending most of its execution time. I used the time function, as well as the profile library from clojure.contrib, wrapping pieces of code from the outside in, trying to close in on the slow parts. After some time, I was not getting any good information. It seemed that at some point, I would lose any indication that meaningful time was being spent executing the code being wrapped.

That is, until I finally noticed something very telling.

In an attempt to speed up a piece of code, I was checking to see if a collection was empty before processing further. The code looked like this: (empty? coll), and it was taking one second to execute!

Obviously, something was wrong, and that something was my understanding of how I could effectively profile this code. Since most of the underlying code was using lazy functions and sequences by default, the empty check caused a chain of delayed execution functions to activate. Well, mystery solved then, but how do I effectively profile this code to find where the time is being spent?

The most reliable measurements I got were from functions that are self-contained: those that process all of their input collections to return a number, for example, or those that don't process collections at all. reduce also exhausts collections, and calling count has a similar effect. I was able to use these techniques to improve the reported times because I knew that my code would process all the items in the underlying collections anyway, so forcing the realization of the full collection wouldn't change the overall time. But what if that wasn't the case?
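The same pitfall is easy to reproduce outside Clojure. Here is a minimal Python sketch (an analogy using generators, which are lazy like Clojure's sequences, not my original code): timing a cheap-looking check bills it for all the deferred work it happens to force, while forcing full realization up front, the analogue of exhausting the collection with count or reduce, attributes the cost to the pipeline itself.

```python
import time

def slow_source(n):
    """A lazy pipeline stage: each item takes ~10 ms to produce."""
    for i in range(n):
        time.sleep(0.01)
        yield i

def timed(thunk):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = thunk()
    return result, time.perf_counter() - start

# Naive profiling: "grab the first item" looks cheap, but pulling
# even one item triggers the deferred work behind it.
first, t_peek = timed(lambda: next(slow_source(100), None))

# Forcing the whole pipeline (like count/reduce over a lazy seq)
# charges the full cost to the code that actually incurs it.
realized, t_force = timed(lambda: list(slow_source(100)))
```

Note that forcing full realization only leaves the overall timing intact when, as in my case, every item would have been consumed anyway.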

So far, I don't have a great answer. For the time being, I'll resort to measuring the most elementary of operations in the code, as they provide the most reliable information.

Thursday, March 29, 2012

Instance Thrashing with Amazon EC2 Autoscaling

This post explains a situation our team encountered when we tried to use EC2 auto scaling for a particular application. We didn't end up using auto scaling. Instead, we allocated enough instances in advance to respond to the anticipated load.

In a nutshell: Automatically added instances could not handle the high load, and were immediately removed from the pool, rendering auto scaling unusable.

The graph depicts CPU utilization for 30 EC2 large instances.
We started with 4 instances, allowing the group to grow to up to 30 instances. The load test was designed to gradually increase the load to the maximum expected, and then sustain it at that level for a period of time.
The 4 instances behaved as expected, until the point where we experienced request timeouts. New instances were added to the group as expected, but we didn't observe an improvement in response time, or a reduction in dropped requests. This continued until the load was eventually reduced.
During the high load period, we kept querying for the number of healthy instances, and always found the number too low. The average was about 7.
It's not immediately apparent in the graph above, but it does show instances coming into the pool, at which point their CPU utilization peaks, then, soon after, the same instances' CPU utilization going back down.
We expected that with new instances added to the pool, CPU utilization would be reduced across all instances. Since all instances are configured the same way, we expected more or less the same CPU utilization across the group. We also expected the response time to go down in a similar pattern.
The average response time was stuck at 60 seconds, which is a signal that instances were dropping requests: the request timeout was set at 60 seconds, thus skewing the average to this number.
This behavior led us to conclude that our instances could not start up normally under heavy load. Further investigation was certainly due, but we never got around to it. It was deemed safer to keep enough instances up to respond to the anticipated load, which we confirmed in a subsequent load test.
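The failure mode we suspected, instances evicted before they can finish warming up, can be sketched as a toy simulation. This is purely illustrative: the scaling loop and every parameter below are invented for the sketch, not AWS's actual algorithm. The point it demonstrates is that when severe overload stalls instance startup past the health-check grace period, the healthy count never grows, no matter how many instances are launched.

```python
def simulate(total_load, start=4, max_pool=30, ticks=60,
             capacity=100, warmup=3, grace=5):
    """Toy scaling loop. Healthy instances each serve `capacity` load.
    A booting instance needs `warmup` ticks of boot progress to pass its
    health check, and is evicted if still unhealthy after `grace` ticks.
    Under severe overload, boot progress stalls (startup contention)."""
    healthy = start
    warming = []  # [age, boot_progress] per instance still booting
    for _ in range(ticks):
        overloaded = total_load > healthy * capacity
        stalled = total_load > healthy * capacity * 1.5  # severe overload
        if overloaded and healthy + len(warming) < max_pool:
            warming.append([0, 0])  # scale out by one instance
        for inst in warming:
            inst[0] += 1                    # one tick older
            inst[1] += 0 if stalled else 1  # progress stalls under severe load
        # Instances that finish warming join the healthy pool...
        healthy += sum(1 for age, prog in warming if prog >= warmup)
        # ...and instances that age out before warming are evicted.
        warming = [i for i in warming if i[1] < warmup and i[0] <= grace]
    return healthy
```

With moderate overload, new instances warm up in time and the pool grows until it can carry the load; with severe overload, every launched instance ages out before passing its health check, and the pool thrashes at its starting size, which is the shape of what we observed.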

Thursday, July 22, 2010

A Story of a Software Project

Iteration 1: The team picks up the first stories, and makes good progress. The result is showcased to the customers. The mood is encouraging.
Iteration 2: The team churns a bit, trying to get the first stories closed and iron out some technology choices.
Iteration 3: Not enough stories are being closed. The team's velocity is lower than needed. The PM gets worried, and starts calling meetings and raising red flags. The PM declares that the team needs to catch up, and look for ways to increase velocity.
Iteration 4: The team makes good strides, and appears to be back on track. Nerves settle down a bit.
Iteration 5: The team's velocity is soaring. The PM says that given the current velocity, the team will meet its target date. The team starts taking care of technical debt.
Iteration 6: The business functionality is taking shape. The customers start to get a feel of how the system works. They start asking for modifications.
Iteration 7: The customers become more demanding. They notice some gaps between the functionality of the application, and what is needed to run the business. Defects start to creep in.
Iteration 8: The PM talks the business out of some of their demands, and the team devises workarounds for some outstanding issues.
Iteration 9: Faced with approaching deadlines, the PM asks developers to stay late and work over the weekends to finish the remaining tasks. The code quality suffers and technical debt increases.
Iteration 10: The team manages to finish all the remaining tasks. The application is put in production, with minor hiccups. Time to celebrate.

Does this sound familiar?
The team delivered on time. Is there a problem here?
The above pattern causes hardship for the team. The resulting code quality is rarely satisfactory. But we shouldn't be surprised, or get overly worked up, over things that are really to be expected:
- Estimates are not always met, because they are estimates.
- The team takes more time in the first iterations because it's the first time this team tackles this problem.
- The customers don't like what they see the first time, because it's the first time they see it.
- The project is taking more time than expected because our expectations are only now being reality-checked.
- The developers are being asked to work extra time because the team's management over-promised. However, the developers had no clue initially whether these promises could be met. Everything looked good on paper.

What's the way out of this?
This is not an easy problem. What makes it even more difficult is the fact that the team delivered after all, reducing the incentive for change. There are ways, however, to make things better:
- Educate all parties on all aspects of the project.
- Get the customers involved as early as possible.
- Manage all parties' expectations.
- Communicate regularly, and facilitate information sharing.
- Make it clear that the process of adaptation also includes dates and scope.
- Learn from the past. If you've seen this before, it's likely that you'll see it again unless you change your approach.