How much planning is enough?

A few years back, I was sitting in Bunker Coffee with Luke for our 1:1.

Luke tells me, “You know, no matter how hard you plan, the error rate of our decisions doesn’t change much. If we take path A today versus next week, after having a whole lot more meetings, we still don’t know what we don’t know. What do you think?” He’s always polite enough to leave space for me to respond.

We had just come from a team discussion on how to implement a feature. The team couldn’t arrive at a decision. Our most experienced devs wanted to spend more time thinking and whiteboarding a solution.

Luke was right, of course. We needed to move forward and discover the unknowns. But when do you “just start” versus spend more time at the drawing board?

Luke doesn’t just write code. He also builds houses. He’s worked through enough building projects to know that there has to be a solid plan. However, software is far more malleable. Most decisions made in code can be changed. What truly matters is understanding the dependencies in the work being performed, because dependencies can make altering decisions far more challenging.

Generally speaking, the further down the stack you go, the more upfront thinking is required. In other words, volatility in code and product increases as systems get closer to customers. Remember images like the one below that encouraged us to front-load all issue tackling to save money? Well, they are wrong. Most solutions can do with making mistakes and validating assumptions sooner rather than later.

[Image: the cost of fixing bugs at different stages of development. Fixing in production is shown as 150x costlier.]

Not everything fits this model

The feature we were debating was for a mobile app that didn’t have a large number of users at that stage. That is the second piece to consider: customer impact. Systems with broad customer impact need more consideration. If millions of users are going to be affected by a certain identity verification system, more thought should go into it. This is why, as products scale, feature development does take a little longer.

Even when changes are high-impact, there are tricks to go faster. It is even more important to learn about unknowns and arrive at informed decisions in these products. A simple way to get there is to reduce the scope of changes. Scope can be reduced by either limiting customer exposure or breaking the change down into small iterations.
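
One common way to limit customer exposure is a percentage rollout. The sketch below is only an illustration of that idea, not something from our codebase: the function name `in_rollout`, the feature name and the user IDs are all made up for the example, and hashing into 100 buckets is just one assumed way to do it.

```python
import hashlib


def in_rollout(user_id: str, feature: str, exposure_percent: int) -> bool:
    """Deterministically decide whether a user sees a partially released feature.

    Hypothetical helper: the same user always lands in the same bucket for a
    given feature, so raising exposure_percent only ever adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < exposure_percent


# Example: expose a made-up "new-checkout" feature to roughly 10% of users.
candidates = ["user-1", "user-2", "user-3"]
exposed = [u for u in candidates if in_rollout(u, "new-checkout", 10)]
```

Starting at a small percentage lets the team learn about unknowns with a limited blast radius, then widen exposure as confidence grows.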

If you are running a project, always ask your team how something can be delivered in half the time. You can’t bend time and space; things will take longer than predicted. But you can achieve outcomes sooner than you think.

Dependencies and customer impact

Many tend to be very risk-averse, especially in high-revenue environments. As I discussed in the “Companies Grow” post, this can become an attribute of growing companies, even to the point of rejecting innovative ideas altogether. More planning becomes the answer to appease this conservatism.

Moreover, developers are always looking to develop the ideal system; they hate writing throwaway stuff. There is a natural inertia towards being careful with every change.

We should always keep in mind that these actions have a price attached to them. Being agile fundamentally requires us to try things, learn from them, and try again. We are always wiser for having reworked our solutions with new insight. Mature teams understand this. They aren’t afraid to test solutions, validate assumptions and go back to the drawing board. They aren’t afraid to fail. As Luke reminded me, there is always a chance of failure.

Lastly, remember… (maybe this is a construction industry thing) …as my house builder points out, “a plan is just a starting point.” The actual work lies ahead of you.

How to drive your teams to care more about operational excellence?

When an operational outcome needs to be achieved, we are after a predictable and repeatable process. This means that we can’t approach the execution of these outcomes like a creative task, allowing makers to self-organise. "OpEx" is an area where rigidity helps.

We can approach "Operational Excellence" on 2 fronts:

  1. Leverage an operating model where people must care for operational outcomes
  2. Help your team empathise with pains caused by operational failures

Here are a few ideas that have worked in my team. (We are a group of ~60 makers.)

For the operating model…

  • We assign leads for each area of operational need. Their role is to understand the current state, discover growing pains and bring opportunities to attention. Additionally, this creates buy-in and leadership opportunities in the team.
  • Alerts on production are a must. We always have someone on the pager to answer alerts. The paged person also has the ability to disengage anyone from their work to resolve production issues.
    • We keep documented “2AM playbooks” so that anyone getting alerts can quickly see how to diagnose common issues and how to reach out to partner teams.
  • We talk about our operational stats once a week in a forum to provide wider visibility to the team and stakeholders. Being open about our failures and state of affairs helps everyone get on the same page about investing in operational outcomes.
    • We present metrics for performance, availability, incidents, time to react and resolve, traffic levels, bug counts and SLAs.
  • We classify errors and incidents into buckets. For example, incidents and bugs that have significant $ impact should be critical (if customers can’t purchase from your store, it would be significant).
    • We break our errors into critical, major and standard.
    • All logged errors carry a similar classification. We alert on the percentage of errors in each category relative to traffic, and tune the thresholds based on severity.
    • We create SLAs for different categories of issues. When application health is low in certain areas, we set quarterly goals to improve it.
  • Automation helps keep incidents to a minimum. We invest in some of these efforts regularly. Our recent effort has been to put automation on accessibility issues!
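
The classification and alerting ideas above could be sketched roughly like this. It is a minimal illustration under stated assumptions, not our actual tooling: only the critical/major/standard buckets come from the list above, while the threshold values and the function names `error_rates` and `severities_to_alert` are invented for the example.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    STANDARD = "standard"


# Hypothetical thresholds: tolerated errors as a fraction of total requests.
# More severe buckets get tighter limits; tune these for your own traffic.
ALERT_THRESHOLDS = {
    Severity.CRITICAL: 0.001,
    Severity.MAJOR: 0.01,
    Severity.STANDARD: 0.05,
}


def error_rates(errors, request_count):
    """Return the error rate per severity as a fraction of total traffic."""
    if request_count <= 0:
        return {severity: 0.0 for severity in Severity}
    counts = Counter(errors)
    return {severity: counts[severity] / request_count for severity in Severity}


def severities_to_alert(errors, request_count):
    """List the severities whose error rate exceeds its threshold."""
    rates = error_rates(errors, request_count)
    return [s for s in Severity if rates[s] > ALERT_THRESHOLDS[s]]


# Example window: 3 critical and 20 standard errors across 1,000 requests.
window = [Severity.CRITICAL] * 3 + [Severity.STANDARD] * 20
to_page = severities_to_alert(window, request_count=1000)
```

Comparing error counts against traffic, rather than using raw counts, keeps alerts meaningful as traffic levels rise and fall.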

On the cultural front, we do a few things too…

  • Most teams talk about how they value quality. For me, the talk has to be met with real investment to show the team “we mean business”. One of our recent changes is to give directors the ability to vary investment in support by pausing or delaying other efforts. There is no need to get higher approval to invest in keeping the production environment low on errors.
  • We help the team understand the business value of errors. Understanding revenue impact can change attitudes towards bugs. We also try to look back on the previous quarter to quantify how much value was restored or added by our support and platform work.
  • Everyone makes a big deal out of operational wins. The PR campaigns showing off wins in support and performance gains outweigh most other things in the team! Encouraging positive behaviour always helps.
  • In recent times, we’ve done a lot of work to make mundane things like support feel a lot more meaningful. Instead of focusing just on closing out issues, we give engineers time to investigate root causes and fix system issues. Repeatedly resolving the same issues wears everyone down.

Ultimately, as leaders, we must ensure our team delivers an operationally excellent environment for the business to thrive. However, one thing I try to watch out for is that we continue to have a team that is not afraid to make mistakes. Having some of the operational pieces set allows more fearless tinkering and innovation.

Attributing value to teams in large organizations

In a large tech company, there are layers and layers of teams that contribute to the success of a product or service offered. Many of these layers tend to only have indirect impact on customers. Teams that have a direct relationship with customers tend to have a lot of power. They bear the burden of delivering perceived value, in exchange for larger investments.

As organisations grow, the responsibility of focusing on critical parts of a product gets split up and assigned to various teams. Many of these teams would be in the middle to lower layers, not having a direct relationship with customers. Their contributions often go unchecked and under-appreciated, as their value isn’t easily captured and accounted for. This starts to erode their understanding of the value they provide. Focus shifts.

Let’s take an example of a team tasked with creating a data analytics library for a website. Collecting analytics on a product is critical to improving it. How important is it compared to contributions from the frontend or services teams? Should the analytics library team focus their efforts on accuracy, library performance, or perhaps, code quality?

As different parts of an organisation start to think about value very differently, it becomes much harder to innovate and continue to be competitive. The product itself starts to suffer.

The key thing to work on here is “attribution of work” at every layer. If the business is making 100M a year, what percent of that is thanks to the analytics library? The relationship is not direct, and therefore not obvious. But it is a worthwhile exercise to understand attribution.

Once attribution is understood, we can share growth targets based on that. Teams at each level can start to work towards a common goal. This facilitates collaboration among teams. Decisions like prioritising backlogs and staffing investments start to become easier.

If the analytics library team understands that they will be at least 5% contributors to the 20M growth target for online sales, they can start to power features towards the growth target itself. They can start to ask questions like, “would innovating a new feature enable teams to evaluate and capture market segments that weren’t understood before?”
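
To make the arithmetic concrete, here is a small sketch. The 5% share and the 20M growth target come from the example above; the function name `attributed_target` is hypothetical, and splitting the target linearly by share is an assumption, since real attribution is rarely this clean.

```python
def attributed_target(growth_target, share_percent):
    """Portion of a shared growth target attributed to one team.

    Hypothetical helper: share_percent is the team's attribution as a
    percentage (for example 5 for 5%).
    """
    if not 0 <= share_percent <= 100:
        raise ValueError("share_percent must be between 0 and 100")
    return growth_target * share_percent / 100


# Figures from the example: a 20M growth target and a 5% attribution.
analytics_target = attributed_target(20_000_000, 5)
print(f"Analytics library team share of the target: {analytics_target:,.0f}")
```

Under these assumptions the analytics library team would own at least 1M of the growth target, a number they can weigh backlog items against.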

Speaking a common language of value is a hard thing. Understanding exactly how “the janitor helps put a man on the moon” is tough too. Doing these hard things can lead to a much more cohesive working environment and innovation.