opex

How do you drive your teams to care more about operational excellence?

When an operational outcome needs to be achieved, we are after a predictable and repeatable process. This means that we can’t approach the execution of these outcomes like a creative task, allowing makers to self-organise. "OpEx" is an area where rigidity helps.

We can approach "Operational Excellence" on 2 fronts:

  1. Leverage an operating model where people must care for operational outcomes
  2. Help your team empathise with pains caused by operational failures

Here are a few ideas that have worked in my team. (We are a group of ~60 makers.)

For the operating model…

  • We assign a lead for each area of operational need. Their role is to understand the current state, discover growing pains and bring opportunities to attention. Additionally, this creates buy-in and leadership opportunities in the team.
  • Alerts on production are a must. We always have someone on the pager to answer alerts. The paged person also has the ability to disengage anyone from their work to resolve production issues.
    • We keep “2AM playbooks” documented so that anyone getting alerts has quick help on how to diagnose common issues and how to reach out to partner teams.
  • We talk about our operational stats once a week in a forum to provide wider visibility to the team and stakeholders. Being open about our failures and state of affairs helps everyone get on the same page about investing in operational outcomes.
    • We present metrics for performance, availability, incidents, time to react and resolve, traffic levels, bug counts and SLAs.
  • We classify errors and incidents into buckets. For example, incidents and bugs that have significant $ impact should be critical (if customers can’t purchase from your store, it would be significant).
    • We break our errors into critical, major and standard.
    • All logged errors carry a similar classification. We alert on each category's errors as a percentage of traffic and tune the thresholds based on severity.
    • We create SLAs for different categories of issues. When application health is poor in certain areas, we set quarterly goals to improve them.
  • Automation helps keep incidents to a minimum. We invest in some of these efforts regularly. Our recent effort has been to put automation on accessibility issues!
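
To make the classification and alerting points above more concrete, here is a minimal sketch in Python of alerting on errors as a percentage of traffic, with a tighter threshold for more severe categories. The threshold values, the `WindowStats` type and the function names are illustrative assumptions, not a description of our actual tooling.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the maximum tolerated errors, as a percentage of
# requests, before someone is paged. More severe categories get tighter limits.
ALERT_THRESHOLDS_PCT = {"critical": 0.1, "major": 1.0, "standard": 5.0}


@dataclass
class WindowStats:
    """Traffic and error counts observed in one monitoring window."""
    requests: int
    errors: dict[str, int]  # severity name -> number of logged errors


def error_rate_pct(errors: int, requests: int) -> float:
    """Errors as a percentage of traffic; 0.0 when there was no traffic."""
    if requests == 0:
        return 0.0
    return 100.0 * errors / requests


def severities_to_alert(stats: WindowStats,
                        thresholds: dict[str, float] = ALERT_THRESHOLDS_PCT) -> list[str]:
    """Return the severities whose error rate exceeds its threshold."""
    breached = []
    for severity, limit in thresholds.items():
        rate = error_rate_pct(stats.errors.get(severity, 0), stats.requests)
        if rate > limit:
            breached.append(severity)
    return breached


if __name__ == "__main__":
    window = WindowStats(requests=20_000,
                         errors={"critical": 30, "major": 150, "standard": 400})
    # critical: 0.15% (over 0.1%), major: 0.75%, standard: 2.0%
    print(severities_to_alert(window))
```

Tuning then becomes a matter of adjusting the per-severity limits rather than rewriting alert rules, which keeps the weekly conversation focused on the numbers.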

On the cultural front, we do a few things too…

  • Most teams talk about how they value quality. For me, the talk has to be met with real investment to show the team “we mean business”. One of our recent changes is to give directors the ability to vary investment in support by pausing or delaying other efforts. There is no need to get higher approval to invest in keeping the production environment low on errors.
  • We help the team understand the business value of errors. Understanding revenue impact can change attitudes towards bugs. We also try to look back on the previous quarter to quantify how much value was restored or added by our support and platform work.
  • Everyone makes a big deal out of operational wins. The PR campaigns showing off wins in support and performance gains outweigh most other things in the team! Encouraging positive behaviour always helps.
  • In recent times, we’ve done a lot of work to make mundane things like support feel a lot more meaningful. Instead of focusing just on closing out issues, we give engineers time to investigate root causes and fix system issues. Repeatedly resolving the same issues wears everyone down.

Ultimately, as leaders, we must ensure our team delivers an operationally excellent environment for the business to thrive. However, one thing I try to watch out for is that we continue to have a team that is not afraid to make mistakes. Having some of the operational pieces in place allows more fearless tinkering and innovation.