Operational Excellence

Jan 30, 2024

An image that shows cogs and wheels to represent operational work. — Created using https://designer.microsoft.com/

Introduction

This article is a continuation of my series on developing and improving engineering capability within an organisation. In this article I propose my definition of Operational Excellence and its key constituent parts. I also propose a framework that can be used to measure the maturity of an Engineering team’s operational practices.

Chronological Reading In this Series

Setting up high-performing, autonomous Engineering Teams and their key traits — https://nik-gupta.medium.com/developing-and-improving-engineering-capability-1dbac4caa629
The Art of Engineering Excellence and a framework to measure your team’s maturity — https://nik-gupta.medium.com/engineering-excellence-793531c99df0
The Art of Operational Excellence and a framework to measure your team’s maturity — https://nik-gupta.medium.com/operational-excellence-51a16f183474

Defining Operational Excellence

As with Engineering Excellence, there are many interpretations and definitions of Operational Excellence (OE). I define OE as:

A set of best practices and tools that the teams adopt to consistently and efficiently deliver high quality, secure software to their customers while minimising cost of delivery by continuously optimising resources, eliminating waste and improving their security posture.

These best practices can be grouped under the following themes:

Change Management
Monitoring
Resiliency
Incident Management
Security and Risk Management

Within each of these themes, teams can define their own frameworks, ways of working and evaluation models to measure their maturity with respect to operational excellence.

Questions that Stimulate Great Debates

Consider answering some of these questions to probe deeply into your current practices and to identify what would bring the most benefit to your organisation and teams.

What is your mechanism to measure your security and risk posture? Do you track open vulnerabilities across your systems and patch them regularly?
Does the team run an operational review forum where they measure their incident and exception trends and take action to drive continuous improvements across them?
How much of the team’s available bandwidth is invested in activities that do not directly create new capabilities (bug fixing, incident management, production support etc.)?¹
Does the team regularly review the cost of building, running and delivering software to their customers (infrastructure cost, software licensing etc.), and do they undertake initiatives to reduce that cost in real terms over time?
Does the team have a set of well-publicised monitoring dashboards that help them track their systems’ normal operations and spot anomalies?
Are these dashboards set up to provide useful drill downs quickly in the event of incidents?
Does the team have a defined on-call rota with a severity-based alerting mechanism to ensure all critical incidents are acted on within their prescribed SLAs?
How many of the team’s deployments result in incidents, what is the lead time to recognise these incidents, and how long does it take to resolve them? Are these figures discussed in the OE forum, and can the team demonstrate an improving trend?
Does the team run regular game days or load testing days to improve the performance of their software and ensure high availability during peak times?
What is the team’s protocol for ensuring business continuity and restoring from backups if needed?
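
The deployment question above asks for three numbers: how many deployments cause incidents, how long incidents take to be recognised, and how long they take to be resolved. As a minimal sketch of how a team might compute these for an OE review, the Python below assumes a hypothetical `Deployment` record with three timestamps. The record shape, the function names and the sample data are all illustrative assumptions, not part of the framework itself. Time to resolve is measured here from detection to resolution, which is one possible convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass
class Deployment:
    """Hypothetical record of one production deployment.

    The incident fields stay None when the deployment caused no incident.
    """
    deployed_at: datetime
    incident_detected_at: Optional[datetime] = None
    incident_resolved_at: Optional[datetime] = None


def change_failure_rate(deployments: Sequence[Deployment]) -> float:
    """Fraction of deployments that resulted in an incident."""
    if not deployments:
        return 0.0
    failed = [d for d in deployments if d.incident_detected_at is not None]
    return len(failed) / len(deployments)


def mean_time_to_detect(deployments: Sequence[Deployment]) -> Optional[timedelta]:
    """Average gap between a deployment and detection of its incident."""
    gaps = [
        d.incident_detected_at - d.deployed_at
        for d in deployments
        if d.incident_detected_at is not None
    ]
    return sum(gaps, timedelta()) / len(gaps) if gaps else None


def mean_time_to_resolve(deployments: Sequence[Deployment]) -> Optional[timedelta]:
    """Average gap between detection and resolution (assumed convention)."""
    gaps = [
        d.incident_resolved_at - d.incident_detected_at
        for d in deployments
        if d.incident_detected_at is not None and d.incident_resolved_at is not None
    ]
    return sum(gaps, timedelta()) / len(gaps) if gaps else None


# Invented sample data, for illustration only.
sample = [
    Deployment(datetime(2024, 1, 8, 10, 0)),
    Deployment(datetime(2024, 1, 9, 10, 0),
               datetime(2024, 1, 9, 10, 30),
               datetime(2024, 1, 9, 12, 30)),
    Deployment(datetime(2024, 1, 10, 10, 0)),
    Deployment(datetime(2024, 1, 11, 10, 0),
               datetime(2024, 1, 11, 11, 30),
               datetime(2024, 1, 11, 12, 30)),
]

print("Change failure rate:", change_failure_rate(sample))
print("Mean time to detect:", mean_time_to_detect(sample))
print("Mean time to resolve:", mean_time_to_resolve(sample))
```

Tracking these values per review period, rather than as one-off figures, is what lets a team show whether the trend is actually improving.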

A framework for Operational Excellence

For convenience, I am providing a basic framework that attempts to categorise teams into 5 possible maturity levels (5 being the most mature) depending on the practices they follow within each of the themes mentioned earlier in this article.

Use this framework as a starting point and adjust, delete or add whatever works best for your unique DNA to craft your own version of Operational Excellence and the criteria for evaluating the maturity of your teams.

Because Medium cannot render tables, the framework is provided as a screenshot. The editable version is available here: https://docs.google.com/document/d/1d5k1c2LYU1bnl2wL6xa1ZD5ru01oiD_z/edit?usp=sharing&ouid=117333362739476372488&rtpof=true&sd=true

Summary

Operational Excellence is a set of processes, tools and mental models that reduces the operational cost of delivering secure software. These processes are summed up in the 5 themes presented earlier in this post. Please use the provided framework as a starting point to probe deeply into what works best for your organisation. If you have comments or feedback, please do not hesitate to reach out to me on LinkedIn: https://www.linkedin.com/in/nik-gupta/

¹ As a rule of thumb, if a team is spending more than 20% of its available bandwidth on these activities, it needs an intervention to course-correct.
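
The rule of thumb in this footnote is simple arithmetic, and a small check like the one below could be run against a team's time-tracking totals. This is only a sketch: the function names, the use of hours as the unit, and the example figures are assumptions made for illustration. The 20% threshold is the one stated in the footnote.

```python
# Threshold taken from the footnote's rule of thumb: 20% of available bandwidth.
INTERVENTION_THRESHOLD = 0.20


def operational_load(operational_hours: float, available_hours: float) -> float:
    """Share of available time spent on work that creates no new capability
    (bug fixing, incident management, production support and similar)."""
    if available_hours <= 0:
        raise ValueError("available_hours must be positive")
    return operational_hours / available_hours


def needs_intervention(
    operational_hours: float,
    available_hours: float,
    threshold: float = INTERVENTION_THRESHOLD,
) -> bool:
    """True when the operational share strictly exceeds the threshold."""
    return operational_load(operational_hours, available_hours) > threshold


# Hypothetical example: 5 engineers, a two-week period, 40 hours per week each.
available = 5 * 2 * 40  # 400 hours
print(operational_load(100, available))    # 100 of 400 hours on operational work
print(needs_intervention(100, available))  # above the 20% threshold
print(needs_intervention(60, available))   # below the 20% threshold
```

The comparison is strict, matching the footnote's wording of "more than 20%".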