Introduction

“Tell me how you measure me and I will tell you how I will behave. If you measure me in an illogical way … do not complain about illogical behavior.” -- Eliyahu Goldratt

Fundamentally changing how people work is hard. Metrics are useful to drive desired behaviors, but they can also be easily gamed. Without a comprehensive strategy to connect daily work habits to overall business outcomes, most organizations will fail to drive the types of change necessary to survive in today’s competitive marketplace.

Why Metrics Are Difficult to Get Right

Operational maturity doesn’t happen overnight; it takes continued commitment, focus, learning, and practice. According to IDC, on average, 3.7% of an enterprise organization’s revenue is spent on digital transformation initiatives. The goals of transformation initiatives typically include outcomes like achieving greater profitability and market share by becoming a faster, more nimble, and more secure organization that can better serve the needs of demanding customers in today’s competitive marketplace. To achieve that, organizational transformation initiatives typically require very large investments, long-term strategy, and executive support.

Creating high-performance teams by adopting modern software delivery and operational practices often requires reorganizing technical teams, training them to work with new technologies that have different workflows, and changing existing processes. Adjusting to these changes can be tough and time-consuming as teams figure out how to modernize their practices. At the same time, executive sponsors want to see quantifiable results. That creates pressure on managers to figure out a way to quantify something as intangible as whether a team is improving its technical aptitude.

Addressing the need for meaningful quantifiable measures requires a holistic approach to accounting for what an organization intends to accomplish. But without a clear strategy for doing that, it’s easy to focus on individual performance and not see the forest for the trees.

A common strategy for managers new to restructuring their teams is to demonstrate quick wins by measuring the first few high-impact metrics they can identify. For example, when restructuring teams into a full-service ownership model, measuring incident response metrics like Mean-Time-To-Acknowledge (MTTA) and Mean-Time-To-Resolve (MTTR) can quickly demonstrate the value of a new operating model. It’s motivating to get points on the board early in the game. By itself, however, that isn’t the same as having a long-term winning strategy.
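As a rough illustration of how these two metrics might be computed, the Python sketch below averages acknowledgement and resolution delays over a list of incident records. The `Incident` fields and the function names are hypothetical and do not correspond to any particular tool’s export format. The sketch also assumes both metrics are measured from the moment an incident is triggered, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real incident exports will differ.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident triggering and being acknowledged."""
    # statistics.mean raises StatisticsError if the list is empty.
    seconds = mean(
        (i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident triggering and being resolved."""
    seconds = mean(
        (i.resolved_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


# Two made-up incidents, used only to exercise the functions.
start = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(start, start + timedelta(minutes=2), start + timedelta(minutes=30)),
    Incident(start, start + timedelta(minutes=4), start + timedelta(minutes=60)),
]

print("MTTA:", mean_time_to_acknowledge(incidents))
print("MTTR:", mean_time_to_resolve(incidents))
```

A plain average is easy to explain but is sensitive to a single long-running incident, so some teams may prefer to look at a median or percentile alongside it.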

Though it’s a good strategy for getting started, it would be a misstep to set individual goals around the same types of metrics. For example, a common next step is wanting to set goals around the number of incidents reported in production. Some managers reason that if their teams are indeed improving their technical aptitude and building more stable services, it would follow that there should be fewer incidents in production as a result. They then set goals around reducing the overall number of incidents in production. Not only is the fundamental assumption flawed (technical aptitude and incident frequency aren’t correlated; incidents are accidents, not intentional actions), but the incentives created are inadvertently dangerous.

If a team is only measured by the number of incidents reported, they are then incentivized to behave in ways that create a record of fewer incidents. The message to the team is clear: management wants fewer incidents. It doesn’t take an act of malice to then start identifying ways to cleverly reduce that number. Such a myopic approach to measuring success could reward actions like lowering incident reporting thresholds or grouping non-similar incidents. The goal was to incentivize creating more stable services. But by solely focusing on the number of incidents, the measure fell short of ensuring the right comprehensive set of behaviors across every team that contributes to running those services in production.

A more holistic approach would be to measure and track metrics like the number of incidents, but also set goals around meeting service-level objectives (SLOs). SLOs are a combined measure of whether a system is running sufficiently reliably and what design or architectural changes should be made to ensure its continued reliability. Agreeing on SLOs requires organizational alignment between technical and business teams, and it can be difficult to build that alignment without a clear approach. As a result, some teams settle for metrics that are more easily measured but ultimately counterproductive.
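To make the contrast concrete, the Python sketch below reports incident count and SLO attainment side by side for two services. It assumes a simple request-based availability SLO (the share of requests that succeeded); the `ServiceWindow` type, its fields, the 99.9% target, and all of the numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    # Hypothetical summary of one service over one review period.
    name: str
    incident_count: int
    good_requests: int
    total_requests: int


def availability(window: ServiceWindow) -> float:
    """Share of requests that succeeded during the window."""
    if window.total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return window.good_requests / window.total_requests


def meets_slo(window: ServiceWindow, target: float) -> bool:
    """True when measured availability is at or above the SLO target."""
    return availability(window) >= target


SLO_TARGET = 0.999  # illustrative 99.9% availability objective

windows = [
    ServiceWindow("checkout", incident_count=2,
                  good_requests=998_500, total_requests=1_000_000),
    ServiceWindow("search", incident_count=5,
                  good_requests=999_600, total_requests=1_000_000),
]

for w in windows:
    status = "met" if meets_slo(w, SLO_TARGET) else "missed"
    print(f"{w.name}: {w.incident_count} incidents, "
          f"availability {availability(w):.4%}, SLO {status}")
```

In this made-up data, the service with fewer recorded incidents is the one that misses its objective, which is the kind of signal an incident count alone would hide.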

Operational Reviews

Operational Reviews are a series of regularly scheduled activities that offer Engineering, IT, and Operations leaders the opportunity to understand the impact of their real-time operations on business outcomes. Operational Reviews are metric-driven activities that rely on quantitative data ahead of qualitative data. That data, along with the analysis and discussion among those involved, is meant to highlight the impact of technical service operations on key facets of a digital organization; namely people, costs to the business, and customers.

When undergoing digital transformation, one way to ensure that your teams are helping drive positive change is to create a cadence where they review the impact of their operations on business outcomes so they can make data-driven decisions that align all stakeholders on how to invest in the future. Running regular operational reviews allows you to optimize for the health of your people, the costs of running your digital business, and the customer experience. It keeps a pulse on qualitative change through quantitative data.

Operational reviews help you track metrics that indicate what’s happening at the technical team level on a daily basis, tie those metrics to the scope and impact they have on the business on a monthly basis, and use that to determine which investments into technical service reliability and process improvements can help the business reach its goals.

Getting Started

The process in this guide is the result of PagerDuty’s work with more than 10,000 customers that represent all different types of digital businesses and demonstrate various levels of real-time operational maturity. Through those relationships and interactions, we learned a few things.

First, we learned that the most mature digital organizations run regular operational reviews to align leadership on the right investments to make in technology operations to deliver successful business outcomes. Next, as we practiced this and shared our findings with customers, they asked us for guidance on what we do and what our most mature customers do. Last, many of our internal projects have shown us there are clear and quantifiable ways to drive the transformation of real-time operations.

The documentation that follows in this guide offers a methodology to help you get started in running Operational Reviews for your organization. The method is intended to serve as a practical starting point for many organizations. It not only creates visibility into what happens at a team level, but also helps organizations create impactful operational reviews by understanding what to review, when to review it, who to involve, what data is needed, and even how to run the meetings.

Info

Not all of the guidance provided here will be applicable for every organization, and varying levels of real-time operational maturity will impact how hard or easy it is to follow this guide. The intent of this documentation is not to prescribe a one-size-fits-all approach to running operational reviews; rather, it’s to provide a framework for any digital organization to adopt and make their own, and allow them to refine, expand, and improve the review process as the organization's operational maturity increases.

Highly mature digital organizations run a variety of reviews suited to the needs of people at each level. Although these reviews are different at every company, there are commonalities that are distilled in this guide. These reviews each align with a specific layer of organizational management, cover an intentional timeframe and scope, and seek to answer a concise set of specific questions. When practiced methodically and in tandem, these reviews work together to create focus and alignment across the organization.

The three commonly shared reviews at these organizations are: a weekly on-call review, a monthly service review, and a quarterly business review. The timing of these reviews is deliberate and common across organizations. Different impacts are understood in different timeframes. For example, it’s hard to understand the impact of large investments or initiatives until weeks or months have passed. But understanding the impact that an active on-call shift had on your teams requires fresh information and perspective on a weekly time scale. To get started, you should aim to create these three types of reviews at a minimum.

Tip

If you come across any words you don’t recognize at any point in our guides, take a look at the Definitions section.