How To Measure DevOps Success: The 4 Key Metrics To Know (DORA, 2025)

If you lead a delivery organisation, you have probably been handed a dashboard with twenty "DevOps metrics" on it. Most of them are vanity numbers. The signal that has survived a decade of scrutiny is narrow: four metrics, validated annually across tens of thousands of practitioners by Google's DORA research program, that predict both software delivery performance and organisational outcomes. This is a working brief on what those four metrics are in their current form, what good actually looks like in numbers, and the single biggest thing that has changed since 2023: AI now sits in the middle of your pipeline and is quietly reshaping these numbers, not always in your favour.

[Figure: engineering productivity pillars, showing how the four DORA delivery metrics balance throughput against stability]

Two things to internalise before the metric-by-metric walkthrough. First, the four metrics are traditionally read as two opposing pairs: throughput (deployment frequency, lead time) and stability (change failure rate, recovery time). DORA's current grouping moves recovery time to the throughput side, as covered under metric 4, but the reading rule is unchanged: you read them together. A team that ships ten times a day but breaks production on a third of those deploys is not high-performing; it is reckless. The research is consistent here: high performers excel across all four, and a team that is weak on one is usually weak on all four, because the same underlying practices (small batches, automated testing, fast feedback) drive every one of them (InfoQ, 2024 DORA analysis). Second, the terminology has moved. If you are still quoting "MTTR" in a steering committee, you are a version behind.

The four metrics, in their current form

1. Deployment Frequency

How often you successfully release to production. This is the cleanest proxy for batch size: teams that deploy frequently are, by necessity, shipping small changes, which is what makes everything downstream safer. The elite band is on-demand, many deploys per day; low performers sit between once a week and once a month (Octopus Deploy / DORA benchmarks). The trap is celebrating the number in isolation. Frequency is only meaningful when stability holds at the same time, which is exactly why you never report it alone.
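As a rough illustration of how this can be pulled from pipeline data, the sketch below computes two views of deployment frequency from a list of successful production deploy dates. The function names and the input shape are assumptions made for this example, not part of any DORA tooling.

```python
from datetime import date


def deployments_per_day(deploy_dates: list[date], start: date, end: date) -> float:
    """Average successful production deploys per calendar day (window is inclusive)."""
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("end must not precede start")
    in_window = [d for d in deploy_dates if start <= d <= end]
    return len(in_window) / days


def share_of_days_with_a_deploy(deploy_dates: list[date], start: date, end: date) -> float:
    """Fraction of calendar days in the window that saw at least one deploy."""
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("end must not precede start")
    active_days = {d for d in deploy_dates if start <= d <= end}
    return len(active_days) / days
```

Reporting both numbers guards against a misleading average: a burst of deploys on one day followed by a quiet week produces the same mean as a steady daily cadence, but a much lower share of active days.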

2. Lead Time for Changes

The elapsed time from a commit landing to that commit running in production. This is your pipeline's true cycle time, and it exposes everything between an engineer's keyboard and the customer: review queues, manual approval gates, flaky test suites, release trains that only depart on Thursdays. Elite teams get from commit to production in less than a day; low performers measure this in weeks or months (Octopus Deploy / DORA benchmarks). For regulated teams, lead time is where your change-management and segregation-of-duties controls become visible as latency. The goal is not to remove the controls; it is to automate the evidence so the control costs milliseconds, not days.
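A minimal sketch of the calculation, assuming you can export a (commit time, production deploy time) pair for each change. The function name is hypothetical, and the use of the median rather than the mean is a choice made here so that one stuck change does not dominate the figure.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median elapsed time from commit to production across the given changes.

    Each item is a (committed_at, deployed_at) pair.
    """
    if not changes:
        raise ValueError("need at least one change")
    seconds = []
    for committed_at, deployed_at in changes:
        if deployed_at < committed_at:
            raise ValueError("a change cannot be deployed before it is committed")
        seconds.append((deployed_at - committed_at).total_seconds())
    return timedelta(seconds=median(seconds))
```

Comparing the result against `timedelta(days=1)` gives a quick check on whether a team sits inside the less-than-a-day band quoted above.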

3. Change Failure Rate

The percentage of deployments that result in a degraded service requiring remediation (a hotfix, rollback, or patch). This is the stability counterweight to deployment frequency. The bands are concrete and worth memorising: elite is 5%, high 10%, medium 15%, and low performers run all the way up to roughly 64% (Octopus Deploy / DORA benchmarks). A change failure rate above 15% is rarely a tooling problem; it usually points to testing and review gaps that no amount of dashboarding fixes.
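The arithmetic is simple enough to sketch. The example below computes the rate and maps it to the bands quoted above. Treating the published figures (5%, 10%, 15%) as upper cut-offs is an assumption of this sketch, and both function names are hypothetical.

```python
def change_failure_rate(total_deploys: int, failed_deploys: int) -> float:
    """Percentage of deployments that needed remediation (hotfix, rollback, or patch)."""
    if total_deploys <= 0:
        raise ValueError("total_deploys must be positive")
    if not 0 <= failed_deploys <= total_deploys:
        raise ValueError("failed_deploys must be between 0 and total_deploys")
    return 100.0 * failed_deploys / total_deploys


def band_for(rate_percent: float) -> str:
    """Map a rate to a band, using the quoted figures as upper cut-offs (an assumption)."""
    if rate_percent <= 5:
        return "elite"
    if rate_percent <= 10:
        return "high"
    if rate_percent <= 15:
        return "medium"
    return "low"
```

For instance, 3 failed deploys out of 40 gives 7.5%, which lands in the high band under these cut-offs, while the team that breaks production on a third of its deploys lands firmly in the low band.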

4. Failed Deployment Recovery Time

This is the metric most leaders still get wrong by name. DORA has retired "Mean Time to Recovery (MTTR)" and now reports Failed Deployment Recovery Time: specifically, how long it takes to restore service after a failed deployment, not after any incident (Octopus Deploy / DORA benchmarks; DORA metrics history). The scoping is deliberate, because it isolates the part of recovery you actually control through your release process. Elite teams recover in less than an hour; high performers in under a day. Note also that DORA groups this metric with throughput rather than stability, on the reasoning that fast recovery is itself a function of how quickly you can ship a fix (DORA metrics history). The practical takeaway is unchanged: invest in instant, no-questions-asked rollback. A one-click revert is worth more than any post-incident process document.
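To make the scoping concrete, here is a minimal sketch that measures recovery time only for incidents caused by a deployment and ignores the rest. The `Incident` record and its `caused_by_deploy` flag are assumptions about what your incident tooling can export, and the median is a choice made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Incident:
    started: datetime
    restored: datetime
    caused_by_deploy: bool  # hypothetical flag set during incident triage


def failed_deployment_recovery_time(incidents: list[Incident]) -> Optional[timedelta]:
    """Median time to restore service, counting only deployment-caused incidents."""
    durations = [
        (i.restored - i.started).total_seconds()
        for i in incidents
        if i.caused_by_deploy
    ]
    if not durations:
        return None  # no failed deployments in the period
    return timedelta(seconds=median(durations))
```

An outage caused by, say, a third-party provider still matters, but under this definition it does not move the number, which is what keeps the metric tied to your release process.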

Beyond the four: Rework Rate and Reliability

The canonical four remain the industry standard, but treating them as the complete picture is now a mistake. DORA has added a fifth metric, Rework Rate, defined as the proportion of deployments that are unplanned fixes triggered by a production issue, as a complement to change failure rate. Where change failure rate captures the deploys that broke, rework rate captures the instability that surfaces afterwards as unplanned production work (Octopus Deploy / DORA benchmarks). DORA also tracks a broader Reliability dimension covering whether the service meets its own availability and latency targets (Octopus Deploy / DORA benchmarks). If your team is mature on the four, rework rate is the next number I would put on the board, because it catches a failure mode the original four miss: the slow drip of "we shipped it, then spent the next three days babysitting it."
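Following the definition above, rework rate can be sketched as a simple proportion, assuming each deployment in your log carries a label saying whether it was planned work or an unplanned fix. The label value `"unplanned_fix"` and the function name are hypothetical.

```python
def rework_rate(deploy_kinds: list[str]) -> float:
    """Percentage of deployments that were unplanned fixes for a production issue."""
    if not deploy_kinds:
        raise ValueError("need at least one deployment")
    unplanned = sum(1 for kind in deploy_kinds if kind == "unplanned_fix")
    return 100.0 * unplanned / len(deploy_kinds)
```

The hard part in practice is the labelling rather than the arithmetic: the number is only as good as the discipline with which teams tag a deploy as an unplanned fix.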

The AI variable: an amplifier, not a fix

This is the part that has genuinely changed since the 2023 version of this advice, and the part most worth your attention. In the 2025 DORA report, drawn from nearly 5,000 respondents, 90% now use AI at work and over 80% report productivity gains, and for the first time AI's relationship with throughput turned positive (Google Cloud, 2025 DORA report). That is the good news, and it is real. The uncomfortable news is that AI continues to correlate with worse delivery stability: higher change failures and more rework (Google Cloud, 2025 DORA report). The 2024 report quantified the drag plainly: a 25% increase in AI adoption tracked with roughly a 1.5% drop in throughput and a 7.2% reduction in stability, alongside 39% of practitioners reporting little or no trust in AI-generated code (Google Cloud, 2024 DORA report). The mechanism is not mysterious: AI makes it trivial to produce larger changesets, which directly violates the small-batch principle that every one of the four metrics rewards (InfoQ, 2024 DORA analysis).

DORA's headline conclusion is the line to take into your next planning meeting: "AI doesn't fix a team; it amplifies what's already there" (Google Cloud, 2025 DORA report). A team with strong tests, fast feedback, and disciplined batch sizes converts AI's speed into safe speed. A team without those foundations converts the same AI into faster production incidents. This is precisely why you measure all four metrics rather than chasing velocity alone: the four are your early-warning system for whether AI is amplifying your strengths or your weaknesses. The report is equally clear that the largest ROI comes not from buying more tooling but from improving the underlying system, and it points to platform engineering as the key enabler. According to the report, 90% of organisations have now adopted at least one internal developer platform, and that platform layer is what makes the small-batch, fast-recovery practices repeatable across teams (Google Cloud, 2025 DORA report).
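One way to turn the early-warning idea into a check is to compare a snapshot from before an AI assistant arrived with one from the quarter after. The sketch below is illustrative only: the `QuarterSnapshot` fields, the labels it returns, and the one-percentage-point tolerance are all assumptions, not DORA definitions.

```python
from dataclasses import dataclass


@dataclass
class QuarterSnapshot:
    deploys_per_week: float
    change_failure_rate: float  # percent
    rework_rate: float          # percent


def amplification_signal(
    before: QuarterSnapshot,
    after: QuarterSnapshot,
    tolerance: float = 1.0,  # percentage points of drift to ignore (assumed value)
) -> str:
    """Classify a before/after comparison as safe speed, a regression, or neither."""
    faster = after.deploys_per_week > before.deploys_per_week
    less_stable = (
        after.change_failure_rate - before.change_failure_rate > tolerance
        or after.rework_rate - before.rework_rate > tolerance
    )
    if less_stable:
        return "stability regression"
    if faster:
        return "safe speed"
    return "no clear change"
```

The stability check deliberately runs first, so a jump in throughput never masks a rise in failures or rework.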

How to use this

Instrument the four metrics directly from your deployment pipeline and incident tooling, not from a survey. Report throughput and stability side by side, every time, and add rework rate once the four are stable. Set targets against the published bands rather than inventing your own. And when an AI coding assistant lands in your stack, watch your change failure rate and rework rate for the following quarter as closely as you watch the productivity gains, because the second number is where the bill comes due. The metrics have not become more complicated since 2023. The environment around them has, and that is exactly why measuring them well matters more now than it did then. For how this connects to our broader delivery work, see our DevOps and Cloud engineering practice.