The Reliability Tax in Telecom Delivery: Shipping Faster Without Dropping the Call

The Reliability Tax in Telecom Delivery: Shipping Faster Without Dropping the Call

If you run engineering for a communications platform, you live inside a contradiction. The business wants more features, faster, on a network it cannot afford to break. Your SLAs are not marketing copy; for emergency-call paths they are a legal and human obligation. And the tooling that promises to make you faster is, on the current evidence, just as likely to make you less reliable. The job is no longer choosing between speed and stability. It is engineering a delivery system where the two stop trading against each other. This piece walks through four pressures that define that system for telecom and large-scale service platforms, each grounded in 2025 data, and what a disciplined response looks like.

AI is an amplifier, not a fix, and your foundations decide which way it amplifies

The 2025 DORA research on AI-assisted software development is blunt about the mechanism. AI adoption is now near-universal: 95% of developers use AI tools and over 80% report productivity gains, yet the same data shows an "Acceleration Whiplash" in which more code arrives with less scrutiny. The same analysis records median time in pull-request review up 441%, 31% more PRs merging with no review at all, bugs per developer up 54%, and incidents per PR up 242.7%. More code, arriving faster, scrutinized less.

The decisive variable is not the model; it is what the model is poured into. AI amplifies whatever delivery system already exists. Teams with strong testing, review discipline, and loosely-coupled architecture convert AI output into throughput. Teams with weak foundations convert it into technical debt and process chaos at higher velocity. Telecom organizations running mission-critical, tightly-coupled legacy stacks sit squarely in that second profile. A naive "use AI, ship faster" mandate pushed onto an OSS/BSS estate that was never designed for rapid, independent change does not accelerate delivery. It raises change-failure rate against exactly the reliability your SLAs depend on. The leadership move is to treat AI as a forcing function for the foundations: invest in automated testing, enforced review, and progressive delivery first, so that the amplifier has something worth amplifying.

In a tightly-coupled network, a routine change is a blast radius

The reason reliability dominates this domain is that the failure modes are not abstract. The 2025 Optus emergency-calling outage is the canonical recent case, and it is worth studying as a pure SDLC failure rather than as a telecom curiosity. An 18 September 2025 firewall upgrade caused roughly 600 failed Triple Zero emergency calls over about 13 hours across four Australian states, with four deaths reported during the outage, and Optus stated its monitoring system excluded Triple Zero calls so "there were no alarms".

Decompose that and every element is a familiar engineering control that was absent or weak. A routine change deployed without blast-radius containment. No canary or progressive rollout to catch the regression on a small slice of traffic before it reached the whole network. Monitoring that did not cover the single most critical path, so mean-time-to-detect stretched to hours on a system where seconds matter. And, by implication, no tested rollback that would have reverted the change the moment something looked wrong. None of these are exotic. They are the standard disciplines of change management and observability, applied to a tightly-coupled system where the cost of skipping them is measured in lives and government inquiries rather than dashboards. For a communications engineering leader the lesson is uncomfortable but actionable: the controls that feel like overhead on a routine firewall change are precisely the controls that exist for routine changes, because in this domain the routine ones are what kill you.

Operating two worlds at once is taxing your SRE capacity to the bone

The structural reason these controls are hard to apply consistently is that telecom is not migrating off legacy; it is running legacy and cloud-native in parallel, indefinitely. Monolithic OSS/BSS and circuit-era network functions are moving onto cloud-native 5G stacks built on microservices, Kubernetes, and multi-access edge. Teams operate both at once, which roughly doubles the operational surface. Traditional monitoring is part of the problem: tooling built for circuit-switched networks and monolithic OSS cannot handle 5G standalone, distributed edge, and microservices-based BSS, which is exactly the kind of gap that lets a critical path go unmonitored.

The human cost shows up in toil. On the network side the work is dominated by manual ticketing and "swivel-chair" provisioning, multivendor tool fragmentation, and non-standardized "snowflake" infrastructure that is hard to automate. Unmanaged toil is therefore not only a throughput tax; it is a retention and burnout risk, and burned-out engineers operating snowflake systems under time pressure are how change-failure rate climbs. The correction is to shift guardrails down into a self-service golden path rather than dumping operational load onto individual developers. The platform engineering response is real but immature: in one 2025 survey 29.6% of platform teams do not measure success at all, and 55.9% of companies now operate more than one platform. Building a platform is not the same as building one that demonstrably reduces toil.

NIS2 turns reliability evidence into a board-level, personally-enforced constraint

The final pressure removes any option to treat this as best-effort. Under the EU NIS2 Directive, telecom and digital infrastructure are in scope regardless of size, with a 24-hour early warning, 72-hour assessment and one-month final incident report mandated, fines up to EUR 10 million or 2% of global annual turnover, and management bodies "personally accountable for compliance," against a transposition deadline of 17 October 2024. For Benelux and wider European operators this is now binding law, not guidance.

Read those clocks as engineering requirements. A 24-hour early warning is impossible without observability good enough to detect and characterize an incident quickly, which is exactly the gap the Optus outage exposed. A one-month final report is impossible without audit-ready change records, supply-chain traceability, and tested incident-response runbooks already wired into the pipeline rather than reconstructed afterward. Personal accountability for management means the resilience evidence has to be a byproduct of how you deliver, not a document assembled under deadline pressure. NIS2 is what fuses the other three pressures into a single mandate: move fast, stay reliable, and prove it, simultaneously, with names attached.

What good looks like

The teams that hold all four constraints at once tend to share a small set of disciplines rather than a particular toolchain:

  • Progressive delivery by default for network and service changes, so the unit of risk is a canary slice, not the whole estate, with automated, tested rollback wired to the same signals.
  • Critical-path observability that is verified, not assumed, including synthetic monitoring of emergency and high-value call paths, so a silenced alarm is itself an alertable condition.
  • Guardrails shifted down into a golden path, so review, security checks, and compliance evidence are produced by the platform automatically rather than depending on individual engineers under load.
  • Measured platform outcomes, tracking toil, cognitive load, rework rate, and change-failure rate, because a platform nobody measures is a cost center, not a control.
  • Compliance evidence as a pipeline artifact, with change records and incident runbooks generated as you ship, ready to satisfy a 24/72-hour clock without a scramble.

None of this is exotic, and that is the point. The pressures on a communications platform in 2025 are AI amplifying instability, tightly-coupled change blowing up in production, toil draining the people who keep the network alive, and regulation making the whole thing personally accountable. What connects them is that each is a delivery-discipline problem dressed as a technology problem. A platform that encodes progressive delivery, verified observability, shifted-down guardrails, and audit-ready evidence is not a reliability tax you pay reluctantly. It is the only configuration in which speed and stability stop fighting, which in this domain is the difference between shipping and dropping the call.

Sources

Mateusz Ulas
Mateusz Ulas