What is operational resilience and why is it important?

Operational resilience is an organization's ability to prevent, adapt to, respond to, recover from, and learn from disruptions while continuing to deliver important business services within defined impact tolerances. It matters because concentrated dependencies turn single failures into systemic ones, and regulators in the UK, EU, and Australia now require firms to prove they can keep critical services running.

What is the difference between operational resilience and business continuity?

Business continuity is a holistic process that begins with a business impact analysis, develops business continuity plans and recovery strategies, and maintains critical operations during and after disruption. Operational resilience encompasses business continuity rather than replacing it, adding an outcome focus: assuring end-to-end delivery of important business services within impact tolerances, including third-party and technology dependencies.

What are impact tolerances and important business services?

An important business service is an externally facing service whose failure would cause intolerable harm to customers or markets. An impact tolerance is the maximum tolerable level of disruption to that service, usually a time threshold, set from harm rather than internal recovery convenience. UK PRA SS1/21 requires firms to identify both and test that tolerances can be met.

How do you test operational resilience through scenario testing?

You test important business services against severe-but-plausible scenarios such as a data-center loss, supplier outage, or ransomware event, using tabletop and simulation exercises, then measure whether each service stayed within its impact tolerance. DORA Article 24 requires a structured testing program. The purpose is to reveal hidden dependencies and feed lessons back into the program, not to pass an audit.

What lessons can organizations learn from the CrowdStrike outage?

The July 2024 CrowdStrike incident, where a faulty Channel File 291 update bricked roughly 8.5 million Windows devices, showed that a single trusted vendor can become your outage. Firms that recovered fastest had rehearsed manual workarounds and a clear map of which important services touched the affected estate. Concentration risk stays invisible until the day a shared dependency fails.

What is the role of the board in operational resilience governance?

The board owns the decisions a resilience program produces: where to invest in remediation, which vulnerabilities to accept, which vendors to change, and when to recalibrate a tolerance. APRA CPS 230 §23–24 and the UK Senior Managers regime place this accountability explicitly at the top of the house, supported by self-assessments and board reporting.

How does operational resilience apply to third-party and supply chain risk?

Each important business service depends on third and fourth parties, and a single supplier can support several services, creating concentration risk. Regulators now mandate oversight: APRA CPS 230 §51 requires a material service provider register, and DORA imposes parallel ICT third-party obligations. Change Healthcare showed how a fourth-party clearinghouse most firms never mapped could disable an entire sector.

What Is Operational Resilience? Definition, Frameworks, Examples

What is DORA and when did it come into effect?

The Digital Operational Resilience Act (DORA), Regulation (EU) 2022/2554, applied across the EU from 17 January 2025 to roughly 20 types of financial entity. It covers ICT risk management, incident reporting, resilience testing, third-party risk, and information sharing. Article 24 mandates a digital operational resilience testing program, and Article 12 governs backup and restoration policies.

The organizations that struggled most once a crisis hit weren't the ones missing a plan. They were the ones who couldn't answer a simpler question: which of our critical services just stopped, and how long can we survive that?

That question sits at the center of operational resilience. A binder of recovery procedures tells you what to do; it rarely tells you what's actually at risk, for how long, and whose downstream service breaks when yours does. The rest of this guide unpacks how operational resilience is defined and regulated, how it relates to business continuity and disaster recovery, the building blocks every program needs, and what recent incidents reveal about the gap between documentation and capability.

What Is Operational Resilience?

Operational resilience has a specific, regulator-shaped meaning that goes well beyond an informal sense of being robust. The term entered most practitioners' vocabulary through financial-services rulebooks, and those rules pin it to an outcome: the continued delivery of the services customers and markets depend on, even when individual systems fail.

Operational resilience is the ability of an organization to prevent, adapt to, respond to, recover from, and learn from operational disruptions while continuing to deliver important business services within predefined impact tolerances. It shifts the focus from protecting individual assets to assuring end-to-end delivery of services that customers and markets rely on.

The regulatory origin matters because it makes the definition testable rather than aspirational. UK firms operate against FCA PS21/3 and the SYSC 15A.2 rules, which require them to name their important business services and prove they can keep delivering them inside a stated tolerance. That is a measurable claim, not a posture.

Why the definition shifts focus from processes to services

Traditional continuity work tended to protect individual processes and recover individual systems. Operational resilience reorganizes the question around the service a customer or counterparty actually receives, such as a mortgage drawdown, a card authorization, or an admissions booking, and works backwards from that outcome through every dependency that supports it.

The PRA's SS1/21 supervisory statement, at §4.12–4.16, frames important business services as those whose disruption could cause intolerable harm to consumers or risk to market integrity. The unit of analysis is the service, not the internal task that feeds it.

This is also where documentation-heavy programs run aground. A plan was never designed to make anyone more resilient. Mapping a service end-to-end, then testing whether it survives a severe disruption, is what produces resilience. Mapping the critical dependencies behind each service is the work that most binders skip.

Operational Resilience vs Business Continuity, Disaster Recovery, and Operational Risk

These four terms get used interchangeably, and that conflation causes real confusion for anyone newly tasked with resilience. They describe distinct, nested disciplines. Each has its own artifacts, its own success criteria, and its own place in a complete program. Getting the relationship right lets a practitioner position their work without devaluing the foundations it stands on.

Dimension	Business Continuity	Disaster Recovery	Operational Risk Mgmt	Operational Resilience
Primary focus	Maintaining critical operations	Restoring IT systems	Controlling causes of loss	Assuring end-to-end service delivery
Unit of analysis	Business function	Technology asset	Risk event	Important business service
Key metric	RTO / recovery strategy	RTO / RPO	Loss likelihood and impact	Impact tolerance
Core assumption	Disruption must be recovered	Systems must be restored	Risk can be reduced	Disruption will occur regardless
Foundational artifact	BIA and BCP	DR runbook	Risk register	Service map and tolerance statement

How business continuity fits inside operational resilience

Business continuity is a holistic process. It begins with a business impact analysis, informs the development of business continuity plans and recovery strategies, and helps an organization maintain critical operations during and after disruption. None of that is retired by operational resilience. The BIA in particular remains a foundational, non-negotiable artifact. You cannot set a credible impact tolerance without first understanding what a function does and what breaks when it stops.

Many resilience programs build on the Plan-Do-Check-Act backbone of a business continuity management system. ISO 22301:2019, Clause 8.4 specifies the business continuity plans and procedures an organization needs to manage disruption. Operational resilience sits above this layer, asking whether the sum of those plans actually assures service delivery, a question a single plan, however well written, cannot answer on its own.

Disaster recovery, operational risk, and cyber resilience compared

Disaster recovery is the IT-restoration subset, concerned with bringing technology back within defined recovery time and recovery point objectives. Operational risk management identifies and controls the causes of failure. Operational resilience assumes failure will happen anyway and asks whether the service survives it. Cyber resilience is a critical input but narrower than the all-hazards scope resilience demands.

The NIST Cybersecurity Framework 2.0, with its Govern, Identify, Protect, Detect, Respond, and Recover functions, is a strong example of a cyber-focused discipline that feeds operational resilience without being coextensive with it. A ransomware event is one of many severe-but-plausible scenarios a resilient firm must withstand, not the whole of the problem.

For a deeper treatment of where these lines fall, the distinction between business resilience and business continuity covers the adjacent vocabulary practitioners trip over most often when they first inherit a resilience remit.

Why Operational Resilience Matters Now

Three forces pushed operational resilience onto board agendas. Disruption became more frequent and more interconnected. Single points of failure turned into systemic ones. And regulators converted good practice into binding obligation. The business case rests on data, not on rhetoric about a volatile world.

Disruption is more frequent, costly, and interconnected

Business interruption and cyber risk now sit at the top of global risk rankings. The Allianz Risk Barometer 2026 placed cyber incidents as the top global business risk, with business interruption second at 31% of responses. The BCI Horizon Scan Report 2025 likewise rated cyber the highest-ranked threat both for the coming year and for the next five to ten years.

The costs are concrete. IBM's 2025 Cost of a Data Breach analysis found US breach costs rose 9% year-over-year to $10.22 million, even as the global average declined. What makes these numbers a resilience problem rather than a security one is concentration. When many firms depend on the same payments processor, cloud region, or endpoint agent, a single failure propagates across the system. That is the systemic-risk dynamic regulators now watch closely.

Regulators have made resilience non-negotiable

The regulatory wall has already arrived. The UK's full-compliance deadline passed on 31 March 2025, DORA applied across the EU from 17 January 2025, and APRA's CPS 230 took effect on 1 July 2025. These are not consultation papers. They are live obligations with self-assessment and board-attestation requirements.

The Bank of England now treats operational resilience as a financial-stability concern, not merely a firm-level one. Its Financial Stability in Focus report on operational resilience frames the macroprudential case: a disruption at a systemically important institution can ripple through the wider financial system. Penalties are real, too. The regulators have already fined firms for resilience failings, as the TSB case below shows. For UK firms specifically, the FCA operational resilience requirements translate these expectations into supervisory practice.

The Core Components of an Operational Resilience Framework

Across every regulatory regime, the same five building blocks recur. Whether a firm answers to the FCA, DORA, or APRA, the underlying method is consistent. Name the services that matter. Understand what they depend on. Decide how much disruption is tolerable. Test that decision against reality. Feed what you learn back into the program.

Identifying important business services and critical operations

Important business services are externally facing services whose failure would cause intolerable harm to customers, threaten market integrity, or endanger the firm's own viability. They are not the same as internal processes. A payroll run is an internal process; the ability of a retail bank's customers to make a payment is an important business service.

The identification exercise works from the customer or market outcome backwards. The PRA's SS1/21, at §4.12–4.16, requires firms to identify these services explicitly rather than infer them. Done well, the identification of important business services reframes the entire program around outcomes the reader's customers would actually notice losing.

Setting and testing impact tolerances

An impact tolerance is the maximum tolerable level of disruption to an important business service, usually expressed as a time threshold or a quantified metric. It is set from the point of view of harm to customers and markets, not from internal recovery convenience. This is the crucial difference between a tolerance and a recovery time objective. An RTO asks how fast IT can restore a system. A tolerance asks how long the service can be down before the harm becomes intolerable.

Consider a payments firm that sets a tolerance of two hours for its card-authorization service. The PRA's SS1/21, at §5.3, expects firms to test whether they can remain within that tolerance under severe-but-plausible conditions, not merely to assert it. If the firm's failover takes three hours, the tolerance is breached on paper before any incident occurs. That gap is exactly what the program exists to surface. Setting credible impact tolerances is where many programs first discover their recovery assumptions don't hold.
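The arithmetic in the card-authorization example can be sketched in a few lines of Python. This is a minimal illustration rather than a regulatory method: the function name tolerance_gap is an assumption made for the example, and the two-hour and three-hour figures are simply the hypothetical values from the paragraph above. A real program would take the recovery figure from test evidence.

```python
from datetime import timedelta


def tolerance_gap(impact_tolerance: timedelta, demonstrated_recovery: timedelta) -> timedelta:
    """Return how far demonstrated recovery overshoots the tolerance.

    A positive result means the tolerance would be breached; zero or a
    negative result means the service would stay within tolerance.
    """
    return demonstrated_recovery - impact_tolerance


# Hypothetical figures from the card-authorization example.
gap = tolerance_gap(
    impact_tolerance=timedelta(hours=2),
    demonstrated_recovery=timedelta(hours=3),
)
breached = gap > timedelta(0)
print(f"gap={gap}, breached={breached}")  # gap=1:00:00, breached=True
```

The comparison runs against the harm-based tolerance, not the IT recovery time objective, which is the distinction the section draws.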

Mapping dependencies and third-party risk

Every important service rests on a chain of people, processes, technology, facilities, and third and fourth parties. Mapping that chain reveals where a single supplier or system supports several services at once. That is concentration risk, and it turns one failure into many. Critical third-party providers create systemic single points of failure that no internal plan can fully control.

Regulators now mandate this oversight directly. APRA's CPS 230, at §51, requires a register of material service providers, and DORA imposes parallel obligations on ICT third parties. Managing third-party risk is no longer a procurement footnote. It is central to whether a service survives.
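One way to make concentration risk visible is to invert the service map and count how many important services each provider supports. The sketch below assumes a simple dictionary of services to providers. Every service and provider name in it is hypothetical, and the threshold of two services is an illustrative choice, not a regulatory figure.

```python
from collections import defaultdict

# Hypothetical service-to-provider map; every name here is illustrative.
service_dependencies = {
    "card authorization": {"Cloud Region A", "Payments Processor X"},
    "online banking": {"Cloud Region A", "Identity Provider Y"},
    "mortgage drawdown": {"Payments Processor X", "Cloud Region A"},
}


def concentration_points(dependencies, threshold=2):
    """Return providers that support at least `threshold` services."""
    services_by_provider = defaultdict(set)
    for service, providers in dependencies.items():
        for provider in providers:
            services_by_provider[provider].add(service)
    return {
        provider: sorted(services)
        for provider, services in services_by_provider.items()
        if len(services) >= threshold
    }


shared = concentration_points(service_dependencies)
for provider in sorted(shared):
    print(provider, "->", shared[provider])
```

In this made-up map, Cloud Region A supports all three services and Payments Processor X supports two, so both would be flagged for closer oversight.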

Scenario testing and learning loops

Testing exists to reveal hidden dependencies, not to pass an audit. A program tests its important services against severe-but-plausible scenarios, such as a data-center loss, a key supplier outage, or a ransomware event, through tabletop and simulation exercises, then measures whether the service stayed within tolerance. DORA's Article 24 mandates a digital operational resilience testing program precisely so that weaknesses surface in an exercise rather than in production.

The learning loop closes the cycle. Lessons from each test feed back into service maps, tolerances, and remediation plans, the literal "learn" in the resilience definition. The point is not just to fix the immediate fault but to question the assumptions that let it persist. Structured scenario testing is what converts a static map into a capability that improves with use.
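The measurement step of a scenario test can be sketched as a comparison between the outage observed in an exercise and each affected service's tolerance. The example below is a simplified illustration under stated assumptions: the Service class, the evaluate_scenario function, and all service names, dependency names, and durations are hypothetical. It is not drawn from any regulator's template.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Service:
    name: str
    impact_tolerance: timedelta
    dependencies: frozenset


def evaluate_scenario(failed, observed_outage, services):
    """Compare measured outages with tolerances for every affected service.

    failed: set of dependency names the scenario takes out.
    observed_outage: service name -> outage measured during the exercise.
    """
    rows = []
    for service in services:
        hit = service.dependencies & failed
        if not hit:
            continue  # the scenario does not touch this service
        outage = observed_outage.get(service.name)
        # None means the exercise never measured this service, i.e. a
        # dependency the test plan did not anticipate.
        within = None if outage is None else outage <= service.impact_tolerance
        rows.append({
            "service": service.name,
            "failed_dependencies": sorted(hit),
            "within_tolerance": within,
        })
    return rows


# Hypothetical services and a hypothetical data-center-loss exercise.
services = [
    Service("card authorization", timedelta(hours=2), frozenset({"DC-1", "Processor X"})),
    Service("online banking", timedelta(hours=4), frozenset({"DC-1"})),
    Service("mortgage drawdown", timedelta(hours=24), frozenset({"DC-2"})),
]
results = evaluate_scenario(
    failed={"DC-1"},
    observed_outage={"card authorization": timedelta(hours=3)},
    services=services,
)
for row in results:
    print(row)
```

The unmeasured row for online banking is the interesting one: the map says the scenario touches the service, but the exercise never looked at it. That is the kind of finding the learning loop is meant to feed back into mapping.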

The Global Regulatory Landscape: UK, EU, Australia, and the US

Obligations now differ by jurisdiction, but they rhyme. Identify services, set tolerances, map dependencies, test, and assign board accountability. That pattern repeats across the UK, EU, and Australia, with the US taking a more guidance-led route. The table below compares the major regimes; the subsections add the clause-level detail a compliance lead needs.

Regime	Authority	Key obligations	Compliance milestone
UK SS1/21 + PS21/3	PRA and FCA	Important business services, impact tolerances, mapping, scenario testing	Full compliance 31 March 2025
DORA	EU (via EBA, EIOPA, ESMA)	ICT risk management, incident reporting, resilience testing, third-party risk, information sharing	Applied 17 January 2025
CPS 230	APRA	Critical operations, tolerance levels, BCPs, service-provider management	Effective 1 July 2025
US guidance	FFIEC / interagency	Sound practices for operational resilience; no single binding resilience rule	Ongoing supervisory expectation

UK: PRA SS1/21 and FCA PS21/3

The UK operates a joint PRA and FCA regime. Firms had until 31 March 2025 to demonstrate they could remain within their impact tolerances for each important business service. The PRA's PS6/21 policy statement and the FCA's PS21/3 rules under SYSC 15A.2 require firms to identify important business services, set impact tolerances, map dependencies, and run scenario tests.

The regime applies to banks, building societies, insurers, and a broad set of FCA-regulated firms. Its breadth is deliberate: the regulators wanted the duty to reach beyond the largest institutions. The detail lives in the FCA PS21/3 final rules, which spell out supervisory expectations clause by clause.

EU: the Digital Operational Resilience Act (DORA)

DORA applied from 17 January 2025 across some 20 types of financial entity, as the EIOPA overview confirms. It rests on five areas: ICT risk management, incident reporting, resilience testing, third-party risk management, and information sharing. Unlike the UK's principles-based approach, DORA is a directly applicable regulation with detailed technical standards.

Two obligations anchor the testing pillar. Article 24, in the EBA's interactive rulebook, sets the general requirements for a digital operational resilience testing program, and Article 12 of the regulation governs backup and restoration policies.

Australia (APRA CPS 230) and US guidance

APRA's CPS 230 took effect on 1 July 2025. It requires regulated entities to identify critical operations, set tolerance levels, maintain business continuity plans, and manage service-provider risk. Crucially, CPS 230, at §23–24, places explicit accountability on the board and senior management. Resilience is a governance duty here, not a back-office exercise. APRA's operational risk management hub houses the accompanying CPG 230 guidance.

The US route differs. Rather than a single binding resilience regulation, US supervisors rely on interagency and FFIEC guidance, leaving firms to interpret sound-practices expectations against their own risk profile. The practical effect is a less prescriptive but still real expectation that institutions can sustain critical operations through disruption.

Operational Resilience Examples: Lessons from Recent Incidents

Abstract definitions become concrete under stress. Four incidents across financial services, healthcare, and technology supply chains show what resilient versus brittle responses look like. Each one turns on the same question of whether an organization could see its real dependencies before they failed.

CrowdStrike (2024): when a third-party update became your outage

A faulty Channel File 291 content update to CrowdStrike's Falcon sensor, pushed on 19 July 2024, sent Windows endpoints into boot loops worldwide. The update affected approximately 8.5 million Microsoft Windows devices, close to 1% of all Windows systems globally. CrowdStrike reverted the faulty file quickly, but recovery required machine-by-machine manual intervention, so the fix was slow even though the root cause was identified fast.

The operational damage cascaded across sectors. Airlines grounded fleets, hospitals fell back to paper, and emergency lines and banks were disrupted, as BBC coverage documented. The firms that fared best had two things the others lacked: manual workarounds rehearsed in advance, and a clear map of which important services touched the endpoint estate. The takeaway for practitioners is uncomfortable. A single trusted vendor can become your outage, and concentration risk is invisible until the day it isn't.

Change Healthcare and Ascension (2024): healthcare's hidden dependencies

In February 2024, the ALPHV/BlackCat ransomware group breached Change Healthcare through a Citrix remote-access portal that lacked multi-factor authentication, then encrypted systems that process roughly half of all US medical claims. The fallout was vast. 94% of hospitals reported financial impact and 74% reported direct patient-care impact, while costs to UnitedHealth Group reached $2.457 billion by the third quarter and nearly 193 million individuals were affected. The US Office of Financial Research brief catalogued the systemic spread.

Months later, in May 2024, a ransomware attack on Ascension forced clinicians across 140 hospitals back to paper-based processes, as a healthcare attack-trend analysis recorded. Both incidents teach the same lesson. Impact tolerances must account for fourth-party dependencies that never appear on an org chart. A hospital that had never identified Change Healthcare as a critical node in its claims-processing service had no way to set a tolerance against losing it.

TSB (2018): when an IT migration breaks every channel at once

In April 2018, TSB attempted to migrate 1.3 billion customer records from Lloyds Banking Group's systems to a platform run by its Spanish parent, Sabadell. The new platform failed immediately, locking a large share of 5.2 million customers out of branch, telephone, online, and mobile banking at once, with issues persisting into December. The FCA and PRA together fined TSB £48.65 million for operational risk and governance failures, and the Bank of England's notice confirmed total costs exceeding £378 million.

TSB earns its place here despite predating the 2024 examples because it is the canonical case that helped shape the UK regime. Pair it with CrowdStrike and Change Healthcare and the pattern is identical: an organization that could not see how a single change would cascade across every channel its customers relied on. It is an older incident with the same dependency-blindness, and its £48.65 million fine put operational resilience on every UK board agenda.

How to Build Operational Resilience: A Practical Methodology

Building resilience follows a repeatable analytic path. The sequence below frames the work as inputs, analytic steps, and the artifacts and decisions it produces, grounded in the regulatory regimes above rather than invented from scratch. Run it as a cycle, not a one-time project.

1. Determine which regulatory regime applies and scope the program accordingly.
2. Gather your service catalog, existing BIAs, third-party register, and incident history.
3. Identify important business services from the customer or market outcome backwards.
4. Map each service's end-to-end dependencies, including third and fourth parties.
5. Set impact tolerances from harm, not from internal recovery convenience.
6. Run severe-but-plausible scenario tests against each tolerance.
7. Feed gaps back into mapping and remediation, then report to the board.

Inputs: what you gather before you start

The quality of the output depends on the quality of the inputs. Before any mapping begins, assemble the service catalog, existing business impact analyses, the third-party register, the technology asset inventory, and the prior incident history. A strong business impact analysis is the single most valuable input. It already encodes what each function does and how quickly its loss hurts.

Stakeholder input matters as much as documents. Operations, risk, IT, and front-line staff each hold part of the dependency picture. Front-line knowledge in particular surfaces the workarounds and undocumented reliances that never reach a register.

Analytic steps: identify, map, set tolerances, test

The core analytic loop is to identify important business services, map their end-to-end dependencies, set impact tolerances, and run severe-but-plausible scenario tests. Each step iterates. A gap found in testing resets the prior steps. A tolerance that can't be met sends you back to mapping, and a newly discovered dependency reopens the test.

The PRA's SS1/21, at §4.12–4.16 and §5.3, describes this as an ongoing expectation rather than a milestone. The point is repetition. A program that maps once and files the result has produced a document, not a capability.

Artifacts and decisions: what the program produces

The method yields a defined set of artifacts: a service map, a dependency map, impact tolerance statements, a self-assessment, and a board report. These are the evidence a regulator will ask to see, and the inputs a board needs to govern.

The decisions matter more than the documents. Where to invest in remediation, which vulnerabilities to accept, which vendors to change, and when to recalibrate a tolerance are all board-level calls. CPS 230, at §23–24, and the UK's Senior Managers regime both place that ownership explicitly at the top of the house. A practical operational resilience framework ties these artifacts and decisions into a single governed cycle.

Industry-Specific Nuances: Financial Services and Healthcare

Resilience plays out differently by sector because the harm of failure differs. What counts as an intolerable disruption to a trading desk is not what counts as one in an emergency department. Two sectors illustrate the range, and both reward a practitioner who reads the regime through the lens of who gets hurt first.

Financial services: payments, market integrity, and concentration risk

In financial services, impact tolerances are tied to market integrity and consumer harm rather than to internal downtime alone. Payments and trading dependencies create concentration risk at a financial-stability level, which is precisely why the Bank of England's macroprudential analysis treats the operational resilience of systemically important firms as a systemic concern.

The regulatory burden is also heaviest here. DORA and the PRA/FCA regime impose the most demanding testing and third-party oversight obligations of any sector. Firms working through the overlap will recognize the challenge of operational resilience for financial services, where DORA, the FCA rules, and CPS 230 can all apply to different parts of the same group.

Healthcare: patient safety as the impact tolerance

In healthcare the tolerance is measured in patient harm, not revenue or downtime. The Ascension ransomware attack of May 2024 showed both the stakes and a genuine control: paper-based fallback procedures kept 140 hospitals running, however slowly, when systems went dark, as the healthcare attack-trend analysis noted.

The deeper lesson is fourth-party exposure. Change Healthcare demonstrated that a single clearinghouse most providers never thought of as their own dependency could take out claims processing across an entire sector. A healthcare resilience program that maps only its direct vendors will miss the node that matters most when claims stop flowing.
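Finding dependencies that sit behind a direct vendor is a graph-traversal problem. The sketch below walks a supplier graph breadth-first and reports anything two or more hops from the service. It assumes the graph has already been collected from vendors. The depends_on dictionary, the dependency_depths function, and every name in the data are hypothetical.

```python
from collections import deque

# Hypothetical supplier graph: each key depends on the names in its list.
depends_on = {
    "claims processing": ["Billing Vendor"],
    "Billing Vendor": ["Clearinghouse"],
    "Clearinghouse": [],
    "patient scheduling": ["Scheduling SaaS"],
    "Scheduling SaaS": ["Cloud Host"],
    "Cloud Host": [],
}


def dependency_depths(service, graph):
    """Breadth-first walk returning each dependency's distance from the service.

    Depth 1 is a direct vendor; depth 2 or more is a party the
    organization has no contract with.
    """
    depths = {}
    queue = deque((dep, 1) for dep in graph.get(service, []))
    while queue:
        node, depth = queue.popleft()
        if node in depths or node == service:
            continue  # already reached by a shorter path, or a cycle
        depths[node] = depth
        queue.extend((dep, depth + 1) for dep in graph.get(node, []))
    return depths


depths = dependency_depths("claims processing", depends_on)
indirect = [name for name, depth in depths.items() if depth >= 2]
print(indirect)  # ['Clearinghouse']
```

A register that stops at depth 1 would list only the billing vendor here. The clearinghouse appears only when the walk continues past the direct contract.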

Why Documentation-Heavy Programs Fail

Many organizations mistake a shelf of plans for resilience. The gap between a compliant binder and an operational capability is where incidents do their damage, and it is a gap that testing, not documentation, closes. This is the status quo worth naming plainly.

Compliance theater versus operational capability

The brittle organization has plans. What it cannot do, when an incident lands, is answer quickly which services are at risk and for how long. The resilient organization has mapped its dependencies and tested its tolerances against scenarios that resemble the day it is now living through.

The difference is not effort. Both organizations worked hard. The difference is what the work was designed to produce. A program built to satisfy an examiner produces evidence of compliance. A program built to survive disruption produces tested tolerances and a current dependency map. The two can coexist, but only the second one helps at 2 a.m. The wider problem of compliance theater is that an improving audit record can mask flat operational readiness.

What good looks like: measurable, tested, and governed

Resilience maturity is observable. It shows up as tested tolerances, live dependency maps, and remediation decisions taken at board level, not as a page count. The metric that matters is the gap between actual recovery performance and the stated impact tolerance, not whether an RTO was nominally met.

New dependency blind spots keep appearing, and they need governing explicitly. IBM's 2025 analysis found that shadow AI added $670,000 to average breach costs, and 97% of AI-related breaches lacked proper access controls. Unsanctioned AI tools and shadow IT are exactly the kind of undocumented dependency that scenario testing exists to surface before an incident does. A program that treats resilience as a documentation cycle will keep discovering these dependencies the hard way, one breach at a time.

Operational resilience is one discipline within a broader enterprise resilience capability, alongside business continuity, disaster recovery, and crisis management, and it is strongest when those disciplines reinforce rather than duplicate each other.
