What is disaster recovery? A complete guide

Most organizations discover the gap between their disaster recovery documentation and their actual recovery capability at the worst possible moment: mid-outage, clock running, working from a plan that hasn't been exercised since the last audit cycle. The binder says four hours. The restore takes three days, because a dependency moved, a credential expired, or the backups were never actually tested against a clean rebuild.

That gap is the subject of this article. The sections below walk through what disaster recovery actually is, the metrics that make it measurable, the architectures available to hit those metrics, and the testing and regulatory pressure that increasingly demand evidence rather than intent.

What is disaster recovery?

Disaster recovery is the set of policies, tools, and tested procedures an organization uses to restore IT systems, applications, and data after a disruptive event, within timeframes the business can survive. It is a continuously exercised capability rather than a one-time plan, focused specifically on the technology layer that critical operations depend on.

The word that does the work in that definition is tested. A documented recovery procedure that has never been executed against production-like conditions is a hypothesis. It might be right or it might be wrong, and you find out which when the outage hits.

Disaster recovery defined

Disaster recovery covers the systems, applications, data stores, and the dependencies that connect them. Its scope is deliberately IT-centric. Restoring a database server is part of it; deciding which business service that server serves, and how quickly it must come back, belongs upstream in continuity planning and feeds the DR targets.

The NIST SP 800-34 Rev. 1 Contingency Planning Guide frames this as a seven-step process running from policy and business impact analysis through preventive controls, recovery-strategy selection, plan development, testing, and maintenance. The structure matters because it places recovery objectives and testing inside the same loop, not as bookends around a static document. A plan written once and reviewed annually is not following that loop; it is approximating it badly.

Why disaster recovery matters now

Business interruption is not an exotic tail risk. It has ranked first or second in every Allianz Risk Barometer for the past decade, and the surveyed risk managers are the people who pay the claims. Cyber sits alongside it: the BCI Horizon Scan 2025 put cyber security at 63.6% as the overwhelming long-term concern among continuity professionals.

The reason these two risks travel together is concentration. A single vendor failure can cascade across thousands of organizations within hours. On 19 July 2024, a faulty Channel File 291 content update to CrowdStrike's Falcon sensor, pushed at 04:09 UTC, sent roughly 8.5 million Windows endpoints into boot loops. CrowdStrike reverted the update within 78 minutes, but the rollback did nothing for machines that had already crashed: recovery required someone to touch each machine individually, often in safe mode, often deleting a specific driver file by hand. Airlines grounded fleets. Hospitals fell back to paper. The trigger was not a hostile actor or a natural disaster. It was a routine update to trusted security software, and recovery time was governed entirely by how manual the restore process turned out to be.

Disaster recovery vs business continuity

Disaster recovery and business continuity get used interchangeably, and the conflation causes real operational damage. They answer different questions at different scopes. Business continuity is the wider discipline; disaster recovery is the IT-restoration practice nested inside it.

Getting the relationship right is not pedantry. It determines whether your recovery targets are driven by what the business can tolerate or by what the infrastructure team finds convenient.

How the two disciplines relate

Business continuity is a holistic process. It begins with a business impact analysis, informs the development of business continuity plans and recovery strategies, and helps the organization keep critical operations running during and after disruption. Disaster recovery is the subset of that process concerned with restoring the IT systems, data, and applications those operations rely on.

The ISO 22301 business continuity standard makes the sequencing explicit: Clause 8.2 requires the business impact analysis and risk assessment, and Clause 8.3 requires recovery strategies derived from them. DR strategies are one output of that 8.3 work. Enterprise resilience encompasses both BCM and DR and does not retire either; it adds the systemic, cross-dependency view that individual recovery plans cannot see on their own. For the broader frame, see how this connects to organizational resilience as a whole.

The business impact analysis is the non-negotiable input feeding both disciplines. If you want to understand why so many programs struggle here, the failure usually traces back to business impact analyses that produce bad data about what depends on what.

Why the distinction matters operationally

Restoring servers without restoring the business service is a half-recovery that looks complete on a status dashboard. The infrastructure team reports systems green. The payments desk still cannot take orders, because a downstream reconciliation feed nobody mapped is still down.

That is what happens when DR targets are set by IT convenience rather than business impact: systems come back in the wrong order. The easy ones come back first. The ones the business actually needed are still queued. Deriving every recovery objective from impact-over-time is what avoids this, which is what the next section covers.
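
One way to keep restoration order honest is to derive it from the dependency map instead of from convenience. The sketch below is illustrative only: the service names and the `DEPENDS_ON` map are hypothetical, and it assumes the map is complete, which is exactly the assumption that fails in practice. It uses Python's standard-library `graphlib` to produce an order in which every service comes back after the things it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up before it.
DEPENDS_ON = {
    "payments-desk": {"order-api", "reconciliation-feed"},
    "order-api": {"core-db"},
    "reconciliation-feed": {"core-db", "message-bus"},
    "core-db": set(),
    "message-bus": set(),
}

def recovery_order(depends_on):
    """Return services in an order where dependencies are restored first."""
    return list(TopologicalSorter(depends_on).static_order())

order = recovery_order(DEPENDS_ON)
print(order)
```

If the reconciliation feed were missing from the map, the payments desk would appear recoverable as soon as the order API was up, which is the half-recovery described above.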

RTO, RPO, and the metrics that drive recovery

Recovery metrics are the translation layer. They take a business statement ("we cannot be down past lunchtime on a trading day") and turn it into an engineering target a platform team can architect against. Get the metrics wrong and you either overspend on resilience nobody needs or underspend on the systems that will hurt you.

The two that matter most are Recovery Time Objective and Recovery Point Objective. Everything else in this section either feeds them or bounds them.

RTO and RPO explained

Recovery Time Objective (RTO) is how long restoration can take before the harm becomes unacceptable to the business. Recovery Point Objective (RPO) is how much data, measured in time, the business can afford to lose. An RPO of fifteen minutes means your replication or backup cadence must guarantee no more than a quarter-hour of lost transactions.

| Metric | What it measures | What it constrains | Example target |
| --- | --- | --- | --- |
| RTO | Maximum tolerable restoration time | Architecture, failover speed, automation | 1 hour for a payments gateway |
| RPO | Maximum tolerable data loss (in time) | Backup/replication frequency | Near-zero for trade settlement |
| MTPD/MAO | Outer survivability limit for the outage | Overall recovery investment | 8 hours for a customer portal |
| MBCO | Minimum acceptable service during recovery | Degraded-mode design | 40% of normal throughput |
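
The link between an RPO and a backup cadence can be checked with simple arithmetic. The sketch below assumes a simplified model in which the worst-case loss is the interval between copies plus any replication lag; it ignores the time a backup takes to complete, and the function names are hypothetical.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval, replication_lag=timedelta(0)):
    """Worst case: the failure lands just before the next copy would be taken."""
    return backup_interval + replication_lag

def meets_rpo(rpo, backup_interval, replication_lag=timedelta(0)):
    """True if the cadence keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, replication_lag) <= rpo

rpo = timedelta(minutes=15)

# Hourly snapshots cannot honor a fifteen-minute RPO.
hourly_ok = meets_rpo(rpo, backup_interval=timedelta(hours=1))

# Ten-minute snapshots with two minutes of replication lag can.
frequent_ok = meets_rpo(
    rpo,
    backup_interval=timedelta(minutes=10),
    replication_lag=timedelta(minutes=2),
)
print(hourly_ok, frequent_ok)  # prints: False True
```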

Tighter targets cost more, and the cost curve is steep near zero. The NIST contingency planning guide is blunt that recovery objectives must come out of the business impact analysis, not from an aspirational round number. An RTO of "as fast as possible" is not a target. It is an abdication.

MTPD, MBCO, and RCO

A few more metrics bound the picture. Maximum Tolerable Period of Disruption (MTPD), sometimes called Maximum Acceptable Outage (MAO), sets the outer limit past which the organization may not recover at all. Your RTO must sit comfortably inside it. Minimum Business Continuity Objective (MBCO) defines the reduced service level you must hit during recovery, often well below normal. Recovery Consistency Objective (RCO) captures how consistent replicated data must be when you fail over.
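
What "comfortably inside" means is a judgment call, but it can be written down and checked. The following is a minimal sketch that assumes a 25% headroom rule; that margin is an illustrative assumption, not a figure from any standard.

```python
from datetime import timedelta

def rto_fits_mtpd(rto, mtpd, safety_margin=0.25):
    """True if the RTO leaves at least `safety_margin` of the MTPD as headroom."""
    return rto <= mtpd * (1 - safety_margin)

mtpd = timedelta(hours=8)

comfortable = rto_fits_mtpd(timedelta(hours=1), mtpd)  # well inside the limit
too_tight = rto_fits_mtpd(timedelta(hours=7), mtpd)    # only one hour of slack
print(comfortable, too_tight)  # prints: True False
```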

ISO 22301 Clause 8.2 anchors MTPD in the business impact analysis, which is where these limits are derived rather than guessed. The deeper treatment of recovery time objective versus recovery point objective works through how the metrics interact.

Setting targets from a business impact analysis

Rank services by impact-over-time, not by which team shouts loudest. A service that costs little in the first hour but becomes catastrophic at hour twelve has a different recovery profile than one that hurts immediately.

Consider a payments authorization workload at a mid-size bank. The business impact analysis shows that an outage during business hours costs reputational and regulatory damage within minutes, and settlement obligations make data loss intolerable. That maps to a near-zero RPO (synchronous replication) and a sub-hour RTO (warm or active-active failover). A monthly batch reporting job at the same bank tolerates a 24-hour RTO and an overnight RPO. Same institution, two orders of magnitude apart in recovery cost. The honesty of the RTO and RPO metrics that flow from a business impact analysis is bounded by the honesty of the dependency mapping underneath them, and that is usually where the real argument is.
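
Ranking by impact-over-time can be sketched as follows. The impact curves, the 0 to 100 scale, and the tolerance of 50 are all hypothetical stand-ins for real business impact analysis output; the point is only that the ordering falls out of the data rather than out of whoever argues hardest.

```python
# Hypothetical BIA output: cumulative impact (arbitrary 0-100 scale) by elapsed hour.
IMPACT_CURVES = {
    "batch-reporting": {1: 0, 4: 0, 12: 5, 24: 20},
    "customer-portal": {1: 10, 4: 30, 12: 80, 24: 100},
    "payments-authorization": {1: 90, 4: 100, 12: 100, 24: 100},
}

def hours_until_intolerable(curve, tolerance):
    """First assessed hour at which impact reaches the tolerance, or None."""
    for hour in sorted(curve):
        if curve[hour] >= tolerance:
            return hour
    return None

def rank_by_urgency(curves, tolerance=50):
    """Most urgent first; services that never breach the tolerance sort last."""
    def sort_key(name):
        breach = hours_until_intolerable(curves[name], tolerance)
        return (breach is None, breach if breach is not None else 0)
    return sorted(curves, key=sort_key)

ranking = rank_by_urgency(IMPACT_CURVES)
print(ranking)
```

The portal in this example is cheap to lose for four hours and expensive by hour twelve, so it ranks behind payments but well ahead of batch reporting.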

Types of disasters disaster recovery must address

Different disaster classes fail in different ways and demand different recovery sequences. A flooded data center is a location problem. A corrupted dataset is a trust problem. Treating them with one generic plan is how organizations end up restoring from backups that are themselves compromised.

The cyber category in particular has fractured the old assumptions, which is why it gets its own section later.

Natural, technical, and human-caused disruptions

Natural events (flood, fire, storm, extended power loss) take out physical infrastructure and are the classic case for geographic site separation. Technical failures (hardware faults, software defects, expired certificates, capacity exhaustion) often hit without warning and can be self-inflicted. Human error remains one of the most common triggers and one of the hardest to design against: a fat-fingered deletion, a misapplied configuration, an accidental production change.

The CrowdStrike outage was a technical failure that was self-inflicted at the vendor layer and inherited by everyone downstream. It also showed how recovery sequencing differs by class: there was no failover site to cut to, because the problem was on the endpoints themselves, so recovery meant physically reaching millions of machines. A geographic DR strategy built for natural disasters offered no help at all.

Cyber and ransomware as a distinct DR category

Ransomware breaks the foundational assumption of traditional DR: that your backups are trustworthy. A wiper or encryptor does not just take systems down. It can encrypt, delete, or silently corrupt your recovery points, sometimes dwelling in the environment long enough to poison several backup generations.

The Colonial Pipeline ransomware attack in May 2021 remains the canonical reference for why a basic control gap becomes a national event. DarkSide operators entered through a single legacy VPN account that lacked multi-factor authentication. The company shut down the largest US refined-fuel pipeline for six days and paid a $4.4 million ransom, of which the DOJ later recovered $2.3 million. The same entry pattern, compromised credentials without MFA, recurred in 2024 against roughly 165 Snowflake customers, where the UNC5537 actor used infostealer-harvested logins to exfiltrate data from organizations including AT&T. Cyber recovery needs clean, isolated, validated backups and forensic confidence that the environment is clean before you restore. That is a different discipline from spinning up a warm site.

Disaster recovery strategies and site options

Every DR strategy is a deliberate trade between standing cost and recovery speed. Choosing well means matching the architecture to the RTO and RPO the business has actually signed off on, not the ones the infrastructure team wishes it had budget for.

The spectrum runs from cheap-and-slow to expensive-and-instant, and most enterprises run several points on it simultaneously, tiered by service criticality.

Backup-and-restore and cold/warm/hot sites

Backup is not disaster recovery. Backup preserves data; recovery restores operational capability. The gap between them is the restore, and the restore is where untested assumptions go to die. A backup you have never successfully restored under realistic conditions is a guess about your recovery time.

Site strategies trade standing cost against recovery speed across a clear gradient.

| Site type | Standing infrastructure | Typical RTO | Relative cost | Fits |
| --- | --- | --- | --- | --- |
| Cold site | Space and power, no live systems | Days | Low | Tolerant back-office workloads |
| Warm site | Pre-staged hardware, stale data | Hours | Medium | Important but not real-time services |
| Hot site | Fully mirrored, near-live data | Minutes | High | Trading, payments, customer-facing core |

For a financial-services trading workload, the math usually forces a hot site or active-active design: the cost of even an hour's downtime during market hours dwarfs the standing expense of a mirrored environment. For the same firm's internal HR system, a cold site is a defensible, cheaper choice. The Maersk NotPetya attack in June 2017 is the canonical lesson in why this matters: NotPetya, spread through a compromised Ukrainian accounting-software update, wiped systems across 600+ offices and 76 port terminals. Maersk could rebuild its Active Directory only because a single domain controller in a Ghana office happened to be offline during a power cut and survived. A $250–300 million recovery hinged on luck, not design.

Cloud and hybrid DR architectures

Cloud gives you finer-grained points on the cost/speed curve. Pilot light keeps a minimal core running so you can scale up fast. Warm standby keeps a scaled-down but functional copy live. Active-active runs production across regions so a single-region loss is barely visible. Each maps to a tightening RTO, and each costs more.
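
Matching a strategy to a signed-off RTO amounts to picking the cheapest option that still meets the target. The sketch below assumes illustrative achievable-RTO figures and an ordinal cost rank for each strategy; real values would come from your own test results and provider pricing, not from this table.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rto: timedelta  # assumed figure; must be proven by testing
    relative_cost: int         # ordinal rank only, 1 = cheapest

# Illustrative values, not benchmarks.
STRATEGIES = [
    Strategy("backup-and-restore", timedelta(days=2), 1),
    Strategy("pilot-light", timedelta(hours=4), 2),
    Strategy("warm-standby", timedelta(hours=1), 3),
    Strategy("active-active", timedelta(minutes=1), 4),
]

def cheapest_strategy(required_rto, strategies=STRATEGIES):
    """Cheapest strategy whose achievable RTO meets the requirement, or None."""
    candidates = [s for s in strategies if s.achievable_rto <= required_rto]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)

payments = cheapest_strategy(timedelta(hours=1))
reporting = cheapest_strategy(timedelta(hours=24))
print(payments.name, reporting.name)  # prints: warm-standby pilot-light
```

A `None` result means no listed architecture meets the target, which is a signal to revisit either the target or the budget.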

Multi-region failover also reduces single-site concentration risk, but the CrowdStrike event reminded everyone that geography is not the only concentration that matters. A shared software dependency can take out every region at once. Disaster Recovery as a Service (DRaaS) shifts the operational burden of running this to a provider; it does not shift accountability. When the regulator asks for evidence of a tested recovery, "our DRaaS vendor handles it" is not an answer.

What goes into a disaster recovery plan

A disaster recovery plan is the operational artifact that turns strategy into a repeatable, ownership-clear recovery sequence. The difference between a working plan and a binder is whether someone can pick it up at 3am and execute it without the author in the room.

The components below are table stakes. The maintenance discipline that keeps them accurate is where most programs quietly fail.

Core plan components and ownership

A usable plan contains the recovery sequence (what comes back, in what order), runbooks specific enough to follow under stress, contact and escalation trees with named individuals, and dependency maps that reflect the architecture as it is now. NIST SP 800-34 treats plan development and maintenance as a single recurring activity, not a publish-and-forget step.

Ownership is where plans rot. Every recovery action needs a named owner with documented authority to invoke, and a named backup for when that person is on a plane. A recovery binder that runs to hundreds of pages and cannot be navigated quickly under pressure is not an asset; it is exposure.
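
Ownership gaps are easy to check mechanically once the plan is held as data, not as prose. This is a minimal sketch with a hypothetical `RecoveryStep` structure and made-up names; it flags any action with no owner, no backup, or the same person in both roles.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecoveryStep:
    action: str
    owner: Optional[str] = None
    backup_owner: Optional[str] = None

def ownership_gaps(steps):
    """Return actions missing an owner or backup, or naming one person for both."""
    gaps = []
    for step in steps:
        if not step.owner or not step.backup_owner:
            gaps.append(step.action)
        elif step.owner == step.backup_owner:
            gaps.append(step.action)
    return gaps

# Hypothetical plan extract.
steps = [
    RecoveryStep("Fail over database", "A. Rivera", "J. Chen"),
    RecoveryStep("Rotate credentials", "A. Rivera"),
    RecoveryStep("Notify regulator", "S. Patel", "S. Patel"),
]
gaps = ownership_gaps(steps)
print(gaps)  # prints: ['Rotate credentials', 'Notify regulator']
```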

Why plans drift out of date

Architecture changes weekly. Vendors get swapped. A service migrates to a new region. Plans, reviewed annually, cannot keep pace with environments that change continuously. The plan that passed last year's audit describes a system that no longer exists.

This is the failure mode that catches sophisticated organizations. A plan can pass audit on paper, sail through the checklist, tick the certification box, and then fail during an actual outage because three dependencies shifted and the plan was never re-tested against the new reality. The DRII Professional Practices for BCM treat ongoing maintenance as a core competency precisely because the document degrades the moment it is published. Automated currency checks against the live environment beat a manual annual review, because the environment does not wait for your review cycle. This is the same pattern behind why business continuity plans measure activities instead of readiness and tend to fail in real incidents.
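
An automated currency check can be as simple as comparing the dependencies the plan documents with those observed in the live environment. The sketch below assumes both are available as plain mappings; how the live view is collected (a CMDB export, service discovery, cloud inventory) is out of scope here, and the service names are hypothetical.

```python
def plan_drift(documented, live):
    """Compare dependencies recorded in the plan with those observed live."""
    drift = {}
    for service in sorted(documented.keys() | live.keys()):
        in_plan = documented.get(service, set())
        observed = live.get(service, set())
        if in_plan != observed:
            drift[service] = {
                "missing_from_plan": observed - in_plan,
                "stale_in_plan": in_plan - observed,
            }
    return drift

# Hypothetical data: the plan predates a new reconciliation feed.
documented = {
    "payments-desk": {"order-api"},
    "order-api": {"core-db"},
}
live = {
    "payments-desk": {"order-api", "reconciliation-feed"},
    "order-api": {"core-db"},
}
drift = plan_drift(documented, live)
print(drift)
```

Run on every deployment instead of once a year, a check like this turns drift from an outage-time surprise into a routine ticket.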

Disaster recovery testing: types and frequency

A disaster recovery plan is worth exactly what its last test proved, and untested plans fail precisely when the stakes are highest. Testing is not a compliance chore. It is the only mechanism that converts a documented assumption into demonstrated capability.

The exercise types below probe different risks, and the right program uses several of them at different cadences.

From tabletop to full-interruption testing

Four exercise types sit on a rising scale of realism and risk.

| Test type | What it validates | Production risk | Typical use |
| --- | --- | --- | --- |
| Tabletop | Decisions, roles, communication | None | Quarterly, after plan changes |
| Simulation | Process under timed pressure | Low | Semi-annually for critical services |
| Parallel | Failover works without cutover | Low | Annually for tier-1 systems |
| Full interruption | True recovery under real failover | High | Carefully, for the most critical workloads |

A tabletop exercise surfaces decision and communication gaps cheaply. A parallel test validates that failover actually works without endangering production. Full-interruption testing is the most honest and the most dangerous, because it genuinely fails over live. Each exercise should produce an evidence trail (the actual-versus-target RTO and RPO), not a binary pass that tells an auditor nothing.

How often to test and what to capture

Critical systems warrant at least annual realistic testing, and meaningfully more often for tier-1 workloads. The cadence trap is testing only on the calendar. Test after major change, not just every twelve months, because the change is what invalidated the plan.

In-scope EU financial entities must now test their ICT recovery capability at least annually under DORA Article 24, with threat-led penetration testing on a longer cycle for the most significant firms. Capture actual-versus-target recovery times and, more importantly, remediate the gaps. A test that finds a four-hour RTO is really seven hours has done its job only if the seven becomes four before the next exercise. Many programs measure activity instead of readiness here, which is the trap disaster recovery testing frequency and method covers in detail.
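
Capturing actual-versus-target as data, and not as a pass or fail, might look like the sketch below. The `DrillResult` structure and the service name are hypothetical; the four-hour target and seven-hour actual mirror the example in the text.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class DrillResult:
    service: str
    target_rto: timedelta
    actual_rto: timedelta

    @property
    def met_target(self):
        return self.actual_rto <= self.target_rto

    @property
    def gap(self):
        """Overrun against target; zero when the target was met."""
        return max(self.actual_rto - self.target_rto, timedelta(0))

result = DrillResult(
    service="payments-gateway",
    target_rto=timedelta(hours=4),
    actual_rto=timedelta(hours=7),
)
print(result.met_target, result.gap)  # prints: False 3:00:00
```

The three-hour gap is the remediation backlog. The next exercise should record a smaller number against the same target.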

Ransomware recovery: a discipline of its own

Cyber incidents break the core premise of traditional DR, which is that your backups are clean. When an attacker has been in the environment for weeks, your recovery points may already be compromised, and restoring from them simply reinfects you.

This section walks the scenario where clean, tested, isolated backups and a rehearsed restore sequence are the difference between days and weeks of downtime.

Why ransomware recovery differs from traditional DR

In a traditional DR event, you trust your last good backup and restore it. In a ransomware event, you cannot. Backups may be encrypted, deleted, or silently corrupted, and the malware may be sitting dormant in the very images you would restore. Recovery requires forensic confidence that systems are clean before you bring them back, which is why immutable, air-gapped, regularly-validated backups have become standard practice for cyber recovery.
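
The selection logic differs from traditional DR in one line: the newest backup is not automatically the right one. The sketch below assumes hypothetical `immutable` and `validated` flags on each restore point and an `earliest_compromise` timestamp supplied by forensics; none of these correspond to a specific backup product's API.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class RestorePoint:
    taken_at: datetime
    immutable: bool   # stored where an attacker could not alter it
    validated: bool   # passed a restore-and-scan check

def latest_clean_point(points, earliest_compromise):
    """Newest restore point that predates the intrusion and is trustworthy."""
    candidates = [
        p for p in points
        if p.taken_at < earliest_compromise and p.immutable and p.validated
    ]
    return max(candidates, key=lambda p: p.taken_at, default=None)

# Hypothetical weekly backups; forensics dates the intrusion to 12 March.
points = [
    RestorePoint(datetime(2024, 3, 1), immutable=True, validated=True),
    RestorePoint(datetime(2024, 3, 8), immutable=True, validated=False),
    RestorePoint(datetime(2024, 3, 15), immutable=True, validated=True),
    RestorePoint(datetime(2024, 3, 22), immutable=False, validated=True),
]
chosen = latest_clean_point(points, earliest_compromise=datetime(2024, 3, 12))
print(chosen.taken_at.date())  # prints: 2024-03-01
```

In this example the unvalidated 8 March backup costs a full extra week of data, which is the practical price of skipping validation.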

The Maersk NotPetya rebuild is the textbook case for what good and lucky look like at once: a full IT infrastructure rebuild across 130 countries, enabled by one surviving domain controller. The 2024 Change Healthcare ransomware incident showed the modern version: a BlackCat/ALPHV intrusion through a Citrix portal without MFA led to weeks of pharmacy and claims-processing disruption and exposed the health data of a substantial share of the US population, despite a reported ransom payment. The practitioner takeaway is that recovery speed in a cyber event is governed by how clean and rehearsed your restore path is, not by how recent your backup is. Organizations that lean on extensive automation in detection and response measurably reduce their breach costs, with mature automation saving nearly $1.9 million on average per breach, because they shorten the window between compromise and clean recovery.

Regulatory and audit expectations for disaster recovery

Regulators have shifted from accepting documented intent to demanding evidence of tested recovery. The plan on the shelf is no longer sufficient. Across the major regimes, the burden of proof has moved onto the firm to show that recovery works, repeatedly, with artifacts.

The regimes below shape DR obligations for financial and critical-infrastructure firms in particular, but their direction of travel is industry-wide.

DORA and EU financial-sector requirements

The EU Digital Operational Resilience Act applies to financial entities and their critical ICT providers, and it is specific about recovery. DORA Article 11 mandates ICT business continuity policies and response-and-recovery plans. Article 12 sets explicit backup, restoration, and recovery requirements. Article 24 requires testing of critical systems, building toward threat-led penetration testing for the most significant firms.

DORA entered application on 17 January 2025, so this is live obligation, not future planning. The detail of meeting it lives in the DORA compliance guide and the practical DORA compliance checklist.

UK operational resilience and APRA CPS 230

The UK regime works through impact tolerances rather than prescriptive recovery targets. PRA SS1/21 requires firms to identify important business services, set impact tolerances for them, map the resources behind them, and test against severe-but-plausible scenarios. The FCA's operational resilience rules under SYSC 15A mirror this, with the transition period having ended on 31 March 2025.

Australia's APRA CPS 230 strengthened business continuity and third-party risk requirements for banks, insurers, and superannuation trustees, and commenced on 1 July 2025. Third-party concentration is a recurring theme across all three regimes: paragraph 51 of CPS 230 requires a material service provider register precisely because regulators learned from events like CrowdStrike that shared dependencies are a systemic recovery risk.

ISO 22301 as the certifiable baseline

Where DORA and the UK regime are jurisdiction-specific, ISO 22301:2019 provides a certifiable, internationally recognised structure that disaster recovery plugs into. Clause 8 requires the business impact analysis, recovery strategies, plan development, and validated testing as a connected operational loop.

Certification signals to auditors and customers that recovery capability has been independently assessed, not merely asserted.

Building a defensible DR program: a methodology

A defensible DR program is reproducible. The same inputs and analytic steps produce the same artifacts and decisions every cycle, so that when audit or the board asks how a recovery target was set, the answer is evidence rather than assertion. This section structures that methodology as inputs, analytic steps, artifacts, and decisions.

The value of structuring it this way is that each stage produces something an examiner can inspect, which is exactly what the regulatory regimes above now demand.

Inputs and analytic steps

The inputs are the business impact analysis findings, current dependency maps, the risk assessment, and a defensible service-criticality ranking. Miss any of these and the targets downstream are guesses.

The analytic steps draw on the NIST seven-step contingency planning process, condensed here to the five that produce inspectable outputs:

  1. Rank services by impact-over-time using the business impact analysis.
  2. Set RTO and RPO for each service from that ranking, not from IT preference.
  3. Select a recovery strategy per criticality tier (cold, warm, hot, active-active).
  4. Validate that each chosen strategy can actually hit its target by testing it.
  5. Record the actual-versus-target gap and feed it into remediation.

Step four is where most programs break. A strategy that looks adequate on an architecture diagram routinely misses its RTO in a real parallel test, because nobody accounted for the time to validate data integrity or to re-establish a dependency.

Artifacts and decisions

The artifacts are the things that survive scrutiny: DR plans, runbooks, test reports with actual recovery times, remediation logs showing gaps closing, and an evidence pack that ties it all back to the business impact analysis. The DRII Professional Practices frame this artifact discipline as core competency, not paperwork.

The decisions are equally explicit: invocation thresholds (who declares a disaster, on what trigger), how much to spend per recovery tier, and which residual risks the organization formally accepts. Those decisions need named owners and, for material services, board-level accountability. A residual risk that no one signed off on is a residual risk that lands on the CRO at the post-incident review.

Industry-specific DR nuances

The same DR principles produce very different architectures and tolerances across sectors. A near-zero RPO that is non-negotiable in payments would be expensive over-engineering in a back-office reporting context. Constraints, not just principles, drive the design.

Financial services and manufacturing illustrate the spread particularly well.

Financial services

Financial services carries the tightest recovery profiles and the heaviest regulatory load. Payments and trading systems often demand near-zero RPO and sub-hour RTO, because data loss means failed settlement and downtime means regulatory and reputational damage within minutes. DORA Article 12 makes backup and recovery a supervised obligation rather than an internal preference.

Third-party concentration is a board-level DR concern here. When a single cloud region or market-data provider underpins many critical services, its failure becomes a systemic event, which is exactly the risk the post-CrowdStrike regulatory focus is aimed at.
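
A service provider register makes this kind of concentration visible with very little code. The sketch below assumes a hypothetical register mapping each critical service to the providers it relies on, and flags any provider shared by two or more services; the threshold is an illustrative choice.

```python
from collections import defaultdict

def concentration(register, threshold=2):
    """Providers that underpin at least `threshold` critical services."""
    by_provider = defaultdict(set)
    for service, providers in register.items():
        for provider in providers:
            by_provider[provider].add(service)
    return {
        provider: sorted(services)
        for provider, services in by_provider.items()
        if len(services) >= threshold
    }

# Hypothetical register extract.
register = {
    "payments": {"cloud-region-a", "market-data-x"},
    "trading": {"cloud-region-a", "market-data-x"},
    "customer-portal": {"cloud-region-a"},
    "hr-system": {"saas-hr"},
}
shared = concentration(register)
print(shared)
```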

Manufacturing and operational technology

Manufacturing changes the problem entirely, because operational technology and industrial control systems do not recover like IT. A safety interlock or a physical process state cannot simply be restored from a snapshot, and a careless restore can create a safety hazard rather than resolve an outage. Recovery has to account for the physical world the systems control.

Supply-chain dependency amplifies the blast radius. Allianz data indicates that supply-chain disruptions with global effects now occur roughly every 1.4 years and the trend is rising. The Maersk NotPetya event is the canonical manufacturing-adjacent lesson: when systems went down across 76 terminals, manual fallback procedures, clipboards and phone calls, were what bought time while the rebuild ran. Plants that keep credible manual-fallback procedures recover operationally faster than those that assume the systems will always be there.

The cost of downtime and the case for DR investment

DR investment is justified by the cost of not recovering fast enough, and that cost is measurable. The business case is rarely about the price of a hot site in the abstract. It is about the price of the outage the hot site prevents.
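
The comparison can be reduced to a few lines of arithmetic. Every figure in the sketch below is hypothetical, and the model is deliberately crude: it treats outage cost as linear in hours and ignores reputational and regulatory effects, which usually make the real case stronger.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss under a simple linear cost-per-hour model."""
    return outages_per_year * hours_per_outage * cost_per_hour

def net_benefit(cost_per_hour, outages_per_year,
                hours_without_dr, hours_with_dr, annual_dr_cost):
    """Avoided outage cost minus the standing cost of the DR capability."""
    avoided = expected_annual_outage_cost(
        outages_per_year, hours_without_dr - hours_with_dr, cost_per_hour
    )
    return avoided - annual_dr_cost

# Hypothetical inputs: one major outage every two years, 24 hours to recover
# from backups versus 1 hour with a hot site costing 2 million a year.
benefit = net_benefit(
    cost_per_hour=500_000,
    outages_per_year=0.5,
    hours_without_dr=24,
    hours_with_dr=1,
    annual_dr_cost=2_000_000,
)
print(benefit)  # prints: 3750000.0
```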

The numbers from recent events make the trade concrete.

What outages actually cost

A single vendor failure can produce systemic losses. The CrowdStrike outage is estimated to have caused roughly $5.4 billion in direct losses across the Fortune 500 alone, from one faulty content file. That is the downside the DR conversation is actually pricing against.

Breach costs, encouragingly, fall when recovery is fast and automation is mature. The global average cost of a data breach dropped to $4.44 million in 2025, down 9% from $4.88 million the year before, with much of the improvement attributed to faster detection and recovery. And the cost is not only financial. The BCI Horizon Scan 2025 found that 35.8% of disruptions negatively affect staff morale, wellbeing, and mental health, a cost that never shows up on the incident ledger but shapes how well the team performs in the next one.

How disaster recovery fits into enterprise resilience

Disaster recovery is one layer in a wider resilience program, and isolating it is how organizations end up recovering systems while the business still fails around them. DR feeds and is fed by continuity planning, and both sit inside the broader capability that an enterprise needs to absorb and adapt to disruption.

Wiring it together is what turns a set of plans into a capability.

Wiring DR into BCM and resilience governance

DR draws its targets from the business impact analysis and the continuity strategy required by ISO 22301 Clause 8.3, and it feeds test evidence back up into the wider continuity program. Treated as a silo, it optimises for system restoration while missing the cross-dependencies that make an IT incident become a customer-communication failure within an hour.

That cascading quality is the hard part. A modern disruption rarely stays in one lane; a vendor failure becomes a supply failure becomes a regulatory-notification failure, which is why cascading crises demand cascading plans that interlink rather than sit in separate binders. Keeping all of those plans tested, current, and aligned to what the business actually prioritises is the maintenance discipline that separates a real capability from a documented one. The connected approach, from the upstream enterprise resilience view down through disaster recovery planning and into building an IT disaster recovery plan, is what holds when the pager goes off at 3am.

© Fortiv 2026