KRI Series: Business Continuity & Disaster Recovery

The KRIs in this domain measure whether your BC/DR program is a functioning operational capability or a set of documents that provides comfort without assurance. The test is always the same: if your primary datacenter, cloud region, or core SaaS stack failed right now, would your recovery objectives be met? Most organizations answer "we believe so." Fewer have evidence.

If you are standing this up from scratch, start with how to build a KRI program and the consolidated KRI reference library, which maps every domain to one CIS-aligned catalog.

In this guide

KRI inventory
Deriving these KRIs by source type

Framework mapping

CIS Controls v8

The KRIs in this domain implement and measure these CIS Critical Security Controls:

CIS 11, Data Recovery. Backup coverage, integrity verification, and recovery-time validation.

Business continuity is broader than backups and maps more fully to ISO 22301.

KRI inventory

1. Business impact analysis currency

What to measure. Days since the Business Impact Analysis (BIA) was last completed and validated with business unit leadership, covering all critical business functions, their maximum tolerable downtime (MTD), and the upstream and downstream dependencies that affect recovery sequencing.

Why it matters. A BIA from before your last major architecture change, acquisition, or product launch is a BIA for an organization that no longer exists. RTOs and RPOs derived from a stale BIA are assumptions, not commitments. The BIA is the foundation of every BC/DR decision: if it's wrong, everything built on it is wrong.

Data sources.

BIA documentation: last completion date, business unit sign-offs, dependency maps
Change management: major system or architecture changes since the last BIA; each is a potential BIA invalidation event
Business unit calendar: BIA review as a recurring item in the annual planning cycle
System inventory: critical systems that do not appear in the BIA, which signal coverage gaps

Status	Criteria
Green	BIA completed within 18 months; major architecture changes trigger BIA update within 90 days; all critical business functions covered with validated MTD
Amber	18–36 months; or significant architecture changes since last BIA without update
Red	>36 months; or BIA never formally completed; or critical systems not in BIA scope
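The bands above can be turned into a small classifier. This is a minimal sketch, not a reference implementation: the function name and parameters are hypothetical, 18 and 36 months are approximated as 548 and 1,096 days, and `overdue_changes` is assumed to count major architecture changes that are more than 90 days old with no BIA update.

```python
from datetime import date
from typing import Optional


def bia_status(last_bia: Optional[date], today: date,
               overdue_changes: int = 0,
               unscoped_critical_systems: int = 0) -> str:
    """Return "green", "amber", or "red" for BIA currency."""
    if last_bia is None or unscoped_critical_systems > 0:
        return "red"  # never completed, or critical systems missing from scope
    age_days = (today - last_bia).days
    if age_days > 1096:  # assumption: roughly 36 months
        return "red"
    if age_days > 548 or overdue_changes > 0:  # assumption: roughly 18 months
        return "amber"
    return "green"
```

Feeding `overdue_changes` from your change management system is what keeps this honest: a recent BIA date alone will not catch an architecture change that invalidated it.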

2. RTO and RPO definition and validation coverage

What to measure. Percentage of critical systems and services with formally defined Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs), and percentage of those that have been validated through actual recovery testing, not just theoretical calculation.

Why it matters. Unvalidated RTOs are aspirations. Every organization has RTOs. Organizations that have tested them know how realistic they are, and the gap between designed RTO and tested RTO is almost always larger than expected. Regulators, insurers, and auditors increasingly distinguish between documented and validated.

Data sources.

BIA and DR documentation: RTO/RPO per critical system
Recovery test records: test date, tested RTO vs. designed RTO, gaps identified
Backup platform: RPO validation (last confirmed recovery point vs. stated RPO)
Cloud DR configuration: Route 53 failover timing, Azure Traffic Manager failover, and GCP load balancer health check intervals, which together set the design-time RTO for cloud architectures

How to calculate.

Definition coverage: (Critical systems with documented RTO and RPO) ÷ (total critical systems) × 100
Validation coverage: (Critical systems with RTO validated through test within 12 months) ÷ (critical systems with documented RTO) × 100

Status	Criteria
Green	≥95% definition coverage; >80% validation coverage; tested RTO within 20% of designed RTO
Amber	80–94% definition; or validation coverage of 80% or less; or tested RTO exceeds designed RTO by more than 20% without remediation
Red	<80% definition; or no validation testing; or critical systems whose tested RTO is more than double their designed RTO
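The two coverage ratios are easy to get subtly wrong because they use different denominators. A minimal sketch, assuming a hypothetical `CriticalSystem` record exported from your inventory (the field names are illustrative, not from any specific tool):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriticalSystem:
    name: str
    has_rto_rpo: bool                       # RTO and RPO formally documented
    validated_last_12_months: bool = False  # RTO confirmed by a recovery test


def definition_coverage(systems: List[CriticalSystem]) -> float:
    """Documented systems as a percentage of all critical systems."""
    if not systems:
        raise ValueError("no critical systems in inventory")
    return 100.0 * sum(s.has_rto_rpo for s in systems) / len(systems)


def validation_coverage(systems: List[CriticalSystem]) -> float:
    """Validated systems as a percentage of documented systems only."""
    documented = [s for s in systems if s.has_rto_rpo]
    if not documented:
        return 0.0
    return 100.0 * sum(s.validated_last_12_months for s in documented) / len(documented)


def rto_overrun_ratio(designed_minutes: float, tested_minutes: float) -> float:
    """1.2 means the tested RTO ran 20% over design; 2.0 means double."""
    return tested_minutes / designed_minutes
```

Note that validation coverage is measured against documented systems, so a low definition coverage can hide behind a healthy-looking validation number. Report both together.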

3. Recovery test cadence and success rate

What to measure. Frequency of DR recovery tests (full failover tests, partial recovery tests, and backup restoration tests) and the percentage of tests that meet their stated recovery objectives without requiring unplanned intervention.

Why it matters. Recovery that fails during a test is recoverable. Recovery that fails during an actual incident is catastrophic. Test cadence measures whether you're building recovery muscle. Success rate measures whether your recovery designs actually work at the speed and completeness you've designed them for.

Data sources.

DR test records: test date, scope (full/partial), tested system, outcome, issues identified
Backup platform: restoration test logs, success/failure, time to restore, data completeness verification
Cloud failover records: failover exercise logs from Route 53, Azure Traffic Manager, GCP Global Load Balancer
Change management: DR tests scheduled after major architecture changes

How to calculate.

Test cadence: months since last full DR test; months since last partial/component test
Success rate: (Tests meeting RTO/RPO without unplanned intervention) ÷ (total tests in last 12 months) × 100

Status	Criteria
Green	Full DR test within 12 months; partial/component tests quarterly; success rate ≥85%; failures produce action items tracked to completion
Amber	Full test 12–24 months; or success rate 70–84%; or test failures without action items
Red	No DR testing in >24 months; or success rate <70%; or tests canceled due to operational risk without rescheduling
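Both numbers can be derived from a flat list of test records. A minimal sketch, assuming a hypothetical `DrTest` record; cadence is returned in days rather than months to avoid calendar arithmetic, and a test counts as successful only if it met its objectives with no unplanned intervention:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DrTest:
    run_on: date
    scope: str                    # "full" or "partial"
    met_objectives: bool          # tested RTO/RPO met
    unplanned_intervention: bool  # someone had to improvise to finish


def success_rate(tests: List[DrTest], today: date,
                 window_days: int = 365) -> Optional[float]:
    recent = [t for t in tests if 0 <= (today - t.run_on).days <= window_days]
    if not recent:
        return None  # no tests in the window: report that, not 0%
    clean = sum(t.met_objectives and not t.unplanned_intervention for t in recent)
    return 100.0 * clean / len(recent)


def days_since_last(tests: List[DrTest], scope: str, today: date) -> Optional[int]:
    dates = [t.run_on for t in tests if t.scope == scope and t.run_on <= today]
    return (today - max(dates)).days if dates else None
```

Returning `None` when there are no tests is deliberate: an empty test log should surface as a missing signal, not as a success rate.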

4. Backup architecture integrity

What to measure. Whether critical data backups are protected against ransomware through immutability, air gap, or offsite isolation, and whether that protection has been validated through restoration testing rather than assumed from architecture documentation.

Why it matters. Ransomware groups target backup infrastructure specifically because destroying backups maximizes negotiating leverage. Organizations that believe their backups are air-gapped but haven't tested the air-gap recently may be wrong. Immutable storage that was configured but never validated may have been inadvertently disabled by a configuration change. This KRI is one of the most heavily weighted in cyber insurance underwriting.

Data sources.

Backup platform (Veeam, Commvault, Rubrik, Cohesity, Zerto): immutability settings, air-gap configuration, offsite replication status
Cloud backup: AWS Backup Vault Lock (WORM), Azure Blob immutable storage, GCP storage retention policies
Restoration test records: test date, restoration success, data completeness, time to restore
Infrastructure records: backup storage connectivity to the production network; a backup target that connects nightly for sync is not air-gapped during the sync window

KRI values.

Immutability coverage: (Critical data stores with backup in immutable or air-gapped storage) ÷ (total critical data stores) × 100
Restoration test currency: last successful restoration test date per critical system
Backup-to-production isolation: confirmation that backup systems cannot be encrypted by ransomware reaching production

Status	Criteria
Green	≥95% of critical data with immutable or verified air-gapped backup; restoration tested within 90 days; backup storage not reachable from production network during operation
Amber	80–94% immutability coverage; or restoration testing 90–180 days ago; or air-gap implementation relies on process rather than technical enforcement
Red	<80%; or no immutability; or backup storage reachable from production (same ransomware blast radius); or no restoration testing
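Because this KRI combines three separate values, it helps to make the worst-signal-wins logic explicit. A minimal sketch with hypothetical parameter names; it treats exactly 95% coverage as green and treats a restoration test older than 180 days as red, both of which are assumptions of this sketch:

```python
from typing import Optional


def backup_integrity_status(immutable_pct: float,
                            days_since_restore_test: Optional[int],
                            reachable_from_production: bool,
                            technically_enforced: bool = True) -> str:
    """Worst signal wins: any red condition overrides the others."""
    if (immutable_pct < 80
            or reachable_from_production
            or days_since_restore_test is None   # never tested
            or days_since_restore_test > 180):   # assumption: stale counts as red
        return "red"
    if (immutable_pct < 95
            or days_since_restore_test > 90
            or not technically_enforced):        # air gap relies on process only
        return "amber"
    return "green"
```

The `reachable_from_production` flag should come from network evidence (firewall rules, connectivity logs), not from the architecture diagram.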

5. BC plan activation capability

What to measure. Whether the organization can activate its business continuity plans (alternate work sites, manual fallback procedures, emergency communication systems) without depending on the infrastructure that a major incident might have taken offline.

Why it matters. BC plans that are stored on the production document management system cannot be accessed if that system is down. Communication trees that rely on corporate email cannot be used if email is part of the incident. BC activation capability measures whether your resilience plans are available when you need them, independently of the systems they're designed to substitute for.

Data sources.

BC plan storage: is the plan accessible independently of production systems? (offline copy, cloud storage outside primary tenant, printed copies at alternate sites)
Emergency communication system: out-of-band communication capability (mass notification system, personal mobile contact list, satellite communication for critical sites)
Alternate work site: documented and tested capability, including remote work infrastructure, alternate office arrangements, and a crisis communication platform
Manual fallback procedures: documented procedures for critical functions that can operate without primary systems

KRI values.

Offline plan access: BC plans accessible without production systems (yes/no)
Out-of-band communication tested: emergency notification system tested within 12 months (yes/no)
Alternate work capability tested: remote work or alternate site capability validated within 12 months (yes/no)

Status	Criteria
Green	All three capabilities present and tested; plan access independent of production systems; emergency communication tested annually
Amber	Plans accessible offline but emergency communication untested; or alternate work capability assumed but not tested
Red	BC plans only accessible through production systems; or no out-of-band communication capability; or no alternate work site or remote capability planning

6. Supply chain and critical vendor recovery dependency

What to measure. Percentage of critical recovery procedures with explicit vendor dependencies for which the vendor has provided recovery capability documentation or SLAs, and whether those commitments are consistent with your recovery objectives.

Why it matters. Recovery plans that depend on vendor response often fail at the vendor dependency. If your recovery plan assumes your cloud provider will restore service in 4 hours, but the provider's SLA is 99.9% uptime with undefined recovery time, there's a gap. If your backup provider is recovering their own infrastructure during the same disaster event, your recovery timeline may extend indefinitely.

Data sources.

DR documentation: vendor dependency mapping per recovery procedure
Vendor SLAs: contractual recovery commitments from critical vendors
Vendor DR capability: vendor DR documentation or attestation (SOC 2 CC9.1, availability commitments)
Concentration analysis: recovery plans that depend on the same cloud region or vendor that experienced the incident

How to calculate. (Critical recovery procedures with vendor dependencies and documented vendor recovery SLA) ÷ (total critical recovery procedures with vendor dependencies) × 100

Status	Criteria
Green	≥90% of vendor-dependent recovery procedures with documented vendor SLA; vendor SLAs reviewed at contract renewal; concentration risk identified and mitigation planned
Amber	70–89%; or vendor SLAs not reviewed in last contract cycle; or known concentration risk without mitigation
Red	<70%; or recovery plans with vendor dependencies and no vendor SLA; or recovery plan assumes vendor availability during the same event that triggered recovery
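The coverage ratio tells you how many procedures have a vendor SLA on file; it does not tell you whether that SLA fits your RTO. A minimal sketch that computes both, assuming hypothetical record types in which a missing recovery commitment is stored as `None`:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VendorDependency:
    vendor: str
    sla_recovery_hours: Optional[float]  # None: contract defines no recovery time


@dataclass
class RecoveryProcedure:
    name: str
    rto_hours: float
    vendors: List[VendorDependency] = field(default_factory=list)


def vendor_sla_coverage(procedures: List[RecoveryProcedure]) -> Optional[float]:
    dependent = [p for p in procedures if p.vendors]
    if not dependent:
        return None  # nothing depends on a vendor
    covered = sum(all(v.sla_recovery_hours is not None for v in p.vendors)
                  for p in dependent)
    return 100.0 * covered / len(dependent)


def sla_gaps(procedures: List[RecoveryProcedure]) -> List[Tuple[str, str, str]]:
    gaps = []
    for p in procedures:
        for v in p.vendors:
            if v.sla_recovery_hours is None:
                gaps.append((p.name, v.vendor, "no recovery commitment"))
            elif v.sla_recovery_hours > p.rto_hours:
                gaps.append((p.name, v.vendor, "vendor recovery slower than RTO"))
    return gaps
```

The second function catches the case described above: an uptime percentage on paper, but no recovery time that your own RTO can rely on.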

Deriving these KRIs by source type

From Backup Platforms (Veeam, Rubrik, Cohesity, Commvault)

Backup job completion: Dashboard or API showing backup job success rate, last completion time, and data size; calculate the RPO gap for any missed jobs
Immutability status: Rubrik immutable snapshots API; Veeam backup copy with immutability settings; Cohesity DataLock
Restoration test logs: Automated restore verification (Veeam SureBackup, or the equivalent recovery-validation features in Rubrik and Cohesity DataProtect); export test results with dates and success flags
Air-gap verification: Network connectivity logs showing when backup storage is connected vs. isolated
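The RPO gap for missed jobs falls out of the list of successful completion times. A minimal sketch, assuming you have already exported those timestamps from the backup platform (the function names are hypothetical, and no vendor API is called here):

```python
from datetime import datetime, timedelta
from typing import List


def worst_recovery_point_gap(successes: List[datetime], now: datetime) -> timedelta:
    """Longest stretch without a successful backup, including time since the last one."""
    if not successes:
        raise ValueError("no successful backups recorded")
    points = sorted(successes) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def breaches_rpo(successes: List[datetime], now: datetime, rpo: timedelta) -> bool:
    return worst_recovery_point_gap(successes, now) > rpo
```

A single missed nightly job doubles the gap to 48 hours, which is why the success rate on its own understates exposure for systems with a 24-hour RPO.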

From Cloud Providers (AWS, Azure, GCP)

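AWS Backup Vault Lock: aws backup describe-backup-vault --backup-vault-name <vault-name> → check the Locked, MinRetentionDays, MaxRetentionDays, and LockDate fields; the WORM lock becomes irrevocable (compliance mode) only once LockDate has passed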
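RTO validation (Route 53): Failover routing policy health check configuration; test by forcing the primary endpoint's health check to fail (for example, by inverting the health check status, since a disabled health check is treated as healthy) and measuring failover time in a test environment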
Azure Site Recovery: Replication health report; recovery plan test execution records with testDate and testStatus fields via ASR API
GCP Cloud Storage retention: run gsutil retention get gs://bucket-name, then verify that the retention policy prevents deletion
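For AWS Backup, the vault description is enough to tell a lock that can still be removed from one that cannot. A minimal sketch, assuming a dict shaped like the response of boto3's `describe_backup_vault` call, with `Locked` as a boolean and `LockDate` as a timezone-aware datetime; the returned state names are labels chosen for this sketch:

```python
from datetime import datetime, timezone


def vault_lock_state(vault: dict, now: datetime) -> str:
    """Classify a vault description by how strongly its lock is enforced."""
    if not vault.get("Locked", False):
        return "unlocked"
    lock_date = vault.get("LockDate")
    if lock_date is None:
        # Locked without a lock date: the lock can still be changed or removed.
        return "governance"
    if now < lock_date:
        # Lock date set but not reached: still inside the grace period.
        return "compliance-pending"
    return "compliance"  # lock configuration can no longer be changed or deleted
```

Only the last state supports a claim of WORM protection against an attacker who holds administrative credentials; the others describe a lock that an administrator can still undo.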

From Business Units

BIA validation: Interview or survey process in which business unit leaders confirm MTD and dependencies annually; changes in their responses signal BIA staleness
Manual fallback testing: Exercises where business units operate without primary systems for a defined period, which validates the manual procedure documentation
Communication tree test: Annual out-of-band communication drill that measures reach percentage and latency
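Reach and latency for a communication drill reduce to two numbers. A minimal sketch, assuming the drill export gives one entry per contact: the seconds taken to acknowledge, or `None` if the contact never responded. Using the median for latency is a choice made in this sketch, not something the drill tooling prescribes.

```python
from statistics import median
from typing import List, Optional, Tuple


def drill_metrics(ack_seconds: List[Optional[float]]) -> Tuple[float, Optional[float]]:
    """Return (reach percentage, median seconds to acknowledge)."""
    if not ack_seconds:
        raise ValueError("drill had no recipients")
    acked = [s for s in ack_seconds if s is not None]
    reach_pct = 100.0 * len(acked) / len(ack_seconds)
    return reach_pct, (median(acked) if acked else None)
```

The median is used so that one very slow responder does not mask an otherwise fast tree; track the slowest acknowledgements separately if the stragglers are the people who must make decisions.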

From IT Infrastructure

Failover timing tests: DNS TTL validation; database failover timing (PostgreSQL replication lag, SQL Server Always On failover time); load balancer health check intervals. All of these affect actual RTO
Recovery sequence validation: Systems with interdependencies must be recovered in the correct order; test that the recovery sequencing documentation matches the actual dependency graph
Cross-region replication lag: Cloud database replication lag (AWS Aurora Global Database lag; Azure SQL Geo-Replication lag), which directly determines achieved RPO
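Recovery sequence validation can be automated once the dependency graph and the documented order are both in machine-readable form. A minimal sketch, assuming a hypothetical mapping of each system to the systems it depends on and a runbook order given as a list of system names:

```python
from typing import Dict, List, Tuple


def sequence_violations(depends_on: Dict[str, List[str]],
                        documented_order: List[str]) -> List[Tuple[str, str, str]]:
    """List every dependency that the documented recovery order fails to respect."""
    position = {name: i for i, name in enumerate(documented_order)}
    violations = []
    for system, deps in depends_on.items():
        for dep in deps:
            if system not in position or dep not in position:
                violations.append((system, dep, "missing from runbook"))
            elif position[dep] > position[system]:
                violations.append((system, dep, "dependency recovered too late"))
    return violations
```

An empty result means the runbook order is consistent with the graph you supplied. It says nothing about whether the graph itself is current, which is why the dependency data should come from the same source as the BIA.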

Draxis turns these KRIs into a live signal

Draxis connects to the tools you already run (backup platforms, DR orchestration, and cloud recovery tooling) and computes these BC/DR KRIs automatically, with the green/amber/red bands, trend lines, and drift alerts described above. No spreadsheets, no manual stitching.
