The KRIs in this domain measure whether your BC/DR program is a functioning operational capability or a set of documents that provides comfort without assurance. The test is always the same: if your primary datacenter, cloud region, or core SaaS stack failed right now, would your recovery objectives be met? Most organizations answer "we believe so." Fewer have evidence.
If you are standing this up from scratch, start with how to build a KRI program and the consolidated KRI reference library, which maps every domain to one CIS-aligned catalog.
KRI inventory
1. Business impact analysis currency
What to measure. Days since the Business Impact Analysis (BIA) was last completed and validated with business unit leadership, covering all critical business functions, their maximum tolerable downtime (MTD), and the upstream and downstream dependencies that affect recovery sequencing.
Why it matters. A BIA from before your last major architecture change, acquisition, or product launch is a BIA for an organization that no longer exists. RTOs and RPOs derived from a stale BIA are assumptions, not commitments. The BIA is the foundation of every BC/DR decision: if it's wrong, everything built on it is wrong.
- BIA documentation: last completion date, business unit sign-offs, dependency maps
- Change management: major system or architecture changes since the last BIA; each one is a potential BIA invalidation event
- Business unit calendar: BIA review as a recurring item in the annual planning cycle
- System inventory: critical systems that do not appear in the BIA, which signal coverage gaps
| Status | Criteria |
|---|---|
| Green | BIA completed within 18 months; major architecture changes trigger BIA update within 90 days; all critical business functions covered with validated MTD |
| Amber | 18–36 months; or significant architecture changes since last BIA without update |
| Red | >36 months; or BIA never formally completed; or critical systems not in BIA scope |
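The banding above can be sketched as a small classifier. This is a minimal illustration rather than a prescribed implementation: the function name `bia_status`, its parameters, and the day-count conversions for 18 and 36 months are all assumptions made for the example.

```python
from datetime import date
from typing import Optional

GREEN_MAX_DAYS = 548    # roughly 18 months (assumed conversion)
AMBER_MAX_DAYS = 1096   # roughly 36 months (assumed conversion)
CHANGE_GRACE_DAYS = 90  # window for updating the BIA after a major change


def bia_status(
    today: date,
    last_bia: Optional[date],
    oldest_unreviewed_change: Optional[date] = None,
    critical_systems_missing: int = 0,
) -> str:
    """Classify BIA currency as 'green', 'amber', or 'red'."""
    # Red: BIA never completed, or critical systems outside BIA scope.
    if last_bia is None or critical_systems_missing > 0:
        return "red"
    age_days = (today - last_bia).days
    if age_days > AMBER_MAX_DAYS:
        return "red"
    # Amber: BIA is ageing, or a major change has gone more than 90 days
    # without a BIA update.
    change_overdue = (
        oldest_unreviewed_change is not None
        and (today - oldest_unreviewed_change).days > CHANGE_GRACE_DAYS
    )
    if age_days > GREEN_MAX_DAYS or change_overdue:
        return "amber"
    return "green"


print(bia_status(date(2025, 6, 1), date(2024, 9, 1)))  # green
```

Passing the date of the oldest architecture change not yet reflected in the BIA lets the same function catch the "recent BIA, but already invalidated" case.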
2. RTO and RPO definition and validation coverage
What to measure. Percentage of critical systems and services with formally defined Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs), and percentage of those that have been validated through actual recovery testing, not just theoretical calculation.
Why it matters. Unvalidated RTOs are aspirations. Every organization has RTOs. Organizations that have tested them know how realistic they are, and the gap between designed RTO and tested RTO is almost always larger than expected. Regulators, insurers, and auditors increasingly distinguish between documented and validated.
- BIA and DR documentation: RTO/RPO per critical system
- Recovery test records: test date, tested RTO vs. designed RTO, gaps identified
- Backup platform: RPO validation (last confirmed recovery point vs. stated RPO)
- Cloud DR configuration: Route 53 failover timing, Azure Traffic Manager failover, and GCP load balancer health check intervals, which together set the design-time RTO for cloud architectures
How to calculate.
- Definition coverage: (Critical systems with documented RTO and RPO) ÷ (total critical systems) × 100
- Validation coverage: (Critical systems with RTO validated through test within 12 months) ÷ (critical systems with documented RTO) × 100
| Status | Criteria |
|---|---|
| Green | ≥95% definition coverage; >80% validation coverage; tested RTO within 20% of designed RTO |
| Amber | 80–94% definition coverage; or <60% validation coverage; or tested RTO exceeds designed RTO by more than 20% without remediation |
| Red | <80% definition; or no validation testing; or critical systems whose tested RTO is more than double their designed RTO |
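The two coverage formulas can be computed from a simple inventory. The sketch below assumes a hypothetical `CriticalSystem` record with one flag per condition; the field names and the sample systems are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CriticalSystem:
    name: str
    has_rto_and_rpo: bool               # RTO and RPO formally documented
    rto_validated_last_12_months: bool  # RTO confirmed by an actual recovery test


def rto_rpo_coverage(systems: List[CriticalSystem]) -> Tuple[float, float]:
    """Return (definition coverage %, validation coverage %)."""
    defined = [s for s in systems if s.has_rto_and_rpo]
    validated = [s for s in defined if s.rto_validated_last_12_months]
    definition_pct = 100.0 * len(defined) / len(systems) if systems else 0.0
    validation_pct = 100.0 * len(validated) / len(defined) if defined else 0.0
    return definition_pct, validation_pct


def rto_overrun_pct(designed_hours: float, tested_hours: float) -> float:
    """How far the tested RTO exceeds the designed RTO, as a percentage."""
    return 100.0 * (tested_hours - designed_hours) / designed_hours


systems = [
    CriticalSystem("billing", True, True),
    CriticalSystem("orders", True, False),
    CriticalSystem("identity", True, True),
    CriticalSystem("reporting", False, False),
]
definition_pct, validation_pct = rto_rpo_coverage(systems)
print(f"definition {definition_pct:.1f}%, validation {validation_pct:.1f}%")  # 75.0%, 66.7%
print(f"overrun {rto_overrun_pct(4.0, 5.0):.0f}%")  # 25%
```

Note that validation coverage uses systems with a documented RTO as its denominator, so poor definition coverage can hide behind a healthy-looking validation figure; report both numbers together.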
3. Recovery test cadence and success rate
What to measure. Frequency of DR recovery tests (full failover tests, partial recovery tests, and backup restoration tests) and the percentage of tests that meet their stated recovery objectives without requiring unplanned intervention.
Why it matters. Recovery that fails during a test is recoverable. Recovery that fails during an actual incident is catastrophic. Test cadence measures whether you're building recovery muscle. Success rate measures whether your recovery designs actually work at the speed and completeness you've designed them for.
- DR test records: test date, scope (full/partial), tested system, outcome, issues identified
- Backup platform: restoration test logs, success/failure, time to restore, data completeness verification
- Cloud failover records: failover exercise logs from Route 53, Azure Traffic Manager, GCP Global Load Balancer
- Change management: DR tests scheduled after major architecture changes
How to calculate.
- Test cadence: months since last full DR test; months since last partial/component test
- Success rate: (Tests meeting RTO/RPO without unplanned intervention) ÷ (total tests in last 12 months) × 100
| Status | Criteria |
|---|---|
| Green | Full DR test within 12 months; partial/component tests quarterly; success rate ≥85%; failures produce action items tracked to completion |
| Amber | Full test 12–24 months; or success rate 70–84%; or test failures without action items |
| Red | No DR testing in >24 months; or success rate <70%; or tests canceled due to operational risk without rescheduling |
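Cadence and success rate can both be derived from the same list of test records. The following is a minimal sketch; the `DrTest` record, its field names, and the 365-day window standing in for "last 12 months" are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DrTest:
    run_on: date
    scope: str                    # "full" or "partial"
    met_objectives: bool          # RTO and RPO both met
    unplanned_intervention: bool  # someone had to improvise to finish


def success_rate(tests: List[DrTest], today: date) -> Optional[float]:
    """Percentage of tests in the last 365 days that met RTO/RPO unaided."""
    recent = [t for t in tests if 0 <= (today - t.run_on).days <= 365]
    if not recent:
        return None  # no tests in the window: report the gap, not 0%
    passed = sum(1 for t in recent if t.met_objectives and not t.unplanned_intervention)
    return 100.0 * passed / len(recent)


def days_since_last(tests: List[DrTest], scope: str, today: date) -> Optional[int]:
    """Days since the most recent test of the given scope, or None if never run."""
    dates = [t.run_on for t in tests if t.scope == scope and t.run_on <= today]
    return (today - max(dates)).days if dates else None


today = date(2025, 6, 30)
tests = [
    DrTest(date(2025, 1, 15), "full", True, False),
    DrTest(date(2025, 4, 1), "partial", True, True),    # met RTO, but only with improvisation
    DrTest(date(2025, 6, 1), "partial", True, False),
    DrTest(date(2023, 1, 1), "partial", False, False),  # outside the 12-month window
]
print(success_rate(tests, today))
print(days_since_last(tests, "full", today))
```

A test that met its RTO only because an engineer improvised counts as a failure here, which matches the "without unplanned intervention" wording of the KRI.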
4. Backup architecture integrity
What to measure. Whether critical data backups are protected against ransomware through immutability, air-gap, or offsite isolation, validated through restoration testing, not assumed from architecture documentation.
Why it matters. Ransomware groups target backup infrastructure specifically because destroying backups maximizes negotiating leverage. Organizations that believe their backups are air-gapped but haven't tested the air-gap recently may be wrong. Immutable storage that was configured but never validated may have been inadvertently disabled by a configuration change. This KRI is one of the most heavily weighted in cyber insurance underwriting.
- Backup platform (Veeam, Commvault, Rubrik, Cohesity, Zerto): immutability settings, air-gap configuration, offsite replication status
- Cloud backup: AWS Backup Vault Lock (WORM), Azure Blob immutable storage, GCP storage retention policies
- Restoration test records: test date, restoration success, data completeness, time to restore
- Infrastructure records: backup storage connectivity to the production network; a backup target that connects nightly for sync is not air-gapped during the sync window
KRI values.
- Immutability coverage: (Critical data stores with backup in immutable or air-gapped storage) ÷ (total critical data stores) × 100
- Restoration test currency: last successful restoration test date per critical system
- Backup-to-production isolation: confirmation that backup systems cannot be encrypted by ransomware reaching production
| Status | Criteria |
|---|---|
| Green | ≥95% of critical data with immutable or verified air-gapped backup; restoration tested within 90 days; backup storage not reachable from production network during operation |
| Amber | 80–94% immutability coverage; or restoration testing 90–180 days ago; or air-gap implementation relies on process rather than technical enforcement |
| Red | <80%; or no immutability; or backup storage reachable from production (same ransomware blast radius); or no restoration testing |
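The first two KRI values reduce to a coverage ratio and a staleness check. This sketch assumes a hypothetical `DataStore` record populated from your backup platform's exports; the names and sample data are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DataStore:
    name: str
    immutable_or_air_gapped: bool
    last_successful_restore: Optional[date]  # None if never restore-tested


def immutability_coverage(stores: List[DataStore]) -> float:
    """Percentage of critical data stores with immutable or air-gapped backup."""
    if not stores:
        return 0.0
    protected = sum(1 for s in stores if s.immutable_or_air_gapped)
    return 100.0 * protected / len(stores)


def stale_restore_tests(stores: List[DataStore], today: date, max_age_days: int = 90) -> List[str]:
    """Names of stores with no successful restore test inside the window."""
    return [
        s.name
        for s in stores
        if s.last_successful_restore is None
        or (today - s.last_successful_restore).days > max_age_days
    ]


today = date(2025, 6, 1)
stores = [
    DataStore("customer-db", True, date(2025, 5, 1)),
    DataStore("file-shares", True, date(2024, 12, 1)),
    DataStore("analytics", False, None),
]
print(f"{immutability_coverage(stores):.1f}%")  # 66.7%
print(stale_restore_tests(stores, today))       # ['file-shares', 'analytics']
```

The isolation check (whether backup storage is reachable from production) is deliberately left out: it needs network evidence, not a flag in an inventory.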
5. BC plan activation capability
What to measure. Whether the organization can activate its business continuity plans (alternate work sites, manual fallback procedures, emergency communication systems) without depending on the infrastructure that a major incident might have taken offline.
Why it matters. BC plans that are stored on the production document management system cannot be accessed if that system is down. Communication trees that rely on corporate email cannot be used if email is part of the incident. BC activation capability measures whether your resilience plans are available when you need them, independently of the systems they're designed to substitute for.
- BC plan storage: is the plan accessible independently of production systems? (offline copy, cloud storage outside primary tenant, printed copies at alternate sites)
- Emergency communication system: out-of-band communication capability (mass notification system, personal mobile contact list, satellite communication for critical sites)
- Alternate work site: documented and tested capability (remote work infrastructure, alternate office arrangements, crisis communication platform)
- Manual fallback procedures: documented procedures for critical functions that can operate without primary systems
KRI values.
- Offline plan access: BC plans accessible without production systems (yes/no)
- Out-of-band communication tested: emergency notification system tested within 12 months (yes/no)
- Alternate work capability tested: remote work or alternate site capability validated within 12 months (yes/no)
| Status | Criteria |
|---|---|
| Green | All three capabilities present and tested; plan access independent of production systems; emergency communication tested annually |
| Amber | Plans accessible offline but emergency communication untested; or alternate work capability assumed but not tested |
| Red | BC plans only accessible through production systems; or no out-of-band communication capability; or no alternate work site or remote capability planning |
6. Supply chain and critical vendor recovery dependency
What to measure. Percentage of documented recovery plans that have explicit dependencies on vendor availability, and whether those vendors have provided their own recovery capability documentation or SLAs that are consistent with your recovery objectives.
Why it matters. Recovery plans that depend on vendor response often fail at the vendor dependency. If your recovery plan assumes your cloud provider will restore service in 4 hours, but the provider's SLA is 99.9% uptime with undefined recovery time, there's a gap. If your backup provider is recovering their own infrastructure during the same disaster event, your recovery timeline may extend indefinitely.
- DR documentation: vendor dependency mapping per recovery procedure
- Vendor SLAs: contractual recovery commitments from critical vendors
- Vendor DR capability: vendor DR documentation or attestation (SOC 2 CC9.1, availability commitments)
- Concentration analysis: recovery plans that depend on the same cloud region or vendor that experienced the incident
How to calculate. (Critical recovery procedures with vendor dependencies and documented vendor recovery SLA) ÷ (total critical recovery procedures with vendor dependencies) × 100
| Status | Criteria |
|---|---|
| Green | ≥90% of vendor-dependent recovery procedures with documented vendor SLA; vendor SLAs reviewed at contract renewal; concentration risk identified and mitigation planned |
| Amber | 70–89%; or vendor SLAs not reviewed in last contract cycle; or known concentration risk without mitigation |
| Red | <70%; or recovery plans with vendor dependencies and no vendor SLA; or recovery plan assumes vendor availability during the same event that triggered recovery |
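The coverage formula, the SLA gap described under "Why it matters", and a basic concentration count can all be run over one list of recovery procedures. This is a sketch under stated assumptions: the `RecoveryProcedure` record, its fields, and the vendor names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecoveryProcedure:
    name: str
    vendor: Optional[str]                  # None if there is no vendor dependency
    vendor_sla_hours: Optional[float]      # contractual recovery commitment, if any
    assumed_vendor_hours: Optional[float]  # what the recovery plan assumes


def vendor_sla_coverage(procs: List[RecoveryProcedure]) -> Optional[float]:
    """Percentage of vendor-dependent procedures with a documented vendor SLA."""
    dependent = [p for p in procs if p.vendor is not None]
    if not dependent:
        return None  # KRI not applicable
    covered = [p for p in dependent if p.vendor_sla_hours is not None]
    return 100.0 * len(covered) / len(dependent)


def sla_gaps(procs: List[RecoveryProcedure]) -> List[str]:
    """Procedures with no vendor SLA, or that assume faster recovery than the SLA commits to."""
    return [
        p.name
        for p in procs
        if p.vendor is not None
        and (
            p.vendor_sla_hours is None
            or (p.assumed_vendor_hours is not None and p.assumed_vendor_hours < p.vendor_sla_hours)
        )
    ]


def vendor_concentration(procs: List[RecoveryProcedure]) -> Counter:
    """Count of recovery procedures per vendor; high counts flag concentration risk."""
    return Counter(p.vendor for p in procs if p.vendor is not None)


procs = [
    RecoveryProcedure("restore-db", "cloud-a", 8.0, 4.0),         # plan assumes 4 h, SLA says 8 h
    RecoveryProcedure("restore-files", "backup-b", None, 12.0),   # no SLA on file
    RecoveryProcedure("rebuild-dns", "cloud-a", 4.0, 4.0),
    RecoveryProcedure("manual-payroll", None, None, None),
]
print(vendor_sla_coverage(procs))
print(sla_gaps(procs))              # ['restore-db', 'restore-files']
print(vendor_concentration(procs))
```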
Deriving these KRIs by source type
From Backup Platforms (Veeam, Rubrik, Cohesity, Commvault)
- Backup job completion: Dashboard or API showing backup job success rate, last completion time, and data size; calculate the RPO gap for any missed jobs
- Immutability status: Rubrik immutable snapshots API; Veeam backup copy with immutability settings; Cohesity DataLock
- Restoration test logs: Automated test-restore features (Veeam SureBackup, Cohesity DataProtect verified backups, or the Rubrik equivalent); export test results with dates and success flags
- Air-gap verification: Network connectivity logs showing when backup storage is connected vs. isolated
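The RPO gap for missed jobs mentioned above can be estimated from the timestamps of successful backups alone. This is a minimal sketch: the function name and the sample timestamps are invented, and it treats the longest interval between consecutive recovery points (including the interval up to now) as the worst-case data loss.

```python
from datetime import datetime, timedelta
from typing import List


def worst_recovery_point_gap(successful_backups: List[datetime], now: datetime) -> timedelta:
    """Longest stretch without a usable recovery point, up to the present."""
    if not successful_backups:
        raise ValueError("no successful backups recorded")
    points = sorted(successful_backups) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


stated_rpo = timedelta(hours=24)
backups = [
    datetime(2025, 6, 1),
    datetime(2025, 6, 2),
    datetime(2025, 6, 4),  # the 3 June job was missed
]
gap = worst_recovery_point_gap(backups, now=datetime(2025, 6, 4, 12))
print(gap, "RPO breached" if gap > stated_rpo else "within RPO")
```

Here the missed job produces a 48-hour gap against a 24-hour RPO, even though the most recent backup is only 12 hours old.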
From Cloud Providers (AWS, Azure, GCP)
- AWS Backup Vault Lock: `aws backup describe-backup-vault --backup-vault-name <vault-name>`, then check the `Locked`, `MinRetentionDays`, `MaxRetentionDays`, and `LockDate` fields to confirm WORM protection is enabled
- RTO validation, Route 53: Failover routing policy health check configuration; test by disabling the primary endpoint health check and measuring failover time in a test environment
- Azure Site Recovery: Replication health report; recovery plan test execution records with `testDate` and `testStatus` fields via ASR API
- GCP Cloud Storage retention: `gsutil retention get gs://bucket-name`; verify the retention policy prevents deletion
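As one way to turn the AWS check into a recorded KRI value, the JSON printed by `aws backup describe-backup-vault` can be parsed and summarized. This sketch assumes the response carries `Locked`, `MinRetentionDays`, `MaxRetentionDays`, and `LockDate`; confirm the field names against the current AWS Backup API reference. The function name, vault name, and sample payload are invented for illustration.

```python
import json


def vault_lock_summary(describe_output: str) -> dict:
    """Summarize Vault Lock state from describe-backup-vault JSON output."""
    vault = json.loads(describe_output)
    locked = bool(vault.get("Locked", False))
    return {
        "vault": vault.get("BackupVaultName"),
        "locked": locked,
        # Without a lock date, the lock configuration can still be changed or removed.
        "lock_date_set": locked and vault.get("LockDate") is not None,
        "min_retention_days": vault.get("MinRetentionDays"),
        "max_retention_days": vault.get("MaxRetentionDays"),
    }


# Hypothetical sample payload, trimmed to the fields used above.
sample = json.dumps({
    "BackupVaultName": "prod-critical",
    "Locked": True,
    "MinRetentionDays": 30,
    "MaxRetentionDays": 365,
    "LockDate": "2025-03-01T00:00:00+00:00",
})
summary = vault_lock_summary(sample)
print(summary)
```

A vault that reports as locked but has no lock date is worth flagging separately, since the lock can still be removed by anyone with sufficient permissions.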
From Business Units
- BIA validation: Interview or survey process in which business unit leaders confirm MTD and dependencies annually; changes in their responses signal BIA staleness
- Manual fallback testing: Exercises where business units operate without primary systems for a defined period, which validates the manual procedure documentation
- Communication tree test: Annual out-of-band communication drill that measures reach percentage and latency
From IT Infrastructure
- Failover timing tests: DNS TTL validation; database failover timing (PostgreSQL replication lag, SQL Server Always On failover time); load balancer health check intervals. All of these affect actual RTO
- Recovery sequence validation: Systems with interdependencies need to be recovered in the correct order; test that the recovery sequencing documentation matches the actual dependency graph
- Cross-region replication lag: Cloud database replication lag (AWS Aurora Global Database lag; Azure SQL Geo-Replication lag), which directly determines achieved RPO
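The timing inputs in this list combine into a rough design-time estimate. The sketch below assumes a simple additive model: detection time (health check interval times failure threshold), plus DNS TTL expiry, plus database promotion time. The model, function names, and sample figures are assumptions, and a measured failover test should always take precedence over the estimate.

```python
from typing import Iterable


def estimated_failover_seconds(
    check_interval_s: float,
    failure_threshold: int,
    dns_ttl_s: float,
    promotion_s: float = 0.0,
) -> float:
    """Rough failover time: detect the failure, wait out DNS caches, promote the replica."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + dns_ttl_s + promotion_s


def worst_case_rpo_seconds(lag_samples_s: Iterable[float]) -> float:
    """Achieved RPO is bounded by the worst replication lag observed."""
    return max(lag_samples_s)


# 30 s health checks, 3 consecutive failures, 60 s TTL, 120 s replica promotion
print(estimated_failover_seconds(30, 3, 60, 120))   # 270
print(worst_case_rpo_seconds([0.8, 1.2, 45.0, 2.1]))  # 45.0
```

Using the maximum lag rather than the average is deliberate: an outage that coincides with a lag spike loses the spike's worth of data, not the typical amount.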
Draxis turns these KRIs into a live signal
Draxis connects to the tools you already run (backup platforms, DR orchestration, and cloud recovery tooling) and computes these BC/DR KRIs automatically, with the green/amber/red bands, trend lines, and drift alerts described above. No spreadsheets, no manual stitching.