KRI Series: Security Operations

Every other security domain feeds the SOC. Identity, endpoint, network, and application controls all generate signals that someone has to detect, triage, and act on. When the SOC is not working, the spend on every one of those controls is partly wasted, because nobody is watching the alerts they produce.

The KRIs below measure three things: how well you detect, how fast you respond, and whether the operational plumbing (SIEM tuning, log coverage, analyst capacity) holds up. A SOC that fires 50,000 alerts a day can still be blind to the attack that matters. A quiet console can be quiet because nothing is watching. These indicators tell you which one you have. For the full catalog with thresholds mapped to CIS Controls, see the KRI reference library, and if you are standing up measurement from scratch, start with how to build a KRI program.

In this guide

KRI inventory
Deriving these KRIs by source type

Framework mapping

CIS Controls v8

The KRIs in this domain implement and measure these CIS Critical Security Controls:

CIS 8, Audit Log Management. Critical-asset log coverage and SIEM ingestion completeness.
CIS 13, Network Monitoring and Defense. Network monitoring coverage, alert fidelity, and detection-source coverage.
CIS 17, Incident Response Management. MTTD, MTTC, and playbook coverage.
CIS 10, Malware Defenses. Endpoint malware detection feeding the SOC.

KRI inventory

Eight indicators, each with what to measure, why it matters, where the data comes from, how to calculate it, and green/amber/red thresholds. The thresholds are starting points; tune them to your environment and threat model.

1. Alert-to-investigation ratio (alert fidelity)

What to measure. The share of security alerts that result in actual analyst investigation, expressed as the percentage of alerts not auto-closed, suppressed, or discarded without human review.

Why it matters. Alert fatigue is a leading cause of SOC analyst burnout and of missed detections. A SIEM generating 50,000 alerts a day with 99% auto-closed still leaves 500 for human review. If that is more than the team can work, analysts start ignoring the queue. This ratio tells you whether your alerting produces useful signal or noise.

Derivation sources:

SIEM platform (Splunk Enterprise Security, Microsoft Sentinel, IBM QRadar, Elastic SIEM): alert volume and disposition breakdown (auto-closed, investigated, escalated, false_positive)
SOAR platform (Palo Alto XSOAR, Splunk SOAR, Tines): playbook execution logs showing auto-disposition vs. human review
Ticketing system (ServiceNow, Jira, PagerDuty): tickets created from security alerts vs. total alert volume
MDR provider reports: escalation rate vs. total alert volume, if you use an MDR

How to calculate. (Alerts resulting in human investigation) ÷ (total alerts generated) × 100. Also track false positive rate: (alerts investigated and closed as false positive) ÷ (total investigated alerts) × 100.

Status	Criteria
Green	>10% of alerts result in meaningful investigation; false positive rate <30% of investigated alerts
Amber	2–10% fidelity rate; or false positive rate 30–60%
Red	<2% fidelity rate (noise-dominated); or false positive rate >60% (analysts have stopped trusting alerts)
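The two ratios and the thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the disposition labels mirror the ones named in the SIEM source above, and it assumes alerts closed as false_positive were reviewed by a human, so they count as investigated. Adjust the label set to whatever your platform exports.

```python
from collections import Counter

# Assumption: these dispositions imply a human looked at the alert.
HUMAN_REVIEWED = {"investigated", "escalated", "false_positive"}

def alert_fidelity(dispositions):
    """Return (fidelity_pct, false_positive_pct) for a list of disposition labels."""
    counts = Counter(dispositions)
    total = sum(counts.values())
    reviewed = sum(counts[d] for d in HUMAN_REVIEWED)
    fidelity = 100.0 * reviewed / total if total else 0.0
    fp_rate = 100.0 * counts["false_positive"] / reviewed if reviewed else 0.0
    return fidelity, fp_rate

def fidelity_status(fidelity, fp_rate):
    """Apply the green/amber/red criteria from the table."""
    if fidelity < 2 or fp_rate > 60:
        return "red"
    if fidelity <= 10 or fp_rate >= 30:
        return "amber"
    return "green"

# Made-up sample day: 1,000 alerts, 100 of them reviewed by a human.
sample = (["auto-closed"] * 900 + ["investigated"] * 70
          + ["escalated"] * 10 + ["false_positive"] * 20)
fidelity, fp_rate = alert_fidelity(sample)
print(f"fidelity={fidelity:.1f}% fp_rate={fp_rate:.1f}% "
      f"status={fidelity_status(fidelity, fp_rate)}")
```

With this sample the fidelity rate lands at exactly 10%, which the table treats as amber; green needs more than 10%.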

2. Detection coverage rate

What to measure. The percentage of MITRE ATT&CK tactics and techniques relevant to your threat model that have at least one active detection rule or analytic in your SIEM or detection platform.

Why it matters. Detection coverage gaps are blind spots. An attacker using a technique you have no detection for can operate indefinitely without triggering an alert. Mapping your detection rules to ATT&CK shows which attack paths are visible and which are not, which is the foundation of any mature detection engineering program.

Derivation sources:

SIEM rule inventory: export all active detection rules and map to ATT&CK technique IDs (many SIEMs support ATT&CK tagging natively)
ATT&CK Navigator: visual mapping of covered vs. uncovered techniques
Detection engineering platforms (Sigma rules, Microsoft Sentinel Analytics, Chronicle YARA-L): rule libraries with ATT&CK mappings
EDR platform: built-in detections mapped to ATT&CK (CrowdStrike Falcon, SentinelOne)
Threat intelligence: your threat model, which actors and techniques are most relevant to your industry

How to calculate. (ATT&CK techniques with at least one active, tested detection) ÷ (total ATT&CK techniques in scope for your threat model) × 100.

Status	Criteria
Green	>70% coverage of threat-model-relevant techniques; coverage mapped and reviewed quarterly
Amber	40–70% coverage; or coverage mapped but not validated through purple team testing
Red	<40% coverage; or no ATT&CK-mapped detection inventory; or detection rules deployed but never tested
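As a rough sketch of the coverage calculation, assume you can export the rule inventory as records carrying ATT&CK technique IDs plus two flags, active and tested. Those field names are hypothetical and will differ by SIEM. Only rules that are both active and tested count, and techniques outside the threat model are ignored.

```python
def detection_coverage(in_scope, rules):
    """Return (coverage_pct, uncovered_techniques).

    in_scope: set of ATT&CK technique IDs in the threat model.
    rules: list of dicts with hypothetical keys 'techniques', 'active', 'tested'.
    """
    covered = set()
    for rule in rules:
        if rule["active"] and rule["tested"]:
            covered.update(rule["techniques"])
    covered &= in_scope  # drop techniques that are out of scope
    pct = 100.0 * len(covered) / len(in_scope) if in_scope else 0.0
    return pct, sorted(in_scope - covered)

# Made-up inventory for illustration.
in_scope = {"T1059", "T1078", "T1566", "T1021"}
rules = [
    {"techniques": ["T1059", "T1566"], "active": True, "tested": True},
    {"techniques": ["T1078"], "active": True, "tested": False},   # never tested
    {"techniques": ["T1021"], "active": False, "tested": True},   # disabled
    {"techniques": ["T1003"], "active": True, "tested": True},    # out of scope
]
pct, gaps = detection_coverage(in_scope, rules)
print(f"coverage={pct:.0f}% gaps={gaps}")
```

Returning the uncovered list alongside the percentage keeps the number actionable: the gaps are the detection engineering backlog.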

3. Mean time to detect (MTTD) by severity

What to measure. Average hours from the earliest log evidence of attack behavior to the creation of an alert or incident ticket, segmented by incident severity (critical, high, medium).

Why it matters. MTTD is one of the most predictive single metrics for breach cost. It measures the gap between when an attacker starts doing damage and when your team knows about it. Segmenting by severity shows whether detection is tuned for the highest-priority threats. To catch this trend sliding before it becomes an incident, see reading the signal on drifting KRIs.

Derivation sources:

SIEM: first-indicator-of-compromise log timestamp vs. alert creation timestamp, per closed incident
Incident management platform: incident timeline fields (event_start, detected, contained), which require post-incident review discipline to populate accurately
EDR (Microsoft Defender for Endpoint): process start time of malicious activity vs. detection or alert timestamp
SOAR: automated incident timeline reconstruction from log correlation

How to calculate. For each closed incident: (detection timestamp) − (earliest confirmed event timestamp from forensic review). Average by severity tier.

Status	Criteria
Green	Critical <1 hour; high <4 hours; medium <24 hours
Amber	Critical 1–4 hours; high 4–24 hours
Red	Critical >4 hours; or no MTTD measurement capability (no post-incident log correlation)
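The per-incident subtraction and the severity average might look like the following. It is a sketch that assumes closed incidents are exported with ISO 8601 strings in the event_start and detected timeline fields mentioned above, plus a severity label. The sample records are invented.

```python
from collections import defaultdict
from datetime import datetime

def mttd_by_severity(incidents):
    """Average hours from event_start to detected, grouped by severity."""
    buckets = defaultdict(list)
    for inc in incidents:
        start = datetime.fromisoformat(inc["event_start"])
        detected = datetime.fromisoformat(inc["detected"])
        buckets[inc["severity"]].append((detected - start).total_seconds() / 3600)
    return {sev: sum(hours) / len(hours) for sev, hours in buckets.items()}

# Made-up closed incidents.
incidents = [
    {"severity": "critical", "event_start": "2024-05-01T10:00:00", "detected": "2024-05-01T10:30:00"},
    {"severity": "critical", "event_start": "2024-05-03T08:00:00", "detected": "2024-05-03T09:30:00"},
    {"severity": "high", "event_start": "2024-05-04T12:00:00", "detected": "2024-05-04T15:00:00"},
]
for severity, hours in mttd_by_severity(incidents).items():
    print(f"{severity}: {hours:.1f} h")
```

A mean hides outliers, so it can be worth reporting the median or the worst case next to it when incident counts are small.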

4. SIEM log source coverage and health

What to measure. The percentage of expected log sources actively ingesting into the SIEM within the defined freshness window, plus the rate of log source health failures: sources that stop sending logs without alerting the security team.

Why it matters. A SIEM is only as good as the data feeding it. Log sources fail silently. A server stops sending syslog, a cloud audit trail stops after a config change, a network device drops its connection, and detection goes blind for that source with no alert. Without proactive coverage monitoring, you discover the gap during post-incident forensic reconstruction, not before.

The silent failure that costs you the incident

The most expensive log gap is the one you find while writing the post-incident report. A source that quietly stopped reporting three weeks ago means the detection that depended on it never had a chance to fire. Alert on source disconnection, not just on the detections themselves.

Derivation sources:

SIEM platform (Splunk Enterprise Security): source inventory config vs. sources with events in the last 24 hours; data freshness dashboard
Expected source inventory: CMDB critical asset list cross-referenced against the SIEM source list
Log management platform (Cribl, Logstash, Fluentd): pipeline health metrics, source disconnection events
Cloud logging services (CloudTrail, Azure Monitor, GCP Cloud Logging): log delivery health checks

How to calculate. Coverage: (active SIEM sources with events in last 24h) ÷ (expected sources from critical asset inventory) × 100. Health: (log source disconnection events per month with alert generated) ÷ (total disconnection events) × 100.

Status	Criteria
Green	>99% of critical sources actively ingesting; source disconnection alerting active; gaps investigated within 1 hour
Amber	95–99% active; or source loss detection latency >4 hours
Red	<95% active; or no monitoring for source loss; or critical sources offline discovered during an incident
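Both halves of this KRI reduce to set and count arithmetic once you have three inputs: the expected source list, a last-seen timestamp per source, and a log of disconnection events. The sketch below assumes those inputs already exist as plain Python structures; the source names and the alerted flag are hypothetical.

```python
from datetime import datetime, timedelta

def source_coverage(expected, last_seen, now, window=timedelta(hours=24)):
    """Return (coverage_pct, stale_sources) against the freshness window."""
    active = {s for s in expected
              if s in last_seen and now - last_seen[s] <= window}
    pct = 100.0 * len(active) / len(expected) if expected else 0.0
    return pct, sorted(set(expected) - active)

def disconnect_alerting_rate(disconnections):
    """Percentage of disconnection events that generated an alert, or None if there were none."""
    if not disconnections:
        return None
    alerted = sum(1 for d in disconnections if d["alerted"])
    return 100.0 * alerted / len(disconnections)

# Made-up inventory: one source is stale, one has never reported.
now = datetime(2024, 5, 2, 12, 0)
expected = {"dc01", "fw-edge", "cloudtrail-prod", "db01"}
last_seen = {
    "dc01": now - timedelta(hours=1),
    "fw-edge": now - timedelta(hours=30),
    "cloudtrail-prod": now - timedelta(hours=2),
}
events = [{"alerted": True}, {"alerted": True}, {"alerted": True}, {"alerted": False}]
print(source_coverage(expected, last_seen, now))
print(disconnect_alerting_rate(events))
```

Sources that never reported at all are treated the same as stale ones, since either way the detections that depend on them cannot fire.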

5. Mean time to contain (MTTC)

What to measure. Average hours from incident detection to confirmed containment: attacker access removed, malware isolated, lateral movement stopped.

Why it matters. Detection without fast containment still produces large losses. The gap between detection and containment is the window during which data exfiltrates, ransomware spreads, and persistence gets established. MTTC is the second half of the damage equation after MTTD.

Derivation sources:

Incident management platform: containment action timestamps per incident (needs a defined containment_confirmed state in the workflow)
EDR (Microsoft Defender for Endpoint, CrowdStrike Falcon): endpoint isolation timestamps vs. alert creation timestamps
Identity platform: account disable timestamps for compromised accounts vs. detection
Network: firewall or NAC block rule creation for isolating compromised segments
SOAR: automated containment playbook execution timestamps

How to calculate. For each incident: (containment confirmed timestamp) − (detection timestamp). Average by severity tier.

Status	Criteria
Green	Critical <2 hours; high <8 hours
Amber	Critical 2–8 hours; high 8–24 hours
Red	Critical >8 hours; or containment not tracked as a discrete milestone
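One way to turn containment timestamps into a status per severity tier is sketched below. The detected and contained field names follow the incident timeline fields mentioned earlier. The table does not define a red threshold for high severity, so treating high above 24 hours as red is an assumption of this sketch.

```python
from datetime import datetime
from statistics import mean

# Hours: (green upper bound, amber upper bound). The 24 h red cutoff for
# "high" is an assumption; the table only defines red for critical.
MTTC_LIMITS = {"critical": (2, 8), "high": (8, 24)}

def mttc_report(incidents):
    """Return {severity: (mean_hours, status)} from detected/contained timestamps."""
    hours = {}
    for inc in incidents:
        detected = datetime.fromisoformat(inc["detected"])
        contained = datetime.fromisoformat(inc["contained"])
        elapsed = (contained - detected).total_seconds() / 3600
        hours.setdefault(inc["severity"], []).append(elapsed)
    report = {}
    for severity, values in hours.items():
        avg = mean(values)
        limits = MTTC_LIMITS.get(severity)
        if limits is None:
            status = "no threshold"
        elif avg < limits[0]:
            status = "green"
        elif avg <= limits[1]:
            status = "amber"
        else:
            status = "red"
        report[severity] = (round(avg, 2), status)
    return report

# Made-up incidents.
incidents = [
    {"severity": "critical", "detected": "2024-05-01T10:00:00", "contained": "2024-05-01T11:00:00"},
    {"severity": "critical", "detected": "2024-05-02T09:00:00", "contained": "2024-05-02T14:00:00"},
    {"severity": "high", "detected": "2024-05-03T08:00:00", "contained": "2024-05-03T14:00:00"},
]
print(mttc_report(incidents))
```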

6. Playbook coverage rate

What to measure. The percentage of common incident types with a documented, tested response playbook covering escalation paths, containment actions, communication templates, and evidence preservation.

Why it matters. Incident response without playbooks degrades to improvisation under pressure. Playbooks encode institutional knowledge, cut response time by removing decisions during triage, and keep actions consistent across shifts and analysts. Coverage rate measures whether your most common incident types have the operational support they need.

Derivation sources:

SOAR platform: automated playbook inventory and activation rates
Incident tracking system: incident type taxonomy vs. playbook existence
SOC runbook documentation (Confluence, internal wiki): manual playbook library with last-tested dates
Incident history: trailing 12-month incident type frequency. High-frequency types with no playbook are the highest-priority gaps

How to calculate. Weight each incident type by its trailing 12-month incident count: (incidents whose type has a documented and tested playbook) ÷ (total incidents across all types in the taxonomy) × 100.

Status	Criteria
Green	>90% of incident types by frequency covered; playbooks tested at least annually; automated playbooks in SOAR for the top 5 incident types
Amber	70–90% by frequency; or playbooks documented but not tested
Red	<70% by frequency; or no playbooks for the top 3 most frequent incident types
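Frequency weighting is easier to see in code than in prose. In this sketch each incident type is weighted by its trailing 12-month count, so a missing playbook for a common type hurts more than one for a rare type. The type names and counts are invented.

```python
def playbook_coverage(incident_counts, tested_playbooks):
    """Return (frequency_weighted_pct, gaps ordered by incident count, highest first)."""
    total = sum(incident_counts.values())
    covered = sum(n for t, n in incident_counts.items() if t in tested_playbooks)
    pct = 100.0 * covered / total if total else 0.0
    gaps = sorted((t for t in incident_counts if t not in tested_playbooks),
                  key=lambda t: -incident_counts[t])
    return pct, gaps

# Made-up trailing 12-month counts per incident type.
incident_counts = {"phishing": 120, "malware": 50, "account_compromise": 20, "data_loss": 10}
tested_playbooks = {"phishing", "malware"}
pct, gaps = playbook_coverage(incident_counts, tested_playbooks)
print(f"coverage={pct:.0f}% next playbooks to write: {gaps}")
```

Only half of the types have a playbook here, but they account for 170 of 200 incidents, so the weighted rate is 85%.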

7. Analyst capacity utilization rate

What to measure. Analyst productive time on investigation and response as a percentage of total available analyst hours, a measure of whether SOC capacity matches the alert and incident workload.

Why it matters. Overloaded analysts miss alerts, cut corners on investigations, and churn. Underutilized analysts represent misallocated investment. Neither extreme produces good security outcomes. Capacity utilization is the leading indicator of SOC burnout and the operational number that justifies staffing decisions.

Derivation sources:

SIEM and SOAR: mean time per alert investigation × alert volume = estimated investigation hours
Ticketing system: ticket open and close rates and time-in-progress per analyst
Shift schedules: available analyst hours per period
Time tracking, if used: actual investigation time logs
Analyst feedback: qualitative signal on workload through regular 1:1s or surveys

How to calculate. Utilization: (analyst hours spent on investigation and response) ÷ (total available analyst hours) × 100. The target is 60–70% of analyst time on active investigation, leaving 30–40% for tuning, training, and documentation. Alert load per analyst: total alerts requiring human review ÷ analyst headcount per shift. Investigation completion rate: percentage of alerts triaged within the defined SLA.

Status	Criteria
Green	60–75% investigation utilization; <10% of alerts missing triage SLA
Amber	>80% utilization (overloaded) or <40% (underutilized, often a sign of alert fatigue or suppression); or >20% of alerts missing SLA
Red	Systematic SLA breaches; analyst attrition above industry baseline; or no capacity measurement
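The estimate described in the derivation sources (mean time per investigation × alert volume, compared against scheduled hours) can be sketched as follows. All parameter names are illustrative, and the sample week assumes three analysts per shift across 21 eight-hour shifts.

```python
def capacity_metrics(alerts_reviewed, mean_minutes_per_alert,
                     analysts_per_shift, shift_hours, shifts, sla_missed):
    """Estimate utilization, alert load, and SLA miss rate for one reporting period."""
    if alerts_reviewed <= 0 or analysts_per_shift <= 0 or shift_hours <= 0 or shifts <= 0:
        raise ValueError("volumes, staffing, and shift values must be positive")
    investigation_hours = alerts_reviewed * mean_minutes_per_alert / 60
    available_hours = analysts_per_shift * shift_hours * shifts
    return {
        "utilization_pct": 100.0 * investigation_hours / available_hours,
        "alerts_per_analyst_shift": alerts_reviewed / (analysts_per_shift * shifts),
        "sla_miss_pct": 100.0 * sla_missed / alerts_reviewed,
    }

# Made-up week: 1,134 alerts needing review at 20 minutes each.
week = capacity_metrics(alerts_reviewed=1134, mean_minutes_per_alert=20,
                        analysts_per_shift=3, shift_hours=8, shifts=21,
                        sla_missed=63)
for name, value in week.items():
    print(f"{name}: {value:.1f}")
```

This is an estimate built from averages. Where time tracking exists, actual investigation logs are a better numerator.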

8. Threat intelligence integration rate

What to measure. The percentage of active threat intelligence feeds whose indicators (IOCs) are operationally integrated into detection tooling (firewalls, SIEM, EDR, email gateway) with automated blocking or alerting on matched indicators.

Why it matters. Threat intelligence that lives in a portal but is not wired into detection tools is intelligence nobody is acting on. Integration rate measures whether your threat intel spend produces operational outcomes or just adds context for analysts who are already overloaded.

Derivation sources:

Threat intelligence platform (Recorded Future, Intel 471, MISP, OpenCTI): feed inventory and indicator export status
SIEM: threat intelligence indicator match rules, confirming whether feeds contribute to detection
Firewall or proxy: IP and domain block lists sourced from threat intelligence
EDR: threat intelligence integration for IOC matching
SOAR: automated IOC enrichment in incident investigation playbooks

How to calculate. (TI feeds with automated indicator distribution to at least one detection tool) ÷ (total TI feeds subscribed) × 100. Also track IOC match rate: active detections triggered by threat intelligence IOCs per month.

Status	Criteria
Green	>90% of TI feeds producing integrated IOCs to detection tools; IOC match rate tracked monthly
Amber	60–90% integrated; or TI feeds in platform but analyst-driven rather than automated
Red	<60% integrated; or TI subscriptions with no operational integration; or IOC match rate unknown
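A feed counts as integrated only if its indicators flow automatically to at least one detection tool. A minimal sketch, assuming you can list the export targets per feed (the feed and tool names here are placeholders):

```python
def ti_integration_rate(feeds):
    """Return (integrated_pct, feeds with no automated export).

    feeds: dict mapping feed name to the list of detection tools that
    receive its indicators automatically (an empty list means portal-only).
    """
    integrated = {name for name, tools in feeds.items() if tools}
    pct = 100.0 * len(integrated) / len(feeds) if feeds else 0.0
    return pct, sorted(set(feeds) - integrated)

# Made-up feed inventory.
feeds = {
    "feed_a": ["siem", "firewall"],
    "feed_b": ["edr"],
    "feed_c": [],            # subscribed, but nothing consumes it
    "feed_d": ["siem"],
}
pct, unused = ti_integration_rate(feeds)
print(f"integrated={pct:.0f}% portal-only feeds: {unused}")
```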

Deriving these KRIs by source type

Where each signal comes from, and the query patterns that get you there. The exact syntax depends on your platform, but the shape of the calculation carries across tools.

From SIEM platforms (Splunk, Microsoft Sentinel, Elastic, QRadar)

Alert volume and disposition: index=alerts | stats count by status | where status IN ("investigated","auto-closed","false_positive"), adjusted to your SIEM query language. See the Splunk Enterprise Security setup guide.
Log source health: built-in data freshness dashboards (Sentinel: Heartbeat table; Splunk: | tstats count where index=* by host compared to the expected source list)
MTTD calculation: join alert creation time with earliest correlated event time, which needs accurate incident timeline logging or SOAR enrichment
ATT&CK coverage: tag detection rules with ATT&CK technique IDs, then | stats dc(technique_id) as covered_techniques against the total in-scope technique count

From SOAR platforms (Palo Alto Cortex XSOAR, Splunk SOAR, Tines)

Playbook activation rates: which playbooks run most often, identifying the highest-volume incident types
Automated vs. human action ratio: actions taken by playbook vs. actions requiring an analyst decision
Containment action timestamps: automated isolation and block actions carry precise timestamps for MTTC calculation
Analyst workload: playbook escalation rate to a human analyst, by playbook type

From EDR and XDR platforms (CrowdStrike Falcon, SentinelOne, Microsoft Defender)

Detection-to-isolation time: endpoint isolation timestamp vs. detection alert creation, an MTTC component (CrowdStrike Falcon, Microsoft Defender for Endpoint)
ATT&CK technique coverage: built-in ATT&CK mapping in detection events (Falcon Intelligence, SentinelOne MITRE report)
Alert volume by technique: high-volume, low-fidelity techniques that need tuning vs. low-volume, high-fidelity detections
Agent health: reporting agents vs. total managed endpoints, which feeds the coverage KRI

From threat intelligence platforms (Recorded Future, MISP, OpenCTI)

Feed freshness: indicator last-updated timestamps, since stale TI feeds produce outdated IOCs
IOC distribution audit: which feeds are configured to export to SIEM, firewall, and EDR, giving integration coverage
Match rate: cross-reference TI IOCs against SIEM logs, for example | lookup threat_intel_iocs dest_ip as dest_ip OUTPUT confidence in Splunk
Threat actor tracking: coverage of threat actors relevant to your industry vertical, as a gap analysis

From incident management (PagerDuty, ServiceNow, Jira)

Incident timeline fields: populating event_start, detected, contained, and resolved timestamps takes discipline; automate via SOAR integration where possible
SLA compliance: configure SLA timers in ServiceNow or Jira for MTTD and MTTC targets, and track the breach rate
Incident type taxonomy: standardize incident classification to enable the playbook coverage calculation
Post-incident review completion rate: percentage of incidents above the severity threshold with a completed PIR, which feeds back into detection improvement
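The post-incident review completion rate above is a simple filtered ratio. The sketch below assumes incidents are exported with a severity label and a boolean pir_completed flag; both the field name and the severity ordering are hypothetical and should follow your own taxonomy.

```python
# Assumed severity ordering; adjust to your taxonomy.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def pir_completion_rate(incidents, threshold="high"):
    """Percentage of incidents at or above the threshold with a completed PIR."""
    floor = SEVERITY_RANK[threshold]
    in_scope = [i for i in incidents if SEVERITY_RANK[i["severity"]] >= floor]
    if not in_scope:
        return None  # nothing above the threshold in this period
    done = sum(1 for i in in_scope if i.get("pir_completed"))
    return 100.0 * done / len(in_scope)

# Made-up incident export.
incidents = [
    {"severity": "critical", "pir_completed": True},
    {"severity": "critical", "pir_completed": True},
    {"severity": "high", "pir_completed": True},
    {"severity": "high", "pir_completed": False},
    {"severity": "medium", "pir_completed": False},  # below threshold, ignored
]
print(pir_completion_rate(incidents))
```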

These signals do not live in one place, which is why deriving them by hand means weeks of stitching across the SIEM, SOAR, EDR, and ticketing systems. The point of tracking them as KRIs is to make the stitching continuous instead of a once-a-quarter scramble. Related domains share many of these sources: see the sibling guides on enterprise security KRIs, security architecture KRIs, and identity and access management KRIs.
