KRI Series: Red Team & Offensive Security Program

These KRIs separate security programs that know they have controls from programs that have verified those controls work against real adversary techniques.

If you are standing this up from scratch, start with how to build a KRI program and the consolidated KRI reference library, which maps every domain to one CIS-aligned catalog.

In this guide

KRI inventory
Deriving these KRIs by source type

Framework mapping

CIS Controls v8

The KRIs in this domain implement and measure these CIS Critical Security Controls:

CIS 18, Penetration Testing. Operation cadence and scope, finding remediation, and detection coverage against simulated TTPs.

KRI inventory

1. Red team operation cadence and scope coverage

What to measure. Frequency of red team operations (full-scope adversarial simulation engagements) across the past 24 months, and the percentage of the organization's critical attack surface included in scope: corporate network, cloud environments, external attack surface, insider threat scenarios, and physical access.

Why it matters. A red team that only tests the corporate network leaves cloud, OT, and physical attack vectors untested. Scope coverage measures whether the offensive security program is generating confidence across the actual attack surface or only in the segments most convenient to test.

Red team operation records: engagement dates, scope, objectives, methodology (black box, assumed breach, purple team)
Engagement scope documentation: network segments, cloud accounts, applications, and physical locations included
Red team roadmap: planned future engagements vs. coverage gaps

How to calculate.

Cadence: months since last full-scope red team operation
Scope coverage: (Critical attack surface areas covered in last 24 months) ÷ (total defined critical attack surface areas) × 100

Status	Criteria
Green	Full-scope operation within 18 months; all critical surface areas covered in trailing 24 months; assumed-breach scenario tested in last 24 months
Amber	18–36 months since full-scope operation; or coverage gaps in cloud or physical domains
Red	>36 months; or only point-in-time penetration tests (no full adversary simulation); or scope consistently limited to network perimeter
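
The cadence and scope coverage formulas above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the area names, the engagement record fields (`date`, `full_scope`, `areas`), and the function names are all assumptions, and the banding is simplified to the cadence and coverage criteria only.

```python
from datetime import date

# Assumed area names, taken from the scope list in "What to measure".
CRITICAL_AREAS = {"corporate_network", "cloud", "external", "insider", "physical"}


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    return months - 1 if later.day < earlier.day else months


def cadence_and_coverage(engagements: list[dict], today: date) -> dict:
    """Cadence in months and scope coverage (%) over the trailing 24 months."""
    full_scope = [e["date"] for e in engagements if e["full_scope"]]
    cadence = months_between(max(full_scope), today) if full_scope else None
    covered = set()
    for e in engagements:
        if months_between(e["date"], today) < 24:
            covered |= set(e["areas"]) & CRITICAL_AREAS
    coverage = 100.0 * len(covered) / len(CRITICAL_AREAS)
    return {"cadence_months": cadence, "coverage_pct": coverage}


def status(cadence_months, coverage_pct: float) -> str:
    """Simplified banding: ignores the assumed-breach and pen-test-only criteria."""
    if cadence_months is None or cadence_months > 36:
        return "red"
    if cadence_months <= 18 and coverage_pct == 100.0:
        return "green"
    return "amber"
```

An organization whose only recent engagement covered network, cloud, and external surface would score 60% coverage here and land in amber regardless of how recent that engagement was.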

2. Critical finding remediation rate and speed

What to measure. Percentage of critical and high findings from red team operations and penetration tests remediated within their committed SLA, and mean days from finding identification to confirmed remediation.

Why it matters. The only value an offensive finding produces is the improvement it creates. Finding remediation rate is the outcome metric for the entire offensive security investment. Findings that sit in a backlog for quarters signal either that the remediation process is broken or that the findings aren't being taken seriously.

Finding tracking system (PlexTrac, Dradis, Jira with security project): findings from all offensive engagements with creation date, severity, owner, and resolution date
Engineering/IT ticketing: remediation tickets linked to offensive findings
Re-test records: retests confirming remediation rather than closure on the word of the remediating team

How to calculate.

SLA compliance: (Critical findings remediated within 30 days + High findings remediated within 60 days) ÷ (total critical and high findings past their respective SLA deadline) × 100
Mean time to remediate: average days from identification to confirmed remediation, by severity

Status	Criteria
Green	≥90% of critical findings remediated within 30 days; re-testing conducted to confirm; tracking system current
Amber	70–89%; or findings tracked but closure unconfirmed through retest
Red	<70%; or critical findings open >60 days; or no finding tracking system; or findings routinely marked closed without remediation
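
As a rough sketch of the SLA compliance and mean-time-to-remediate calculations, assuming each finding is a record with hypothetical `severity`, `identified`, and `confirmed` fields (where `confirmed` is the date a retest verified the fix, or `None`):

```python
from collections import defaultdict
from datetime import date, timedelta

SLA_DAYS = {"critical": 30, "high": 60}  # SLA windows from the formula above


def sla_compliance(findings: list[dict], today: date):
    """Percent of critical/high findings past their SLA deadline that met it."""
    due = met = 0
    for f in findings:
        sla = SLA_DAYS.get(f["severity"])
        if sla is None:
            continue  # medium/low findings are outside this KRI
        deadline = f["identified"] + timedelta(days=sla)
        if deadline >= today:
            continue  # SLA clock still running; not yet in the denominator
        due += 1
        confirmed = f["confirmed"]
        if confirmed is not None and confirmed <= deadline:
            met += 1
    return 100.0 * met / due if due else None


def mean_days_to_remediate(findings: list[dict]) -> dict:
    """Average days from identification to confirmed remediation, by severity."""
    days = defaultdict(list)
    for f in findings:
        if f["confirmed"] is not None:
            days[f["severity"]].append((f["confirmed"] - f["identified"]).days)
    return {sev: sum(v) / len(v) for sev, v in days.items()}
```

Using the retest confirmation date rather than the ticket closure date is a deliberate choice in this sketch: it keeps findings that were closed on the word of the remediating team from counting as remediated.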

3. Detection and response coverage against red team TTPs

What to measure. Percentage of techniques used by the red team (mapped to MITRE ATT&CK) that were detected by the blue team during the engagement, segmented into: detected and alerted (true positive), detected but no alert (logged but not alerted), and undetected (blind spot).

Why it matters. Red team operations that measure only whether an objective was achieved (e.g., "reached domain admin") tell you the outcome but not the detection gaps. TTP detection coverage tells you exactly which techniques your SOC can and cannot see, which is actionable for detection engineering in a way that "red team succeeded" is not.

Red team report: ATT&CK technique mapping for all techniques used during the engagement
SIEM/EDR: post-engagement blue team log review showing which techniques produced alerts, which produced logs without alerts, and which produced nothing
Purple team debrief: joint red/blue team review mapping each technique to detection outcome
ATT&CK Navigator: coverage visualization showing detected vs. undetected techniques

How to calculate.

Detected and alerted: (Techniques that triggered an alert investigated by the SOC) ÷ (total techniques used) × 100
Logged but not alerted: (Techniques visible in logs but producing no alert) ÷ (total techniques used) × 100
Blind spots: (Techniques with no evidence in any log source) ÷ (total techniques used) × 100

Status	Criteria
Green	≥60% detected and alerted; blind spots <20%; detection gaps produce new detection rules within 30 days post-engagement
Amber	40–59% detected; or blind spots 20–40%; or detection gap remediation not tracked
Red	<40% detected; or blind spots >40%; or no measurement of blue team detection during red team operations
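
The three-way breakdown can be computed from a simple mapping of ATT&CK technique ID to detection outcome, such as a purple team debrief might produce. The outcome labels and function names below are assumptions for illustration, and treating exact boundary values (for example, 60% alerted) as the better band is a judgment call.

```python
from collections import Counter

OUTCOMES = ("alerted", "logged_only", "blind_spot")  # assumed labels


def detection_breakdown(technique_outcomes: dict[str, str]) -> dict[str, float]:
    """Percent of techniques used that fall into each detection outcome."""
    if not technique_outcomes:
        raise ValueError("no techniques recorded for this engagement")
    counts = Counter(technique_outcomes.values())
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(technique_outcomes)
    return {o: 100.0 * counts[o] / total for o in OUTCOMES}


def status(breakdown: dict[str, float]) -> str:
    """Banding on the numeric criteria only (rule turnaround is not modeled)."""
    if breakdown["alerted"] < 40 or breakdown["blind_spot"] > 40:
        return "red"
    if breakdown["alerted"] >= 60 and breakdown["blind_spot"] < 20:
        return "green"
    return "amber"


engagement = {
    "T1566": "alerted",      # Phishing
    "T1059": "alerted",      # Command and Scripting Interpreter
    "T1003": "logged_only",  # OS Credential Dumping
    "T1021": "blind_spot",   # Remote Services
}
result = detection_breakdown(engagement)
```

Because every technique receives exactly one outcome, the three percentages always sum to 100, which is a useful sanity check when the inputs are assembled by hand.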

4. Purple team exercise cadence

What to measure. Frequency of purple team exercises (collaborative sessions where the red team executes specific techniques while the blue team actively monitors and tunes detection in real time) covering high-priority ATT&CK techniques relevant to the organization's threat model.

Why it matters. Purple team exercises are the highest-efficiency mechanism for improving detection capability. Rather than following the extended timeline of a full red team engagement (execute → debrief → find gaps → fix), a purple team exercise produces detection improvements in the same session. Organizations that run purple team exercises regularly build detection capability faster than those that run only an annual red team engagement.

Purple team exercise records: dates, ATT&CK techniques covered, detection improvements produced
Detection engineering backlog: new rules created from purple team exercise vs. total rules created
ATT&CK technique priority list: highest-priority techniques for your threat model, and purple team coverage against this list

How to calculate.

Cadence: purple team exercises per year
Technique coverage: (ATT&CK priority techniques covered in purple team in last 12 months) ÷ (total priority techniques) × 100
Detection improvement rate: new detection rules created per purple team exercise

Status	Criteria
Green	≥4 purple team exercises per year; >50% of ATT&CK priority techniques covered annually; detection rules created within 1 week post-exercise
Amber	2–3 exercises per year; or technique coverage <30%; or detection improvements delayed >30 days
Red	No purple team program; or red team operations without blue team measurement component
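
A minimal sketch of the three purple team figures, assuming the exercise records have already been filtered to the trailing 12 months and carry hypothetical `techniques` and `new_rules` fields:

```python
def purple_team_metrics(exercises: list[dict], priority_techniques: set[str]) -> dict:
    """Cadence, priority-technique coverage (%), and rules per exercise."""
    covered = set()
    total_rules = 0
    for ex in exercises:
        # Only techniques on the priority list count toward coverage.
        covered |= set(ex["techniques"]) & priority_techniques
        total_rules += ex["new_rules"]
    n = len(exercises)
    coverage = (
        100.0 * len(covered) / len(priority_techniques) if priority_techniques else None
    )
    return {
        "exercises_per_year": n,
        "technique_coverage_pct": coverage,
        "rules_per_exercise": total_rules / n if n else 0.0,
    }


priority = {"T1059", "T1003", "T1021", "T1078"}
exercises = [
    {"techniques": ["T1059", "T1003"], "new_rules": 3},
    {"techniques": ["T1003", "T1047"], "new_rules": 1},  # T1047 is off-list
]
metrics = purple_team_metrics(exercises, priority)
```

Coverage is computed on the union of techniques across exercises, so repeating the same technique in several sessions does not inflate the figure.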

5. External attack surface penetration test coverage

What to measure. Percentage of internet-facing applications, APIs, and infrastructure that have been covered by an external penetration test within the last 12 months, and the rate of critical findings on first engagement versus re-tests (a measure of whether remediations hold).

Why it matters. External penetration testing validates that your perimeter defenses hold against realistic attacker techniques. Coverage rate measures whether the program is systematic or selective. First-engagement versus re-test critical finding rate measures whether remediations are complete and durable, or whether the same issues re-emerge.

Penetration test scope documents: external assets included in each engagement
Asset inventory: internet-facing assets, APIs, and applications, compared against pen test scope coverage
Retest reports: confirmed remediation vs. finding re-emergence in retest

How to calculate.

Coverage: (Internet-facing assets covered in external pen test in last 12 months) ÷ (total in-scope internet-facing assets) × 100
Re-emergence rate: (Critical findings from retests where issue persists despite claimed remediation) ÷ (total retested critical findings) × 100

Status	Criteria
Green	≥90% of internet-facing assets covered annually; re-emergence rate <10%; scope includes authenticated testing and API endpoints
Amber	70–89% coverage; or scope missing API surface; or retests not conducted
Red	<70%; or external pen test limited to basic unauthenticated scan; or no external pen testing in last 24 months
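
Both external testing figures reduce to simple set and count arithmetic. The sketch below assumes assets are identified by hostname and that each retested critical finding is recorded as a boolean (`True` if the issue persisted); both conventions and the function names are illustrative.

```python
def external_coverage(inventory: set[str], tested: set[str]):
    """Percent of in-scope internet-facing assets covered by a pen test."""
    if not inventory:
        return None
    # Assets tested but absent from the inventory do not raise coverage;
    # they point to an inventory gap instead.
    return 100.0 * len(inventory & tested) / len(inventory)


def reemergence_rate(retest_persisted: list[bool]):
    """Percent of retested critical findings where the issue persisted."""
    if not retest_persisted:
        return None
    return 100.0 * sum(retest_persisted) / len(retest_persisted)


inventory = {"app.example.com", "api.example.com", "vpn.example.com", "mail.example.com"}
tested = {"app.example.com", "api.example.com", "vpn.example.com", "old.example.com"}
coverage = external_coverage(inventory, tested)
reemerged = reemergence_rate([True, False, False, False])
```

Returning `None` when there is nothing to measure keeps "no retests conducted" distinct from "0% re-emergence", which matters because the former is an amber condition in the table above.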

6. Social engineering testing coverage

What to measure. Whether the offensive security program includes social engineering simulations beyond email phishing, specifically vishing (phone), physical pretexting, and tailgating, and the resistance rates of high-risk cohorts (help desk, finance, executives) in these scenarios.

Why it matters. Most security awareness programs test email phishing exclusively. The MGM breach started with a vishing call to the help desk. Business email compromise attacks frequently involve phone calls to finance. Physical pretexting enables badge cloning. Testing only email leaves the majority of the social engineering attack surface unmeasured.

Social engineering test records: vishing scripts, call results, physical pretexting reports
Red team engagement reports: social engineering component results
Help desk: call log review for unusual account reset requests post-vishing exercise
Physical security audit: tailgating test results from physical penetration assessments

How to calculate.

Scenario coverage: (Social engineering test scenarios completed in trailing 12 months) ÷ (defined social engineering scenario types in program) × 100
Resistance rates: tracked per scenario type and per high-risk cohort

Status	Criteria
Green	Email, vishing, and physical scenarios tested annually; high-risk cohort (help desk, finance, executives) tested specifically; resistance rates tracked and improving
Amber	Email phishing only; or vishing tested but not high-risk cohorts specifically
Red	No social engineering testing beyond email phishing; or no testing for help desk or finance cohorts
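
The source does not define "resistance rate" precisely. The sketch below assumes it means the share of test attempts in which the target did not comply with the pretext, and it assumes a flat list of attempt records with hypothetical `scenario`, `cohort`, and `resisted` fields.

```python
from collections import defaultdict

# Assumed scenario types, based on those named in "What to measure".
SCENARIO_TYPES = {"email", "vishing", "physical_pretexting", "tailgating"}


def scenario_coverage(results: list[dict]) -> float:
    """Percent of defined scenario types exercised at least once."""
    done = {r["scenario"] for r in results} & SCENARIO_TYPES
    return 100.0 * len(done) / len(SCENARIO_TYPES)


def resistance_rates(results: list[dict]) -> dict:
    """Resistance rate (%) keyed by (scenario, cohort)."""
    tally = defaultdict(lambda: [0, 0])  # [resisted, attempts]
    for r in results:
        key = (r["scenario"], r["cohort"])
        tally[key][0] += int(r["resisted"])
        tally[key][1] += 1
    return {key: 100.0 * res / n for key, (res, n) in tally.items()}


results = [
    {"scenario": "vishing", "cohort": "help_desk", "resisted": True},
    {"scenario": "vishing", "cohort": "help_desk", "resisted": False},
    {"scenario": "email", "cohort": "finance", "resisted": True},
]
coverage = scenario_coverage(results)
rates = resistance_rates(results)
```

Keying the rates by cohort as well as scenario is what surfaces the case the table flags as amber: vishing tested in general, but never against the help desk specifically.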

7. Assumed breach scenario coverage

What to measure. Whether the red team program includes assumed-breach scenarios (engagements starting from an insider position, a compromised endpoint, or a low-privilege cloud credential) that test lateral movement, privilege escalation, and detection capability from within the environment rather than only from the perimeter.

Why it matters. External perimeter tests validate whether an attacker can get in. Assumed-breach tests validate what happens after they do. The most damaging security incidents (ransomware, insider threat, supply chain compromise) involve attackers who are already inside when the real damage begins. Detection and response capability against post-compromise activity is often worse than detection of initial access.

Red team engagement records: engagement type classification (black-box external, gray-box, assumed-breach, insider threat simulation)
Assumed-breach objectives: lateral movement to domain admin, cloud privilege escalation, data exfiltration from assumed position
Detection measurement: blue team detection rate during assumed-breach operations, the most relevant measure of internal detection capability

Status	Criteria
Green	Assumed-breach scenario conducted within 18 months; detection rate measured during operation; cloud and on-premises both in scope
Amber	Assumed-breach conducted >24 months ago; or scope limited to on-premises; or detection not measured
Red	No assumed-breach testing; red team program limited to external penetration perspective only

Deriving these KRIs by source type

From Penetration Test and Red Team Management Platforms (PlexTrac, Dradis)

Finding export API: All findings with severity, status, created date, closed date, retested date → feed SLA compliance calculation
Scope tracking: Asset list per engagement; compare across engagements to calculate coverage trend
Finding lifecycle: Track each finding from creation through remediation to retest confirmation; closure without a retest counts as unvalidated
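
To illustrate the lifecycle check, the sketch below classifies findings from a CSV export. It deliberately avoids any specific platform API: the column names (`id`, `severity`, `closed_date`, `retested_date`) are assumptions and would need to be mapped to whatever your platform actually exports.

```python
import csv
import io


def classify_closures(csv_text: str) -> dict[str, list[str]]:
    """Split finding IDs into open, validated, and unvalidated closures."""
    buckets = {"open": [], "validated": [], "unvalidated": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        closed = (row["closed_date"] or "").strip()
        retested = (row["retested_date"] or "").strip()
        if not closed:
            buckets["open"].append(row["id"])
        elif retested:
            buckets["validated"].append(row["id"])
        else:
            # Closed with no retest on record: closure is unvalidated.
            buckets["unvalidated"].append(row["id"])
    return buckets


SAMPLE_EXPORT = """id,severity,closed_date,retested_date
F-1,critical,2025-01-20,2025-01-25
F-2,high,2025-02-01,
F-3,high,,
"""
buckets = classify_closures(SAMPLE_EXPORT)
```

The unvalidated bucket is the one worth reporting alongside SLA compliance, since those findings would otherwise be counted as remediated.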

From SIEM/EDR (Post-Engagement Analysis)

Technique detection review: After each red team engagement, query the SIEM for evidence of each technique used (alert events, log events, or no events) and categorize by detection outcome
Alert attribution: Tag alerts generated during the red team window with the engagement ID; this enables detection coverage calculation without reconstructing from logs post-engagement
UEBA baseline disruption: During assumed-breach engagements, review UEBA behavioral anomaly scores for the simulated compromised account to measure whether anomalies were surfaced and investigated
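
Alert attribution can be as simple as stamping alerts that fall inside the engagement window. The sketch below works on already-exported alert records with hypothetical `time` and `host` fields; the optional host filter is an added assumption, intended to keep unrelated production alerts that fire during the window from being attributed to the red team.

```python
from datetime import datetime


def tag_alerts(alerts, engagement_id, start, end, red_team_hosts=None):
    """Return copies of alerts with an engagement_id field added.

    Alerts outside the window (or off the host list, when one is
    given) receive engagement_id=None.
    """
    tagged = []
    for alert in alerts:
        in_window = start <= alert["time"] <= end
        host_ok = red_team_hosts is None or alert["host"] in red_team_hosts
        tag = engagement_id if in_window and host_ok else None
        tagged.append({**alert, "engagement_id": tag})
    return tagged


alerts = [
    {"id": "A1", "time": datetime(2025, 3, 5, 10, 0), "host": "ws-17"},
    {"id": "A2", "time": datetime(2025, 3, 20, 9, 0), "host": "ws-17"},
    {"id": "A3", "time": datetime(2025, 3, 6, 14, 0), "host": "db-02"},
]
tagged = tag_alerts(
    alerts,
    engagement_id="RT-2025-01",  # hypothetical ID format
    start=datetime(2025, 3, 1),
    end=datetime(2025, 3, 14, 23, 59),
    red_team_hosts={"ws-17"},
)
```

In practice the tag would more likely be applied at ingestion time in the SIEM itself; the logic is the same either way.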

From ATT&CK Navigator and Threat Modeling

Coverage visualization: Import red team technique usage and detection results into ATT&CK Navigator; the heat map shows coverage gaps
Priority technique list: Filtered by threat actors relevant to your sector; use as purple team exercise planning input
Year-over-year comparison: Compare ATT&CK heat maps from consecutive years; improvement in detected/alerted techniques is the detection engineering ROI metric
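
A sketch of building a Navigator layer from detection outcomes and comparing two years follows. The layer field names (`name`, `versions`, `domain`, `techniques`, `techniqueID`, `color`, `comment`) follow the Navigator layer file format as commonly documented, but the layer version string and the color choices are assumptions; check them against the format specification for the Navigator version you run.

```python
import json

# Assumed color scheme for the three detection outcomes.
COLORS = {"alerted": "#2e7d32", "logged_only": "#f9a825", "blind_spot": "#c62828"}


def build_layer(name: str, outcomes: dict[str, str]) -> dict:
    """Build a Navigator-style layer dict from technique ID -> outcome."""
    return {
        "name": name,
        "versions": {"layer": "4.5"},  # assumption: match your Navigator version
        "domain": "enterprise-attack",
        "techniques": [
            {"techniqueID": tid, "color": COLORS[outcome], "comment": outcome}
            for tid, outcome in sorted(outcomes.items())
        ],
    }


def newly_alerted(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Techniques tested in both years that moved to 'alerted'."""
    return sorted(
        tid
        for tid, outcome in current.items()
        if outcome == "alerted" and tid in previous and previous[tid] != "alerted"
    )


last_year = {"T1059": "logged_only", "T1003": "blind_spot"}
this_year = {"T1059": "alerted", "T1003": "blind_spot", "T1021": "alerted"}
layer_json = json.dumps(build_layer("Red team detection coverage", this_year), indent=2)
improved = newly_alerted(last_year, this_year)
```

Techniques tested only in the current year are excluded from the improvement list on purpose: with no prior result there is no change to measure.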

From External Attack Surface Management (Censys, Shodan, runZero)

Asset discovery before engagement: Run external ASM scan before each penetration test to ensure scope includes all internet-facing assets, not just those the internal team knows about
Post-engagement comparison: Compare new assets discovered by the red team (not in inventory) against ASM discovery, which measures asset inventory completeness
Unauthenticated finding comparison: Compare what the red team found unauthenticated vs. what ASM continuous monitoring surfaced, which measures ASM effectiveness
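
The post-engagement comparison is set arithmetic over three asset lists. This sketch assumes assets are normalized to comparable identifiers (hostnames here) before comparison; the metric names are illustrative rather than standard terms.

```python
def compare_asset_sources(inventory: set[str], asm: set[str], red_team: set[str]) -> dict:
    """Compare the internal inventory against ASM and red team discovery."""
    observed = asm | red_team                 # everything seen from outside
    unknown = red_team - inventory            # red team finds not in inventory
    caught_by_asm = unknown & asm             # of those, what ASM also surfaced
    return {
        # Share of externally observed assets the inventory already knew about.
        "inventory_completeness_pct": (
            100.0 * len(observed & inventory) / len(observed) if observed else None
        ),
        # Share of unknown assets that ASM had surfaced independently.
        "asm_hit_rate_pct": (
            100.0 * len(caught_by_asm) / len(unknown) if unknown else None
        ),
        "missed_by_both": sorted(unknown - asm),
    }


report = compare_asset_sources(
    inventory={"a.example.com", "b.example.com", "c.example.com"},
    asm={"a.example.com", "b.example.com", "c.example.com", "d.example.com"},
    red_team={"a.example.com", "d.example.com", "e.example.com"},
)
```

Assets in `missed_by_both` are the most useful output: they were reachable by an attacker but invisible to both the inventory and continuous monitoring.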

Draxis turns these KRIs into a live signal

Draxis connects to the tools you already run (red team operation records, detection tooling, and remediation trackers) and computes these offensive security KRIs automatically, with the green/amber/red bands, trend lines, and drift alerts described above. No spreadsheets, no manual stitching.
