AI Drop Zone (S3 puller)
Watch one folder per integration. Drop files into a bucket on whatever cadence makes sense for the source tool, and Draxis pulls, extracts, and proposes KRIs on schedule. The first run is a calibration pass: you approve the candidate KRIs before anything goes live.
What this is for
The AI Drop Zone (S3 puller) is the scheduled, multi-folder counterpart to the manual AI Drop Zone. Use it when:
- You have a tool without a Draxis connector that can export raw data on a schedule (a script, a SOAR playbook, a cron job).
- You can publish that data to a known location: an S3 bucket, R2, MinIO, etc.
- You want the resulting KRIs to refresh on schedule without a human pasting files into a UI.
Each integration row points at one bucket + prefix. Treat each prefix as one logical data source: one folder per tool or per data domain. That keeps the calibration profile tight and the extracted KRIs clean.
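On the publishing side, the script can be as small as the sketch below, run from cron or a SOAR playbook. This is a minimal sketch, not a required shape: the bucket, prefix, and `export_detections()` stub are hypothetical stand-ins for your own tool's export; only the boto3 call is real.

```python
"""Minimal scheduled publisher: write one export per tick into the watched prefix."""
import json
from datetime import datetime, timezone

import boto3

def export_detections() -> list[dict]:
    # Stand-in for your tool's real export.
    return [{"rule": "example-rule", "count": 3}]

def publish() -> None:
    s3 = boto3.client("s3")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    s3.put_object(
        Bucket="my-bucket",
        Key=f"my/prefix/{stamp}-detections.json",  # one prefix per tool
        Body=json.dumps(export_detections()).encode("utf-8"),
    )

if __name__ == "__main__":
    publish()
```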
At a glance
| Vendor | Draxis (first-party). The puller runs inside your Draxis instance and reaches out to the object store you configure. |
|---|---|
| Source type | `ai_extractor` (a.k.a. the generic dropdown in Settings → Integrations) |
| Vendor ID (slug) | `dropzone-s3` |
| Object stores supported | AWS S3 natively. Any S3-compatible endpoint works via the endpoint override, tested against Cloudflare R2, MinIO (with path-style addressing), and Wasabi. |
| Auth | Static access keys or AWS STS AssumeRole with an external ID (recommended, short-lived credentials). |
| Schedule | Standard hourly / daily / weekly options. Each tick lists new objects under the prefix and processes anything not seen before. |
| Per-run cap | 50 new objects per scheduled run. Bigger backlogs drain across successive ticks. |
| Per-file cap | 5 MB per object. Larger files are skipped (and recorded as processed so we don’t keep re-listing them). |
| Idempotency | Keyed on `(integration_id, object_key, etag)`. Re-uploading a file with the same name but new content (different etag) re-processes; an unchanged file does not. |
| What it produces | KRI value rows in your Risk Register, scoped to whichever canonical catalog signals the AI matched in the file. |
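The schedule, cap, and idempotency rows above combine into a simple listing loop. A minimal sketch, with illustrative names (`SEEN`, the module-level caps); the real implementation persists its seen-set in `dropzone_processed_files` rather than in memory:

```python
import boto3

MAX_OBJECTS_PER_RUN = 50         # per-run cap
MAX_BYTES = 5 * 1024 * 1024      # per-file cap
SEEN: set[tuple[str, str, str]] = set()  # (integration_id, object_key, etag)

def list_new_objects(integration_id: str, bucket: str, prefix: str) -> list[str]:
    s3 = boto3.client("s3")
    new: list[str] = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = (integration_id, obj["Key"], obj["ETag"])
            if key in SEEN:
                continue              # same name, same etag: already processed
            SEEN.add(key)             # oversize/empty files are recorded, not retried
            if obj["Size"] == 0 or obj["Size"] > MAX_BYTES:
                continue
            new.append(obj["Key"])
            if len(new) >= MAX_OBJECTS_PER_RUN:
                return new            # the rest of the backlog drains on later ticks
    return new
```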
How a folder becomes a virtual integration
- You create the integration row in Settings → Integrations. Pick AI Drop Zone (S3 / object-store puller), enter bucket + region + prefix, and choose an auth mode.
- The first scheduled (or manual) run pulls one sample file. Auto-accept is disabled for this run, so every signal the model proposes lands in the calibration panel underneath the integration card.
- You review the candidate KRIs with their confidence scores and source-span evidence. Approve the ones that look right; uncheck anything wrong; optionally refine the domain hint.
- Save & calibrate. The folder flips to calibrated. The approved signals are now the bias for future runs: the extractor stays tightly scoped to that set, and high-confidence proposals (≥ 0.85) auto-accept on every subsequent tick.
One folder, one virtual integration, one calibration profile. Spin up as many as you need; each one is independent.
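In pseudocode, the post-calibration triage amounts to something like this. An illustrative sketch, not the actual code; the names are made up, but the 0.85 threshold and the queue-for-review fallback are the behaviour described above:

```python
from dataclasses import dataclass

AUTO_ACCEPT_THRESHOLD = 0.85

@dataclass
class Proposal:
    signal_id: str
    confidence: float

def triage(p: Proposal, approved: set[str], calibrated: bool) -> str:
    if not calibrated:
        return "calibration_queue"  # first run: everything waits for human review
    if p.signal_id in approved and p.confidence >= AUTO_ACCEPT_THRESHOLD:
        return "auto_accept"        # high confidence on an approved signal
    return "review_queue"           # everything else queues for review
```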
Wire it into Draxis
- Open Settings → Integrations, click Add integration, pick the AI Drop Zone (paste / upload / webhook) source type.
- From the vendor dropdown, choose AI Drop Zone (S3 / object-store puller).
- Fill in the S3 source fields: bucket, region, prefix, file glob (default `*`), and the post-process action (leave, delete, or move to a processed prefix).
- Pick an auth mode: static keys for the simplest setup, AssumeRole for production. See the IAM snippets below.
- Optional: set a one-line domain hint (e.g. “EDR detection exports from CrowdStrike Falcon”) to bias the extractor on the first run.
- Save. The integration starts in `calibration_status='pending'`. Click Run Now on the integration card to pull a sample, or wait for the first scheduled tick.
- Review and finalize the candidate KRIs in the calibration panel below the integration card.
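Put together, a configured integration has roughly this shape. This is an illustrative summary of the form fields above, not an API payload; the field names are paraphrased:

```json
{
  "source_type": "ai_extractor",
  "vendor": "dropzone-s3",
  "bucket": "my-bucket",
  "region": "us-east-1",
  "prefix": "my/prefix/",
  "file_glob": "*.json",
  "post_process": "move",
  "auth_mode": "assume_role",
  "domain_hint": "EDR detection exports from CrowdStrike Falcon"
}
```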
Least-privilege IAM: static keys
If you go with static access keys, attach this minimal policy to the IAM user. Replace `my-bucket` and `my/prefix/` with your own values.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DraxisDropzoneList",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket",
      "Condition": { "StringLike": { "s3:prefix": ["my/prefix/*"] } }
    },
    {
      "Sid": "DraxisDropzoneRead",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/my/prefix/*"
    }
  ]
}
```
If you set post-process: delete, also grant `s3:DeleteObject` on the same prefix. If you set post-process: move, grant `s3:DeleteObject` on the source prefix and `s3:PutObject` on the destination prefix.
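For the move case, that means two extra statements alongside the two above. `processed/` is an assumed destination prefix here; substitute your own:

```json
{
  "Sid": "DraxisDropzoneMoveDelete",
  "Effect": "Allow",
  "Action": "s3:DeleteObject",
  "Resource": "arn:aws:s3:::my-bucket/my/prefix/*"
},
{
  "Sid": "DraxisDropzoneMoveWrite",
  "Effect": "Allow",
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:::my-bucket/processed/*"
}
```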
Least-privilege IAM: AssumeRole (recommended)
The Draxis-side principal calls sts:AssumeRole against a role you own; the role is what holds the read permissions on your bucket. Trust policy on the role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<DRAXIS-AWS-ACCOUNT>:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "draxis-<your-tenant-slug>" }
      }
    }
  ]
}
```
The permissions policy attached to the role is the same as the static-keys version above (List + Get on the prefix; Delete / Put if you use post-process actions).
The external ID prevents the "confused deputy" class of mistakes: another Draxis tenant can't accidentally (or maliciously) assume a role that wasn't scoped to them. Use a value the customer chooses, and don't reuse the same ID across tenants.
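For orientation, what the Draxis side does on each tick amounts to roughly the following. The role ARN is a placeholder; the boto3 calls themselves are standard:

```python
import boto3

def scoped_s3_client(role_arn: str, external_id: str):
    """Assume the customer's role, then build an S3 client from the temporary credentials."""
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=role_arn,                   # e.g. arn:aws:iam::<YOUR-ACCOUNT>:role/draxis-dropzone
        RoleSessionName="draxis-dropzone-pull",
        ExternalId=external_id,             # must match the trust policy exactly
    )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```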
Cloudflare R2 / MinIO / Wasabi
The puller speaks plain S3. To target a non-AWS endpoint:
- Set S3-compatible endpoint to the provider's URL (e.g. `https://<account>.r2.cloudflarestorage.com`).
- For MinIO and some self-hosted gateways, also tick Force path-style addressing.
- Use static keys (these providers don't support AWS STS).
- The IAM-policy syntax above is AWS-specific; on R2 / MinIO use the provider's equivalent (R2 API tokens, MinIO bucket policies). Keep the surface area to list + get on the prefix you point at, the same shape as the AWS version.
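In boto3 terms, the endpoint override plus the path-style toggle correspond to a client configured like this (the endpoint URL and keys are placeholders):

```python
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://<account>.r2.cloudflarestorage.com",  # provider's URL
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    # MinIO and some self-hosted gateways need path-style addressing:
    config=Config(s3={"addressing_style": "path"}),
)
```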
Calibration: the why
The reason for a first-run calibration step (instead of immediately auto-accepting on confidence) is that S3-pulled artifacts come from your tools, in your formats. The catalog signal allow-list is broad by default (P1 + P2 signals). On the very first file we've ever seen from a folder, a 0.91 confidence score on a signal we've never validated as a fit for that folder is not the same kind of evidence as the same score on a long-running connector with established patterns.
So calibration is a one-time human-in-the-loop step: show me what you found in this folder, I'll tell you which of those signals are real, and from now on this folder is biased toward exactly those. After that, the puller behaves like every other connector: high-confidence proposals auto-accept on schedule, and the rest queue for review.
Operational notes
- Per-run cap is intentional. 50 objects/run keeps a single tick from consuming an outsize share of LLM budget. A backlog of thousands drains over successive ticks.
- Empty files are skipped. Zero-byte objects are recorded as processed (no etag thrash) but no extraction is attempted.
- Unparseable binaries are tolerated. The extractor does best-effort UTF-8 decoding; if the result is garbage the LLM returns “no proposals” and the file is recorded as processed without writing any KRIs. No crash, no retry storm.
- Failures are non-fatal per file. One file failing extraction (network blip, AWS 5xx, Anthropic timeout) is logged and counted as `extract_failures` in the run summary; the rest of the batch continues.
- The pulled file lives in `extractor_artifact` with `source='webhook'` and the S3 object key as the filename. SHA-256 dedupe still applies: if you re-upload a byte-identical file under a different name, only the first one extracts.
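The content-level dedupe in that last note is a hash over the file bytes, independent of the object key. A one-function sketch:

```python
import hashlib

def content_fingerprint(body: bytes) -> str:
    # Byte-identical files fingerprint the same under any object key,
    # so only the first copy is extracted.
    return hashlib.sha256(body).hexdigest()
```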
Troubleshooting
- Test connection fails with "NoSuchBucket": check the region. The S3 client uses the region you set, not the bucket's actual region, so a region mismatch presents as a missing bucket. (The snippet after this list shows how to confirm a bucket's real region.)
- Test passes but runs always say "listed: 0": the prefix is wrong, the file glob is too narrow, or every file has already been processed (check `dropzone_processed_files` for that integration).
- Calibration panel says "no calibration sample to finalize": the integration hasn't pulled a file yet. Click Pull a sample now in the calibration panel, or hit Run Now on the integration card.
- Same files keep getting re-pulled: etags are changing on each upload. If the writer rewrites the same logical content with a new timestamp on every run, either deduplicate upstream or accept that each upload is a new artifact.
- AssumeRole fails with "AccessDenied": the trust policy on the role isn't accepting the Draxis principal, or the external ID doesn't match. Check the role's trust policy and confirm that the external ID in the integration form matches exactly.
- Still stuck? Open a support ticket with the integration ID, the bucket, and the run summary from the integration’s history.
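For the NoSuchBucket case above, one call confirms a bucket's actual region (`my-bucket` is the placeholder from earlier):

```python
import boto3

# get_bucket_location returns None in LocationConstraint for us-east-1,
# and the region name for everything else.
location = boto3.client("s3").get_bucket_location(Bucket="my-bucket")
print(location["LocationConstraint"] or "us-east-1")
```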