How Draxis is built
A stateless Node.js + TypeScript application on Cloud Run, a single Cloud SQL for PostgreSQL cluster with row-level security for tenant isolation, and scheduled Cloud Run jobs for the integration runner and projection service. Horizontally scalable, with zero-downtime deploys; single-region today, with Terraform modules parameterised for multi-region.
Top-level
- Application tier: Node.js + Express + TypeScript, containerized, deployed to Cloud Run as a stateless revision. Horizontally scaled; rolling deploys give true zero downtime.
- Client: React + Vite + TypeScript + Tailwind. Compiled into dist/, served by the same container, and cached at the edge by Cloud CDN.
- Storage: Cloud SQL for PostgreSQL (single regional primary with a read replica). A single cluster holds every tenant; row-level security enforces tenant isolation at the database layer. See Multi-tenancy.
- Integration runner: a Cloud Run job triggered by Cloud Scheduler. Runs independently of the web tier, so long connector runs never block user-facing requests. See KRIs & sources.
- Projection service: a second Cloud Run job on its own schedule. Computes carrier rollups and other derived state.
- Edge: Global HTTPS Load Balancer with Cloud Armor (OWASP CRS WAF + adaptive protection + per-IP rate limits) and a Google-managed TLS cert.
- Secrets: GCP Secret Manager. Fetched on cold start, never persisted to disk, never in images or env files.
- Observability: Cloud Logging + Cloud Monitoring + Ops Agent + OpenTelemetry. Every LLM call is traced through Langfuse with tenant-scoped trace IDs.
Topology
A single region today (us-central1), with the Terraform modules parameterised on region so additional regions can be stood up without module edits. Global resources — Load Balancer, Cloud Armor, Cloud CDN, Artifact Registry — are shared across any regional deployments.
Internet
│
▼
Cloud Armor (WAF + DDoS + rate limits)
│
▼
Global HTTPS Load Balancer ──► Cloud CDN (static assets)
│
▼
┌─────────────────────────────────────┐
│ Cloud Run revision (app) │
│ - autoscaled, stateless, N pods │
│ - sets tenant session variable │
│ from JWT on every request │
└──────────────┬──────────────────────┘
│ (VPC connector, private)
▼
┌─────────────────────────────────────┐
│ Cloud SQL for PostgreSQL │
│ - primary + read replica │
│ - row-level security per tenant │
│ - CMEK, PITR, automated backups │
└─────────────────────────────────────┘
▲
│ (same VPC connector)
┌─────────────────────────────────────┐
│ Cloud Scheduler → Cloud Run jobs │
│ - integration runner (per-source) │
│ - projection / rollups │
└─────────────────────────────────────┘
Request lifecycle
- Browser hits the global anycast IP; Cloud Armor applies the security policy.
- Load Balancer routes to a healthy Cloud Run revision instance.
- authenticateJWT middleware verifies signature + expiry.
- enforceMfaVerified requires an mfa_verified claim for anything except MFA-completion endpoints.
- enforceTenantAccess resolves the tenant from the JWT (or the path for platform / org admin routes).
- The request’s first DB statement is SET LOCAL draxis.tenant_id = '<uuid>', pulled from the JWT claim. Postgres RLS policies reference this setting.
- requirePermission(...) gates destructive operations on RBAC.
- Handler runs SQL. RLS guarantees rows from other tenants are invisible, regardless of the query shape.
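The tenant-scoping step can be sketched in TypeScript. This is a minimal sketch, not the actual Draxis middleware: function names are illustrative, and because SET LOCAL cannot take bind parameters, the sketch uses the equivalent set_config(..., true), which is likewise transaction-scoped.

```typescript
// Minimal sketch of per-request tenant scoping; names are illustrative,
// not the actual Draxis middleware.
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Validate the JWT claim before it gets anywhere near SQL.
export function tenantIdFromClaims(claims: { tenant_id?: string }): string {
  const id = claims.tenant_id;
  if (!id || !UUID_RE.test(id)) throw new Error("invalid tenant_id claim");
  return id;
}

// set_config(..., true) is the parameterizable equivalent of SET LOCAL:
// the `true` makes the setting transaction-scoped, so it cannot leak
// into another tenant's request on a pooled connection.
export function tenantScopeQuery(claims: { tenant_id?: string }) {
  return {
    text: "SELECT set_config('draxis.tenant_id', $1, true)",
    values: [tenantIdFromClaims(claims)],
  };
}
```

Validating the claim as a UUID before it reaches SQL gives defense in depth on top of parameterization.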
The integration runner
Connectors live under server/integrations/connectors/. The registry (server/integrations/registry.ts) imports every connector at module load and keys them by vendorId. The dispatcher (server/integrations/dispatcher.ts) resolves each saved kri_source row to its connector.
Connectors expose two methods:
- test(ctx) — probe connectivity; used by the UI “Test connection” button.
- run(ctx) — fetch, compute KRI values, and upsert. Throws on unexpected schemas; the dispatcher captures the error and records it on the run row.
In production the runner is a Cloud Run job triggered by Cloud Scheduler. Each scheduled invocation queries for due kri_source rows and dispatches them in parallel (bounded concurrency). The job is stateless and horizontally scalable; multiple instances coordinate via row-level locking on the source being run. Manual “Run now” requests from the UI are dispatched as an ad-hoc Cloud Run job execution rather than in-process, so the web tier never blocks on a slow connector.
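The claim-and-dispatch loop described above can be sketched with a generic bounded-concurrency helper plus a SKIP LOCKED claim query. The helper is real, runnable TypeScript; the table and column names in the query are assumptions.

```typescript
// Generic bounded-concurrency map: at most `limit` tasks in flight at once.
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS workers only yield at await
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async () => {
      while (next < items.length) {
        const i = next++;
        results[i] = await fn(items[i]);
      }
    },
  );
  await Promise.all(workers);
  return results;
}

// Pairs with the helper above: FOR UPDATE SKIP LOCKED lets concurrent job
// instances each claim a disjoint set of due rows instead of blocking.
// Table/column names here are illustrative.
export const CLAIM_DUE_SOURCES = `
  SELECT id FROM kri_source
  WHERE next_run_at <= now()
  FOR UPDATE SKIP LOCKED
  LIMIT 50`;
```

SKIP LOCKED is what makes the job horizontally scalable: a second instance never waits on rows the first has already claimed.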
Expert panel orchestration
The panel-orchestrator (server/panelOrchestrator.ts) is a state machine that:
- Receives a user message targeted at a persona or the panel as a whole.
- Decides which specialists to invoke (the moderator persona makes this call).
- Issues one Anthropic API call per specialist, with a system prompt built from the tenant’s institutional knowledge and current KRIs / risks.
- Runs a synthesis step that composes a single reply (server/synthesis.ts).
- Writes every message and the synthesis rationale to the panel-session audit trail.
Panel sessions are pure request/response — the orchestrator does not hold in-memory state across requests, so it scales horizontally with the rest of the application tier.
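Two of the steps above, selecting specialists and building each one's system prompt, can be sketched as pure functions. All names and shapes here are illustrative, not the actual panelOrchestrator.ts API.

```typescript
// Illustrative sketches of two orchestration steps; not the real API.
interface PanelContext {
  institutionalKnowledge: string[];
  kris: { name: string; value: number }[];
}

// Keep only personas that actually exist on the panel roster.
export function selectSpecialists(
  moderatorChoice: string[],
  roster: string[],
): string[] {
  return moderatorChoice.filter((p) => roster.includes(p));
}

// One system prompt per specialist, composed from tenant context.
export function buildSpecialistPrompt(
  persona: string,
  ctx: PanelContext,
): string {
  const knowledge = ctx.institutionalKnowledge.map((k) => `- ${k}`).join("\n");
  const kris = ctx.kris.map((k) => `- ${k.name}: ${k.value}`).join("\n");
  return [
    `You are the ${persona} specialist on a risk panel.`,
    `Institutional knowledge:\n${knowledge}`,
    `Current KRIs:\n${kris}`,
  ].join("\n\n");
}
```

Because both steps are pure functions of the request, nothing needs to survive between turns, which is what keeps the orchestrator stateless.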
Why one Postgres cluster with row-level security
- Isolation enforced at the database layer. Even a bug in application SQL cannot leak data across tenants — Postgres refuses to return rows where tenant_id ≠ current_setting('draxis.tenant_id').
- Horizontal scaling. The app tier is stateless; Cloud Run can run any number of replicas, each serving any tenant.
- Cross-tenant rollups are native SQL. Org admins and platform admins run one query over the cluster, rather than opening N database files.
- One thing to back up. Point-in-time recovery, snapshots, and log shipping are single-cluster concerns.
- No co-tenancy trap. Adding or removing a tenant is an INSERT / soft-delete, not a filesystem operation.
Tenant identity (the organizations, tenants, and users tables) lives in the same cluster under a platform schema with stricter RLS — only superadmin sessions can read across it.
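The RLS policies referenced above can be sketched as a pair of Postgres statements. This is an illustrative policy under assumed names (the risks table and the policy name are not the real schema), not the actual Draxis migration.

```sql
-- Illustrative RLS setup; table and policy names are assumptions.
ALTER TABLE risks ENABLE ROW LEVEL SECURITY;
ALTER TABLE risks FORCE ROW LEVEL SECURITY;  -- applies even to the table owner

-- Rows are visible only when tenant_id matches the session setting
-- established at the start of each request's transaction.
CREATE POLICY tenant_isolation ON risks
  USING (tenant_id = current_setting('draxis.tenant_id')::uuid);
```

FORCE ROW LEVEL SECURITY is the detail that closes the common gap: without it, the policy is silently skipped for the role that owns the table.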
Deploys
- GitHub Actions authenticates to GCP via Workload Identity Federation (no long-lived service account keys).
- Pipeline: lint + typecheck → build image → sign with cosign → generate SBOM (Syft) → scan (Artifact Analysis + Trivy) → push to Artifact Registry.
- Schema migrations run as a pre-deploy Cloud Run job against Cloud SQL. Migrations are backwards-compatible so rollback stays safe.
- Deploy is a gcloud run deploy that creates a new revision, gradually shifts traffic, and drains the old revision. Zero downtime — two revisions serve simultaneously during the cutover.
- Images are pinned by digest, not by tag. Binary Authorization evaluates cosign attestations at deploy time (warn-only initially, enforced after a clean steady-state window).
- Rollback is a one-command traffic shift to the previous revision: gcloud run services update-traffic … --to-revisions=<prev>=100.
Backups & recovery
- Cloud SQL automated backups + point-in-time recovery with transaction-log retention. RPO ≤ a few minutes; RTO target ≤ 30 minutes for a full restore.
- Audit logs are exported to a separate draxis-security project with 400-day retention on a write-once bucket (tamper-evidence).
- Restore drills run quarterly in staging. A failed drill blocks the next production deploy.
Observability
- App logs are structured JSON to stdout → Cloud Logging. Log-based metrics drive error-rate and slow-request alerts.
- Ops Agent (for the Cloud Run jobs) and Cloud Run’s built-in metrics feed Cloud Monitoring: CPU / memory, request p50 / p95 / p99, cold-start count, Anthropic API latency.
- External uptime checks hit /health from five global locations every 60s.
- SLOs: 99.9% availability, p95 /api/* latency < 500ms (excluding long-polling LLM endpoints), and 30-day rolling error rate < 1%.
- Paging: pager-grade alerts email security@draxis.ai and post to an internal Google Chat webhook.
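The structured-logging convention above can be sketched as a tiny formatter: one JSON object per line on stdout, which Cloud Logging parses into jsonPayload. The severity field is the one Cloud Logging recognizes for log level; the other field names are our own conventions, shown here illustratively.

```typescript
// Illustrative structured logger: one JSON object per stdout line.
// `severity` is the field Cloud Logging maps to the entry's log level;
// everything else lands in jsonPayload as-is.
export function logEntry(
  severity: "INFO" | "WARNING" | "ERROR",
  message: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    severity,
    message,
    timestamp: new Date().toISOString(),
    ...fields, // e.g. tenant_id, trace id, request path
  });
}
```

Keeping fields flat and consistently named is what makes the log-based metrics (error rate, slow requests) cheap to define later.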