Top-level

  • Application tier: Node.js + Express + TypeScript, containerized, deployed to Cloud Run as a stateless revision. Horizontally scaled; rolling deploys are zero-downtime.
  • Client: React + Vite + TypeScript + Tailwind. Compiled into dist/, served by the same container, and cached at the edge by Cloud CDN.
  • Storage: Cloud SQL for PostgreSQL (single regional primary with a read replica). A single cluster holds every tenant; row-level security enforces tenant isolation at the database layer. See Multi-tenancy.
  • Integration runner: a Cloud Run job triggered by Cloud Scheduler. Runs independently of the web tier, so long connector runs never block user-facing requests. See KRIs & sources.
  • Projection service: a second Cloud Run job on its own schedule. Computes carrier rollups and other derived state.
  • Edge: Global HTTPS Load Balancer with Cloud Armor (OWASP CRS WAF + adaptive protection + per-IP rate limits) and a Google-managed TLS cert.
  • Secrets: GCP Secret Manager. Fetched on cold start, never persisted to disk, never in images or env files.
  • Observability: Cloud Logging + Cloud Monitoring + Ops Agent + OpenTelemetry. Every LLM call is traced through Langfuse with tenant-scoped trace IDs.

Topology

A single region today (us-central1), with the Terraform modules parameterised on region so additional regions can be stood up without module edits. Global resources — Load Balancer, Cloud Armor, Cloud CDN, Artifact Registry — are shared across any regional deployments.

        Internet
           │
           ▼
   Cloud Armor (WAF + DDoS + rate limits)
           │
           ▼
   Global HTTPS Load Balancer  ──►  Cloud CDN (static assets)
           │
           ▼
   ┌─────────────────────────────────────┐
   │  Cloud Run revision (app)           │
    │  - autoscaled, stateless instances  │
   │  - sets tenant session variable     │
   │    from JWT on every request        │
   └──────────────┬──────────────────────┘
                  │ (VPC connector, private)
                  ▼
   ┌─────────────────────────────────────┐
   │  Cloud SQL for PostgreSQL           │
   │  - primary + read replica           │
   │  - row-level security per tenant    │
   │  - CMEK, PITR, automated backups    │
   └─────────────────────────────────────┘
                  ▲
                  │ (same VPC connector)
   ┌─────────────────────────────────────┐
   │  Cloud Scheduler → Cloud Run jobs   │
   │  - integration runner (per-source)  │
   │  - projection / rollups             │
   └─────────────────────────────────────┘

Request lifecycle

  1. Browser hits the global anycast IP; Cloud Armor applies the security policy.
  2. Load Balancer routes to a healthy Cloud Run revision instance.
  3. authenticateJWT middleware verifies signature + expiry.
  4. enforceMfaVerified requires an mfa_verified claim for anything except MFA-completion endpoints.
  5. enforceTenantAccess resolves the tenant from the JWT (or the path for platform / org admin routes).
  6. The request’s first DB statement is SET LOCAL draxis.tenant_id = '<uuid>', pulled from the JWT claim. It runs inside the request’s transaction — SET LOCAL is scoped to the enclosing transaction, so the setting cannot leak across pooled connections. Postgres RLS policies reference this setting.
  7. requirePermission(...) gates destructive operations on RBAC.
  8. Handler runs SQL. RLS guarantees rows from other tenants are invisible, regardless of the query shape.
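
The middleware chain in steps 3–7 can be sketched as a composable pipeline. A minimal sketch: the middleware names come from the lifecycle above, while the Req/Claims shapes and the compose() helper are illustrative assumptions, not the real implementation.

```typescript
// Minimal sketch of the middleware chain in steps 3-7. The middleware names
// (authenticateJWT, enforceMfaVerified, enforceTenantAccess) come from the
// lifecycle above; the Req/Claims shapes and compose() helper are assumed.
type Claims = { sub: string; exp: number; mfa_verified?: boolean; tenant_id?: string };
type Req = { path: string; claims?: Claims; tenantId?: string };
type Middleware = (req: Req) => void; // throws to reject the request

const authenticateJWT: Middleware = (req) => {
  if (!req.claims || req.claims.exp * 1000 < Date.now()) {
    throw new Error("401: missing or expired token");
  }
};

const enforceMfaVerified: Middleware = (req) => {
  if (req.path.startsWith("/api/mfa/")) return; // MFA-completion endpoints exempt
  if (!req.claims?.mfa_verified) throw new Error("401: MFA not verified");
};

const enforceTenantAccess: Middleware = (req) => {
  // platform / org admin routes would resolve the tenant from the path instead
  const tenantId = req.claims?.tenant_id;
  if (!tenantId) throw new Error("403: no tenant in token");
  req.tenantId = tenantId; // later used for SET LOCAL draxis.tenant_id
};

const compose = (...mws: Middleware[]): Middleware => (req) => mws.forEach((mw) => mw(req));
const pipeline = compose(authenticateJWT, enforceMfaVerified, enforceTenantAccess);
```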

The integration runner

Connectors live under server/integrations/connectors/. The registry (server/integrations/registry.ts) imports every connector at module load and keys them by vendorId. The dispatcher (server/integrations/dispatcher.ts) resolves each saved kri_source row to its connector.

Connectors expose two methods:

  • test(ctx) — probe connectivity; used by the UI “Test connection” button.
  • run(ctx) — fetch, compute KRI values, and upsert. Throws on unexpected schemas; the dispatcher captures the error and records it on the run row.
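
The contract could be typed roughly as below. Only test(ctx), run(ctx), and the vendorId registry key come from the text; the ctx and result shapes are assumptions for illustration.

```typescript
// Hypothetical typing of the connector contract described above. Only
// test(ctx), run(ctx), and the vendorId registry key come from the text;
// the ctx and result shapes are assumptions for illustration.
interface ConnectorCtx {
  sourceId: string; // the kri_source row being run
  tenantId: string;
  credentials: Record<string, string>;
}
interface ConnectorResult {
  kriValues: { kriId: string; value: number; asOf: string }[];
}
interface Connector {
  vendorId: string;                                 // registry key
  test(ctx: ConnectorCtx): Promise<void>;           // throws if the probe fails
  run(ctx: ConnectorCtx): Promise<ConnectorResult>; // throws on unexpected schemas
}

// registry.ts keys every imported connector by vendorId at module load:
const registry = new Map<string, Connector>();
const register = (c: Connector) => registry.set(c.vendorId, c);
```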

In production the runner is a Cloud Run job triggered by Cloud Scheduler. Each scheduled invocation queries for due kri_source rows and dispatches them in parallel (bounded concurrency). The job is stateless and horizontally scalable; multiple instances coordinate via row-level locking on the source being run. Manual “Run now” requests from the UI are dispatched as an ad-hoc Cloud Run job execution rather than in-process, so the web tier never blocks on a slow connector.
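
The fan-out can be sketched as a bounded worker pool. In production the coordination point is the row-level lock on the kri_source row (a SELECT ... FOR UPDATE SKIP LOCKED pattern would be typical, though that detail is an assumption here); this in-process helper only shows the shape of the dispatch loop and the capture-the-error-per-run behaviour.

```typescript
// Shape of the runner's bounded-concurrency dispatch. Coordination across
// job instances happens via a row lock on the kri_source row in production;
// this in-process worker pool only illustrates the fan-out and the
// error-capture-per-run behaviour described above.
async function dispatchAll<T>(
  sources: T[],
  run: (source: T) => Promise<void>,
  limit: number,
): Promise<void> {
  const queue = [...sources];
  const workers = Array.from({ length: Math.min(limit, queue.length) }, async () => {
    let source: T | undefined;
    while ((source = queue.shift()) !== undefined) {
      try {
        await run(source);
      } catch (err) {
        // the real dispatcher records the error on the run row; we just log
        console.error("connector run failed:", err);
      }
    }
  });
  await Promise.all(workers);
}
```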

Expert panel orchestration

The panel-orchestrator (server/panelOrchestrator.ts) is a state machine that:

  1. Receives a user message targeted at a persona or the panel as a whole.
  2. Decides which specialists to invoke (the moderator persona makes this call).
  3. Issues one Anthropic API call per specialist, with a system prompt built from the tenant’s institutional knowledge and current KRIs / risks.
  4. Runs a synthesis step that composes a single reply (server/synthesis.ts).
  5. Writes every message and the synthesis rationale to the panel-session audit trail.

Panel sessions are pure request/response — the orchestrator does not hold in-memory state across requests, so it scales horizontally with the rest of the application tier.
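
One panel turn can be sketched as a pure function over the five steps above. The model calls are injected so the flow is runnable without the Anthropic SDK; every name here is illustrative, not the real panelOrchestrator.ts API.

```typescript
// Stateless sketch of one panel turn (steps 1-5 above). Model calls are
// injected so the flow is runnable without the Anthropic SDK; the names
// here are illustrative, not the real panelOrchestrator.ts API.
type Specialist = { persona: string; systemPrompt: string };
type AuditEntry = { role: string; content: string };

async function runPanelTurn(
  userMessage: string,
  pickSpecialists: (msg: string) => Promise<Specialist[]>,     // moderator's call
  callModel: (system: string, msg: string) => Promise<string>, // one call per specialist
  synthesize: (replies: string[]) => Promise<string>,          // synthesis step
): Promise<{ reply: string; audit: AuditEntry[] }> {
  const audit: AuditEntry[] = [{ role: "user", content: userMessage }];
  const specialists = await pickSpecialists(userMessage);
  const replies: string[] = [];
  for (const s of specialists) {
    const r = await callModel(s.systemPrompt, userMessage);
    audit.push({ role: s.persona, content: r });
    replies.push(r);
  }
  const reply = await synthesize(replies);
  audit.push({ role: "synthesis", content: reply });
  return { reply, audit }; // the orchestrator persists audit to the session trail
}
```

Because the whole turn is derived from the inputs, any instance can serve any turn, which is what makes the horizontal scaling claim above hold.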

Why one Postgres cluster with row-level security

  • Isolation enforced at the database layer. Even a bug in application SQL cannot leak data across tenants — Postgres refuses to return rows where tenant_id ≠ current_setting('draxis.tenant_id').
  • Horizontal scaling. The app tier is stateless; Cloud Run can run any number of replicas, each serving any tenant.
  • Cross-tenant rollups are native SQL. Org admins and platform admins run one query over the cluster, rather than fanning out across N per-tenant databases.
  • One thing to back up. Point-in-time recovery, snapshots, and log shipping are single-cluster concerns.
  • No co-tenancy trap. Adding or removing a tenant is an INSERT / soft-delete, not a filesystem operation.

Tenant identity (the organizations, tenants, and users tables) lives in the same cluster under a platform schema with stricter RLS — only superadmin sessions can read across it.
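
A sketch of the kind of per-table policy this design implies, expressed as a migration string. Only the tenant_id column and the draxis.tenant_id setting come from the text; the policy name and the FORCE option are assumptions about the real migrations.

```typescript
// Sketch of the RLS policy implied above, as a migration string. Only the
// tenant_id column and the draxis.tenant_id setting come from the text; the
// policy name and FORCE option are assumptions about the real migrations.
const tenantIsolationSql = (table: string): string => `
  ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY;
  ALTER TABLE ${table} FORCE ROW LEVEL SECURITY;  -- applies even to the table owner
  CREATE POLICY tenant_isolation ON ${table}
    USING (tenant_id = current_setting('draxis.tenant_id')::uuid);
`;
```

With a policy like this in place, the per-request SET LOCAL from the lifecycle section is the only thing a handler has to get right; every subsequent statement in that transaction is filtered automatically.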

Deploys

  • GitHub Actions authenticates to GCP via Workload Identity Federation (no long-lived service account keys).
  • Pipeline: lint + typecheck → build image → sign with cosign → generate SBOM (Syft) → scan (Artifact Analysis + Trivy) → push to Artifact Registry.
  • Schema migrations run as a pre-deploy Cloud Run job against Cloud SQL. Migrations are backwards-compatible so rollback stays safe.
  • Deploy is a gcloud run deploy that creates a new revision, gradually shifts traffic, and drains the old revision. Zero downtime — two revisions serve simultaneously during the cutover.
  • Images are pinned by digest, not by tag. Binary Authorization evaluates cosign attestations at deploy time (warn-only initially, enforced after a clean steady-state window).
  • Rollback is a one-command traffic shift to the previous revision: gcloud run services update-traffic … --to-revisions=<prev>=100.

Backups & recovery

  • Cloud SQL automated backups + point-in-time recovery with transaction-log retention. RPO ≤ a few minutes; RTO target ≤ 30 minutes for a full restore.
  • Audit logs are exported to a separate draxis-security project with 400-day retention on a write-once bucket (tamper-evidence).
  • Restore drills run quarterly in staging. A failed drill blocks the next production deploy.

Observability

  • App logs are structured JSON to stdout → Cloud Logging. Log-based metrics drive error-rate and slow-request alerts.
  • Ops Agent (for the Cloud Run jobs) and Cloud Run’s built-in metrics feed Cloud Monitoring: CPU / memory, request p50 / p95 / p99, cold-start count, Anthropic API latency.
  • External uptime checks hit /health from five global locations every 60s.
  • SLOs: 99.9% availability and p95 /api/* latency < 500ms (excluding long-polling LLM endpoints), 30-day rolling error rate < 1%.
  • Paging: pager-grade alerts email security@draxis.ai and post to an internal Google Chat webhook.
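
The 99.9% availability SLO implies a concrete error budget over the 30-day window; the arithmetic:

```typescript
// Error budget implied by the 99.9% availability SLO over a 30-day window.
const slo = 0.999;
const windowMinutes = 30 * 24 * 60;              // 43,200 minutes in the window
const budgetMinutes = windowMinutes * (1 - slo); // allowed downtime, ~43.2 minutes
```

Roughly 43 minutes of full downtime per 30 days, which is the budget the alerts above spend against.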