Production Launch Runbook

End-to-end procedure for taking Sapari from "staging works" to "production live" on a self-managed Hetzner CPX62 (16 vCPU / 32 GB) sized for up to 1000 concurrent users.

This document is the launch-day spine: pace yourself through Phases 1-6 in order, and jump into a dedicated runbook (linked at the bottom) when a step needs more depth. Time budget: ~6-8 hours of focused work spread across a day or two; Phase 5 (cutover) takes ~2 hours including smoke testing.

For ongoing operations (steady-state deploys, rollbacks, monitoring) see deployment.md. This runbook only covers the first-time launch and the operational gotchas that go with it.


Topology

                    ┌──────────────────────────┐
                    │ Cloudflare DNS (sapari.io)│
                    └────────────┬─────────────┘
          ┌──────────────────────┼──────────────────────────┐
          │                      │                          │
          ▼                      ▼                          ▼
    sapari.io          app.sapari.io                api.sapari.io
    (Pages: landing)   (Pages: frontend)            (grey cloud)
                            │                              │
                            │  /api/*, /media/v1/*         │
                            ▼                              ▼
                  ┌──────────────────┐            ┌──────────────────┐
                  │ CF Worker        │            │ Hetzner CPX62    │
                  │ (sapari-proxy-   │            │ 16 vCPU / 32 GB  │
                  │  production)     │────────────│ Caddy → web      │
                  └────────┬─────────┘            │ + 6 TaskIQ       │
                           │                      │   workers        │
                           │ R2 bindings          │ + scheduler      │
                           ▼                      │ + Redis          │
                    ┌──────────────┐              │ + RabbitMQ       │
                    │ R2 buckets   │              └────────┬─────────┘
                    │ raw/exports/ │                       │
                    │  assets      │                       │ DATABASE_URL
                    └──────────────┘                       ▼
                                                  ┌──────────────────┐
                                                  │ Neon Postgres    │
                                                  │ (Launch tier)    │
                                                  └──────────────────┘

External integrations: Stripe (webhooks → api.sapari.io), Postmark (email), OpenAI + DeepSeek (AI), Logfire (observability), Sentry (frontend, optional), Discord (Beszel alerts, optional), Tailscale (CI/CD SSH).


Capacity Sizing — Target

Goal: support up to 1000 concurrent users decently usable on a single CPX62 box.

Per-user resource footprint (verified in infrastructure/events/subscriber.py:266-312):

  • 1 long-lived SSE response held by one uvicorn worker
  • 2 Redis pubsub connections (notification channel + asset channel, fan-in pattern)
  • +1 more Redis pubsub connection when in the editor (subscribe_to_project)

Aggregate at 1000 concurrent users:

  • ~3000-4000 file descriptors held by the web container
  • ~2300 Redis pubsub connections
  • ~50-100 simultaneous active request senders out of 1000 (rest idle on SSE)
  • Realistic concurrent DB queries: ~30-50

Box budget:

| Component | CPU (vCPU) | Memory (GB) |
|---|---|---|
| Web (FastAPI, 4 workers) | 4.0 | 6.0 |
| 6 TaskIQ workers + scheduler | 11.6 | 14.5 |
| Caddy + Redis + RabbitMQ | 1.25 | 2.0 |
| Dozzle + Beszel hub + agent | 0.45 | 0.4 |
| Allocated | ~17.3 | ~22.9 |
| CPX62 | 16 | 32 |
| Headroom (incl. OS/page cache/FFmpeg surge) | -1.3 (oversubscription, fine on async I/O) | 9.1 (28%) |

CPU oversubscription is fine on shared-vCPU CPX62 because Sapari is async-I/O-bound (asyncpg, httpx, Redis) — workers spend most cycles waiting on network. The CPU caps are ceilings, not reservations.
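
The totals in the box budget can be re-derived with a few lines of arithmetic. This is a minimal sketch that only restates the table's numbers; the `summarize` helper and its key names are illustrative, not part of the Sapari codebase.

```python
# Rows from the box-budget table: (component, CPU cap in vCPU, memory cap in GB)
BUDGET = [
    ("Web (FastAPI, 4 workers)", 4.0, 6.0),
    ("6 TaskIQ workers + scheduler", 11.6, 14.5),
    ("Caddy + Redis + RabbitMQ", 1.25, 2.0),
    ("Dozzle + Beszel hub + agent", 0.45, 0.4),
]
BOX_CPU, BOX_MEM_GB = 16, 32  # CPX62


def summarize(budget, box_cpu, box_mem_gb):
    """Sum the caps and compare them with what the box physically has."""
    cpu = sum(row[1] for row in budget)
    mem = sum(row[2] for row in budget)
    return {
        "cpu_allocated": round(cpu, 2),
        "mem_allocated": round(mem, 2),
        "cpu_headroom": round(box_cpu - cpu, 2),  # negative means oversubscribed
        "mem_headroom": round(box_mem_gb - mem, 2),
        "mem_headroom_pct": round(100 * (box_mem_gb - mem) / box_mem_gb),
    }


summary = summarize(BUDGET, BOX_CPU, BOX_MEM_GB)
print(summary)
```

Re-running this after changing any of the tuning values in Phase 3.1 is a quick way to see whether memory headroom is still positive. Memory caps, unlike CPU caps, cannot safely be oversubscribed.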


Architectural Ceilings — Acknowledged for v1

Limits inherent to the single-box architecture. Tuning does NOT solve them; they need product/UX decisions or post-v1 horizontal scaling. Listed up front so they don't get lost:

  1. Render throughput is single-threaded. One render worker, one FFmpeg job at a time. A 60-min source render takes ~30-60 min wallclock. At 1000 concurrent users, even 0.5% triggering a render in the same hour = 5 queued, and the last user waits 2+ hours. v1 mitigation: explicit "Queued" UX (already in export-progress UI; verify copy in smoke test). Post-v1 fix: render worker replicas.
  2. Proxy throughput is single-threaded (same shape, lower demand: it only triggers on codec mismatch on upload).
  3. External API rate limits:
       • OpenAI Whisper: ~50 RPM on Tier 1. Verify Tier 2+ for 1000-user target.
       • DeepSeek: verify plan limits.
       • Postmark: realistic email volume at 1000 users is 2-5k/month (auth + billing events; render/analysis are SSE, not email). Starter ($15/mo, 10k cap) covers steady state; 50k tier ($50/mo) buys launch-burst headroom.
  4. SSE reconnect storms. Web container restart drops 1000 SSE connections simultaneously; frontend exponential backoff (1s, 2s, 4s, 8s, 16s, cap 30s) helps but resets on onopen, so second-wave thunder is possible. Mitigation (jitter on first retry) is post-launch.
  5. RabbitMQ queue depth visibility. No default operator dashboard for queue depth. Mgmt UI at :15672 shows it but isn't exposed. Add queue-depth Logfire span or admin endpoint post-launch.
  6. Single-box failure mode. OOM, FFmpeg subprocess crash, RabbitMQ memory pressure: any one takes down some-or-all users. v1 accepts; manual recovery in §Operational Gotchas.
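
To make the SSE reconnect-storm point concrete, the sketch below reproduces the backoff schedule quoted in the list (1s, 2s, 4s, 8s, 16s, cap 30s) and shows one common way to add jitter. It is an illustration in Python, not the frontend's actual code, and it applies "full jitter" to every attempt for simplicity, whereas the planned mitigation only mentions the first retry. The function names are hypothetical.

```python
import random

BASE_S = 1.0  # first retry delay, from the schedule quoted above
CAP_S = 30.0  # ceiling, from the schedule quoted above


def backoff_delay(attempt: int) -> float:
    """Deterministic schedule: 1, 2, 4, 8, 16, then capped at 30 seconds."""
    return min(CAP_S, BASE_S * 2 ** attempt)


def jittered_delay(attempt: int, rng: random.Random) -> float:
    """Full jitter: pick uniformly in [0, backoff_delay(attempt)]."""
    return rng.uniform(0.0, backoff_delay(attempt))


schedule = [backoff_delay(a) for a in range(7)]

# 1000 clients dropped at the same instant, all on their first retry.
rng = random.Random(42)
first_retries = [jittered_delay(0, rng) for _ in range(1000)]
distinct_slots = len({round(d, 3) for d in first_retries})
print(schedule, distinct_slots)
```

Without jitter all 1000 clients reconnect at exactly t = 1s; with it they spread across the whole first second, which is the effect the mitigation is after.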

Prerequisites — Accounts You Need

Sign up for these before starting Phase 1. None require coordination with another step.

  • Hetzner Cloud — for the production server (CPX62 or larger)
  • Cloudflare — DNS, Pages, Workers, R2, Access
  • Neon: Launch tier ($0.106/CU-hour; ~$76/mo if you set min CU=1; the default scale-to-zero costs $0 while idle)
  • Stripe — live mode activated (KYC complete, payouts configured)
  • Postmark — sender domain verification ready; Starter plan ($15/mo, 10k emails) sufficient for 1000 users in steady state
  • OpenAI — billing-enabled account; Tier 2+ recommended for 1000-user Whisper RPM
  • DeepSeek — billing-enabled account
  • Logfire — free tier is fine to start
  • GitHub — deploy key + production environment secrets
  • Tailscale — OAuth client configured (same one as staging works for prod)
  • Optional: Discord — webhook URL for Beszel alerts
  • Optional: Sentry — frontend project for error tracking

Phase 1 — Provision External Services

Order doesn't matter; everything is independent. Aim to finish in one sitting so secrets are fresh.

1.1 Neon Postgres (Launch tier)

  • Create new Neon project named sapari-production
  • Region: pick the same one as the Hetzner box (Ashburn → us-east-1, Hillsboro → us-west-2, Falkenstein → eu-central-1)
  • Subscribe to Launch tier ($0.106/CU-hour). Free tier is too restrictive for prod
  • Scale-to-zero: keep enabled (default 5 min). pool_pre_ping=True (session.py:14-19) + pool_recycle=300 catch stale-after-pause connections before any query runs. Watch for Logfire db.* p99 spikes post-launch; if seen, set min CU=1 ($76/mo always-on)
  • Copy the direct endpoint connection string (NOT pooler) — convention #16
  • Convert to async form: postgresql+asyncpg://...
  • Save as DATABASE_URL for Phase 3

Connection budget: at min CU=1, max_connections=419. Cluster ceiling at WEB_WORKERS=4 is web 4×30 + workers 6×30 + scheduler 30 = 330. Comfortable 21% headroom. Auto-scales to 16 CU under burst.
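
The connection ceiling follows directly from the pool settings used later in Phase 3.1. A minimal sketch of the arithmetic, assuming each web worker, TaskIQ worker, and the scheduler holds its own pool of POSTGRES_POOL_SIZE + POSTGRES_MAX_OVERFLOW connections:

```python
POOL_SIZE = 20              # POSTGRES_POOL_SIZE
MAX_OVERFLOW = 10           # POSTGRES_MAX_OVERFLOW
WEB_WORKERS = 4
TASKIQ_WORKERS = 6
SCHEDULERS = 1
NEON_MAX_CONNECTIONS = 419  # at min CU=1, per the text above

# Worst case: every process has its pool and its overflow fully checked out.
per_process = POOL_SIZE + MAX_OVERFLOW
ceiling = (WEB_WORKERS + TASKIQ_WORKERS + SCHEDULERS) * per_process
headroom = NEON_MAX_CONNECTIONS - ceiling
headroom_pct = 100 * headroom / NEON_MAX_CONNECTIONS

print(per_process, ceiling, headroom, round(headroom_pct))
```

Raising WEB_WORKERS or the pool settings moves the ceiling in steps of 30, so a bump to 7 web workers would already reach 420 and exceed the limit at min CU=1.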

1.2 Cloudflare R2 (3 buckets)

  • Create buckets: sapari-raw, sapari-exports, sapari-assets
  • Enable versioning on all three before any data lands. R2 doesn't version by default; an accidental delete or overwrite is gone forever. Toggle at the bucket level in the dashboard. CANNOT be retroactively enabled to recover already-deleted files
  • Create an API token scoped to Object Read & Write on all three buckets only — narrower than account-wide. Token scope is one-time at creation; can't be narrowed retroactively
  • Save: STORAGE_ACCESS_KEY_ID, STORAGE_SECRET_ACCESS_KEY, STORAGE_ENDPOINT (the https://<account-id>.r2.cloudflarestorage.com form)

1.3 Cloudflare Worker secret (placeholder)

  • MEDIA_TOKEN_SECRET=$(openssl rand -base64 32) — save it; this exact value goes both in the backend .env and as a Worker secret. Byte-identical is load-bearing
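
A stray newline or space picked up while pasting is enough to break the byte-identical requirement. The sketch below shows one way to compare two copies without printing the secret itself. The SHA-256 prefix used as a fingerprint here is an assumption for illustration only; it is not necessarily how the backend or Worker derive the fingerprints they log.

```python
import hashlib
import hmac


def fingerprint(secret: str) -> str:
    """Short, non-reversible tag for eyeballing whether two secrets match."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:8]


def same_secret(a: str, b: str) -> bool:
    """Constant-time, byte-for-byte comparison."""
    return hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8"))


backend_value = "c2FtcGxlLXNlY3JldC1mb3ItZG9jcy1vbmx5"  # placeholder, not a real secret
pasted_value = backend_value + "\n"  # trailing newline from a copy-paste

print(fingerprint(backend_value), fingerprint(pasted_value))
print(same_secret(backend_value, pasted_value))
```

The two fingerprints differ even though the values look identical on screen, which is why checking fingerprints on both sides (Phase 4.4) is worth the minute it takes.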

1.4 Stripe (live mode)

  • Toggle Stripe Dashboard to live mode
  • Copy sk_live_... and pk_live_... → STRIPE_SECRET_KEY, STRIPE_PUBLISHABLE_KEY. Test-mode keys (sk_test_) on production fail every transaction silently; live-mode keys on staging will charge real cards. Verify mode visually before copying
  • Create webhook endpoint at https://api.sapari.io/api/v1/webhooks/stripe, pointing directly at the backend (api.*, NOT app.*). URL won't resolve yet; create it anyway. The CF Worker proxy is for browser API calls; webhooks should hit Caddy → backend directly
  • Subscribe to events: customer.subscription.updated, customer.subscription.deleted, invoice.payment_failed, charge.refunded, checkout.session.completed
  • Copy webhook signing secret (different from API keys; one per endpoint) → STRIPE_WEBHOOK_SECRET. A wrong webhook secret silently breaks payment processing — backend returns 401, Stripe retries a few times then gives up, subscriptions don't activate. Symptom: "user paid but doesn't see credits." Test reachability post-deploy via Stripe Dashboard's "Send test webhook" button (Phase 4.7)
  • Set STRIPE_TEST_MODE=false
  • Tier 3 (Creator) and Tier 4 (Viral) products + prices are auto-seeded by seed_stripe_products.py on first deploy; do not pre-create
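
Why a wrong STRIPE_WEBHOOK_SECRET fails every delivery becomes clearer from the signature scheme Stripe documents: the Stripe-Signature header carries a timestamp `t` and one or more `v1` values, each an HMAC-SHA256 of `"<t>.<raw body>"` keyed with the endpoint's signing secret. The sketch below reimplements that check with the standard library purely for illustration. The backend most likely uses Stripe's own SDK, and the function names and sample secrets here are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(payload: bytes, secret: str, timestamp: int) -> str:
    """Build a Stripe-Signature style header: t=<ts>,v1=<hex HMAC-SHA256>."""
    signed = f"{timestamp}.".encode() + payload
    digest = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"


def verify(payload: bytes, header: str, secret: str,
           tolerance_s: int = 300, now: Optional[int] = None) -> bool:
    """Return True only if a v1 signature matches and the timestamp is fresh."""
    pairs = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamps = [v for k, v in pairs if k == "t"]
    candidates = [v for k, v in pairs if k == "v1"]
    if not timestamps or not candidates or not timestamps[0].isdigit():
        return False
    ts = int(timestamps[0])
    current = int(time.time()) if now is None else now
    if abs(current - ts) > tolerance_s:
        return False  # stale or replayed event
    expected = sign(payload, secret, ts).split("v1=", 1)[1]
    return any(hmac.compare_digest(expected, c) for c in candidates)


body = b'{"id": "evt_test"}'
header = sign(body, "whsec_right", 1_700_000_000)
print(verify(body, header, "whsec_right", now=1_700_000_000))
print(verify(body, header, "whsec_wrong", now=1_700_000_000))
```

The signature is computed over the raw request bytes, so any middleware that re-serializes the JSON before verification produces the same failure as a wrong secret.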

1.5 Postmark

DNS propagation can take hours — start early in Phase 1 so it's done by Phase 5.

  • Add sapari.io as a sender domain
  • Add all three DNS records to Cloudflare. Postmark Dashboard surfaces exact values:
      • DKIM (<selector>._domainkey.sapari.io TXT): signs outbound mail
      • SPF (TXT on apex): authorizes Postmark to send on your behalf
      • Return-Path (CNAME): bounce-handling subdomain
  • Wait until all three show green in Postmark dashboard before proceeding. Skipping = mail in spam or bounced
  • Plan tier: Starter ($15/mo, 10k emails) covers 1000 users in steady state (auth + billing events only; render/analysis/asset events are SSE, not email). Realistic volume ~2-5k emails/month. Upgrade to 50k tier ($50/mo) if you want launch-burst headroom
  • Use a separate Postmark server token per environment (one for staging, one for production). Sharing muddles deliverability stream
  • Save the production server token → POSTMARK_SERVER_TOKEN
  • Sender reputation warning: even with green DKIM/SPF/Return-Path, Postmark starts new domains with low reputation. ISPs throttle. Don't blast 1000+ users on day 1 — drip launch announcements

1.6 OpenAI + DeepSeek

  • Create API key at platform.openai.com → OPENAI_API_KEY. Tier 2+ recommended for 1000-user target: Tier 1 caps Whisper at ~50 RPM; sustained burst at scale could 429. OpenAI auto-tiers up with usage history; if you've been on Tier 1, request Tier 2 explicitly via support before launch
  • Create API key at platform.deepseek.com → DEEPSEEK_API_KEY. Verify plan limits cover ~80 analyses/hour throughput target
  • Set hard spending caps in both dashboards (recommended: $200/mo OpenAI, $50/mo DeepSeek to start; tune after real usage)

1.7 Logfire

  • Create project named sapari-production (or share with staging using environment tag)
  • Get write token → LOGFIRE_TOKEN
  • If sharing with staging: set LOGFIRE_ENVIRONMENT=production in prod .env so spans are tagged

1.8 Optional: Sentry + Discord

  • Sentry — frontend project, copy DSN. Add to frontend/.env.production as VITE_SENTRY_DSN. Frontend's perfMarks.ts already adds breadcrumbs under category perf
  • Discord webhook — create in your Discord server. Beszel uses shoutrrr format, NOT raw HTTPS URL. Beszel silently swallows malformed URLs:
    Raw Discord URL:     https://discord.com/api/webhooks/<webhook-id>/<token>
    Shoutrrr format:     discord://<token>@<webhook-id>
    
    Note token first, then ID: the opposite of the URL form. Verify alerts arrive (stress -c "$(nproc)" -t 60 loads every core and should trip the CPU >80% alert; a fixed -c 4 only loads 4 of the CPX62's 16 vCPUs)
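
Because the token and ID swap places between the two forms, converting by hand is easy to get wrong. A small helper can do it mechanically; this is a sketch, the function name is hypothetical, and it only handles the plain `https://discord.com/api/webhooks/<webhook-id>/<token>` shape shown above.

```python
from urllib.parse import urlparse


def discord_to_shoutrrr(webhook_url: str) -> str:
    """Convert a raw Discord webhook URL to discord://<token>@<webhook-id>."""
    parsed = urlparse(webhook_url)
    parts = [p for p in parsed.path.split("/") if p]
    # Expected path: /api/webhooks/<webhook-id>/<token>
    if parsed.scheme != "https" or len(parts) != 4 or parts[:2] != ["api", "webhooks"]:
        raise ValueError("not a Discord webhook URL")
    webhook_id, token = parts[2], parts[3]
    return f"discord://{token}@{webhook_id}"


print(discord_to_shoutrrr("https://discord.com/api/webhooks/123456/abcDEF"))
```

Raising on anything unexpected is deliberate: since Beszel swallows malformed URLs silently, it is better for the conversion step to fail loudly.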

Phase 2 — Hetzner CPX62 + Tailscale + Caddy Setup

Allow ~90 min start to finish; longer if first time.

2.1 Pick the box

  • Size: CPX62 — 16 shared vCPU, 32 GB RAM, 640 GB SSD, 20 TB transfer, $59.49/mo. Sized to support 1000 concurrent users with the production tuning. CCX33 (8 vCPU dedicated, 32 GB) is a viable swap if you prefer dedicated CPU at higher cost
  • OS: Ubuntu 24.04 LTS. setup-server.sh checks for ssh.service (24.04+) vs sshd.service (older); 24.04 is the tested baseline
  • Region: match the Neon region from Phase 1.1. The DB-to-server hop is on the critical path for every API request
  • SSH key: add yours to the Hetzner project (cloud-init drops it into root@ authorized_keys automatically)
  • Firewall: attach the existing firewall-tailscale Hetzner Cloud Firewall at server creation if you have one. Defense in depth on top of host-level UFW (which setup-server.sh configures separately). Rule set should cover 443/tcp + 443/udp (HTTP/3) from anywhere, 22/tcp from Tailscale CGNAT (100.64.0.0/10), and IPv6 sources (::/0) — not just IPv4
  • Enable automated backups at order time — Hetzner adds ~20% to monthly cost (~$12/mo for CPX62) for 7-day rolling snapshots of the entire disk. Toggle during provisioning; flipping it on after leaves an initial unprotected window. Dramatically cheaper than a custom backup pipeline; covers Redis, RabbitMQ, Caddy certs, .env in one operation
  • Note the IPv4 — this becomes api.sapari.io's A record

2.2 Add the DNS record now (low TTL, grey cloud)

Do this before setup-server.sh so DNS has time to propagate by the time Caddy needs it for ACME DNS-01.

  • Cloudflare DNS for sapari.io → add A record:
      • Name: api
      • IPv4: <hetzner-ip>
      • Proxy: OFF (grey cloud). Caddy terminates TLS itself; double-proxying through CF orange-cloud breaks the cert flow at this point. Once stable, flipping to orange-cloud is a post-launch hardening step (see Phase 6.3)
      • TTL: 60 seconds, which keeps a fast rollback option open during cutover. Bump to Auto after the smoke test (Phase 5.4)

2.3 Run setup-server.sh

setup-server.sh is idempotent (safe to re-run) but takes a required --my-ip flag — whitelists only that IP for SSH on port 22. Wrong IP = locked out (Hetzner has console rescue if needed).

# Find your operator IP first:
curl -s ifconfig.me

# Then on the server, as root:
ssh root@<hetzner-ip>
git clone https://github.com/benavlabs/sapari.git /opt/sapari-bootstrap
cd /opt/sapari-bootstrap
./scripts/deployment/setup-server.sh --my-ip <your-operator-ip> --hostname sapari-prod

What it does (verify each):

| Step | What |
|---|---|
| Hostname | Sets to sapari-prod; updates /etc/hosts |
| deploy user | Created with sudo + docker group, passwordless sudo (intentional for CD; safe because SSH is key-only and tailnet-gated) |
| SSH hardening | Disables root login, disables password auth, requires pubkey |
| UFW firewall | Default deny inbound; allows 443/tcp from anywhere; 22/tcp only from --my-ip. No port 80: Caddy uses DNS-01 ACME |
| unattended-upgrades + fail2ban | Auto-security-patches; SSH brute-force throttling |
| Docker | Official get.docker.com install |
| GitHub deploy key | Generates Ed25519 keypair at /home/deploy/.ssh/github_deploy_key; pre-seeds known_hosts for github.com. Prints the public key at the end; add it manually to the repo's Deploy Keys |
| Origin firewall service | Installs sapari-docker-firewall.service (oneshot systemd unit, After=docker.service) that locks the Docker DOCKER-USER chain :443 to Cloudflare's published egress IPs. See Phase 6.3 |

What it does NOT do (manual, in 2.4):

  • Install or join Tailscale
  • Add swap
  • Verify NTP
  • Configure GHCR auth (not needed: images are public)

  • Copy printed Ed25519 public key
  • GitHub repo → Settings → Deploy keys → Add deploy key → paste, name it sapari-prod, leave "Allow write access" unchecked → Add
  • Verify deploy user can clone: ssh deploy@<hetzner-ip> "ssh -T git@github.com" should print "Hi <owner>/<repo>! You've successfully authenticated, ..."

2.4 Manual hardening — swap, NTP, Tailscale

Swap (prevents FFmpeg OOM):

Hetzner cloud images ship with zero swap. Render worker FFmpeg + uploaded video can spike past per-container limits during a large render → swapless OOM takes down the whole box.

# As root or with sudo:
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
swapon --show  # verify
  • 4 GB swap created and persisted across reboots

NTP (load-bearing for JWT verification):

Media token JWTs have a 5-minute TTL. Clock skew of even ~10s between server and CF Worker breaks playback ("token expired" 401s on freshly-minted URLs). systemd-timesyncd is on by default in Ubuntu 24.04, but verify:

timedatectl
# Want: "System clock synchronized: yes" + "NTP service: active"
  • Clock is NTP-synced

Tailscale (CI access path):

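Tailscale may not be present on a stock Hetzner Ubuntu image, and setup-server.sh does not install it (see 2.3). If tailscale version fails, install it first with the official script (curl -fsSL https://tailscale.com/install.sh | sh), then bring it up manually: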

sudo tailscale up --hostname=sapari-prod
# Follow URL, log in, authorize the node

Then in Tailscale admin console:

  • Machines → find sapari-prod → Edit ACL tags → add tag:server
  • Verify ACL grants tag:ci → tag:server (or allow-all). The OAuth client used by GitHub Actions is reusable from staging
  • Note the tailnet IP (100.x.y.z) — this is what GitHub Actions SSHs to, not the public IPv4. Save as SSH_HOST GitHub secret for the production environment

2.5 GitHub Actions secrets for the production environment

GitHub repo → Settings → Environments → New environment named production. Add:

  • SSH_HOST — tailnet IP from 2.4 (not public IPv4)
  • SSH_KEY — private half of a fresh Ed25519 keypair generated locally; install public half on server with ssh-copy-id deploy@<tailnet-ip> (over Tailscale). Keep this key separate from your operator key — CI-only, revocable independently
  • TS_OAUTH_CLIENT_ID and TS_OAUTH_SECRET — same as staging environment

Set deploy gate (typed YES confirmation, required reviewer, deploy window) on the production environment's protection rules if desired.

2.6 Caddy — what to expect, and the F1 gotcha

Caddy is one of the containers first-deploy.sh will start in Phase 3. A few things to know:

  • Config: caddy/Caddyfile (committed). Reverse-proxies api.sapari.io → web:8000, sets X-Forwarded-For from Cloudflare's cf-connecting-ip so backend logs see real client IPs
  • TLS via Cloudflare DNS-01: needs CLOUDFLARE_API_TOKEN in .env (Phase 3.1) with Zone:DNS:Edit permission on sapari.io only. First deploy: cert acquisition takes ~30-60 seconds; check docker logs caddy if no successful cert event
  • F1 (known gotcha — not yet fixed in deploy.sh): docker-compose.prod.yml bind-mounts ./caddy/Caddyfile. When deploy.sh runs git reset --hard, git replaces the file via tempfile rename — inode changes, container's bind mount points at stale inode. Caddy doesn't see new config until the container is force-recreated.

Workaround: any time you change the Caddyfile, after deploy completes, run on the server: docker compose -f docker-compose.prod.yml up -d --force-recreate caddy. Burn into muscle memory.
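
The inode behaviour behind F1 is easy to reproduce outside Docker. The sketch below writes a file in place, then replaces it via a temp file and rename (the pattern described above for git reset --hard), and compares inode numbers. It demonstrates only the filesystem mechanism, using a throwaway directory; it does not touch Docker or the real Caddyfile.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "Caddyfile")
    with open(path, "w") as f:
        f.write("old config\n")
    inode_before = os.stat(path).st_ino

    # In-place write: same inode, so a single-file bind mount would see it.
    with open(path, "w") as f:
        f.write("edited in place\n")
    inode_in_place = os.stat(path).st_ino

    # Write-to-temp-then-rename: the path now points at a different inode.
    tmp = os.path.join(workdir, "Caddyfile.tmp")
    with open(tmp, "w") as f:
        f.write("new config\n")
    os.replace(tmp, path)
    inode_after_rename = os.stat(path).st_ino

print(inode_before, inode_in_place, inode_after_rename)
```

A single-file bind mount is pinned to the inode that existed when the container started, which is why only recreating the container picks up the renamed file.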

2.7 Pre-flight before Phase 3

  • dig api.sapari.io +short from anywhere returns the Hetzner IPv4
  • ssh deploy@<tailnet-ip> works (over Tailscale)
  • ssh deploy@<tailnet-ip> "docker --version" prints a version (deploy in docker group)
  • ssh deploy@<tailnet-ip> "swapon --show" shows the 4 GB swapfile
  • ssh deploy@<tailnet-ip> "timedatectl | grep synchronized" shows yes
  • GitHub repo Deploy keys list contains sapari-prod Ed25519 key
  • GitHub production environment has all four secrets set

Phase 3 — First Deploy

This is where the prod app first runs. Plan ~1 hour with debugging margin.

3.1 Clone repo + write .env.production

ssh deploy@<tailnet-ip>  # via Tailscale
git clone https://github.com/benavlabs/sapari.git ~/sapari
cd ~/sapari
cp backend/.env.production.example .env

Critical values the security validator enforces (app refuses to start otherwise):

  • SECRET_KEY: 32+ chars; python -c "import secrets; print(secrets.token_urlsafe(64))"
  • POSTGRES_PASSWORD: must NOT be postgres (or pass full DATABASE_URL instead)
  • CREATE_TABLES_ON_STARTUP=false
  • ENVIRONMENT=production
  • DEBUG=false
  • STRIPE_TEST_MODE=false
  • SESSION_SECURE_COOKIES=true
  • ADMIN_USERNAME: NOT admin
  • ADMIN_PASSWORD: 12+ chars, not in weak password list
  • TASKIQ_RABBITMQ_USER and _PASSWORD: NOT guest/guest
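
Catching these before first-deploy.sh saves a failed container start. The sketch below is a hypothetical pre-flight check that mirrors the rules listed above. It is not the application's actual validator, it skips the weak-password list (not reproduced here), and it assumes `_PASSWORD` expands to `TASKIQ_RABBITMQ_PASSWORD`.

```python
# Settings that must have one exact value in production.
EXACT = {
    "CREATE_TABLES_ON_STARTUP": "false",
    "ENVIRONMENT": "production",
    "DEBUG": "false",
    "STRIPE_TEST_MODE": "false",
    "SESSION_SECURE_COOKIES": "true",
}


def check_env(env: dict) -> list:
    """Return a list of human-readable problems; empty means the rules pass."""
    problems = []
    if len(env.get("SECRET_KEY", "")) < 32:
        problems.append("SECRET_KEY must be 32+ chars")
    if not env.get("DATABASE_URL") and env.get("POSTGRES_PASSWORD", "postgres") == "postgres":
        problems.append("POSTGRES_PASSWORD must not be postgres")
    for key, want in EXACT.items():
        if env.get(key, "").lower() != want:
            problems.append(f"{key} must be {want}")
    if env.get("ADMIN_USERNAME", "admin") == "admin":
        problems.append("ADMIN_USERNAME must not be admin")
    if len(env.get("ADMIN_PASSWORD", "")) < 12:
        problems.append("ADMIN_PASSWORD must be 12+ chars")
    # Assumed variable name: TASKIQ_RABBITMQ_PASSWORD
    if (env.get("TASKIQ_RABBITMQ_USER", "guest") == "guest"
            and env.get("TASKIQ_RABBITMQ_PASSWORD", "guest") == "guest"):
        problems.append("TASKIQ_RABBITMQ credentials must not be guest/guest")
    return problems


good_env = {
    "SECRET_KEY": "x" * 64,
    "DATABASE_URL": "postgresql+asyncpg://user:pw@host/db",
    "CREATE_TABLES_ON_STARTUP": "false",
    "ENVIRONMENT": "production",
    "DEBUG": "false",
    "STRIPE_TEST_MODE": "false",
    "SESSION_SECURE_COOKIES": "true",
    "ADMIN_USERNAME": "ops-root",
    "ADMIN_PASSWORD": "correct-horse-battery",
    "TASKIQ_RABBITMQ_USER": "sapari",
    "TASKIQ_RABBITMQ_PASSWORD": "not-guest",
}
bad_env = dict(good_env, DEBUG="true", ADMIN_USERNAME="admin")
print(check_env(good_env), check_env(bad_env))
```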

Differs from staging (double-check):

  • OAUTH_REDIRECT_BASE_URL=https://app.sapari.io
  • MEDIA_PROXY_BASE_URL=https://app.sapari.io (no trailing slash)
  • FRONTEND_URL=https://app.sapari.io
  • API_PUBLIC_URL=https://api.sapari.io (newsletter confirm/unsubscribe email links)
  • LANDING_URL=https://sapari.io (newsletter confirm/unsubscribe redirect targets)
  • CORS_ORIGINS=https://app.sapari.io,https://sapari.io (must include landing or newsletter signup POST is CORS-blocked)
  • LOGFIRE_ENVIRONMENT=production
  • All Stripe keys live (sk_live_, pk_live_, whsec_ from live webhook)

Tuning values to set explicitly for prod (these turn the parameterization on; without them, compose uses staging defaults):

  • WEB_WORKERS=4, WEB_MEMORY=6g, WEB_CPUS=4.0, WEB_MEMORY_RESERVATION=1g
  • RENDER_MEMORY=6g, RENDER_CPUS=4.0, RENDER_FFMPEG_THREADS=4, RENDER_MEMORY_RESERVATION=2g
  • PROXY_MEMORY=3g, PROXY_CPUS=3.0, PROXY_FFMPEG_THREADS=3
  • DOWNLOAD_MEMORY=2g, DOWNLOAD_CPUS=2.0
  • ANALYSIS_MEMORY=2g, ANALYSIS_CPUS=1.5, ANALYSIS_TASKIQ_CONCURRENCY=4
  • REDIS_MAXMEMORY=500mb, REDIS_MEMORY=768m, REDIS_CPUS=0.5
  • RABBITMQ_MEMORY=1g
  • CADDY_MEMORY=256m, CADDY_CPUS=0.25
  • POSTGRES_POOL_SIZE=20, POSTGRES_MAX_OVERFLOW=10
  • STORAGE_MAX_UPLOAD_SIZE_MB=10240
  • MAX_VIDEO_DURATION_MINUTES=90

3.2 Verify GHCR image pull

GHCR images are public today — no docker login needed.

docker pull ghcr.io/benavlabs/sapari-backend:production
  • Pull succeeds. If you ever flip the repo to private: PAT with read:packages, then echo $TOKEN | docker login ghcr.io -u <user> --password-stdin

3.3 Run first-deploy.sh

./scripts/deployment/first-deploy.sh

Order: pull image → migrate (alembic upgrade head) → seed (tiers, admin user, Stripe products) → start all services → health check.

  • First-time seed creates: 4 tiers (free/hobby/creator/viral), the admin user, Stripe products for Creator + Viral
  • Watch for migration errors — they abort the deploy
  • If seed_stripe_products.py errors → STRIPE_SECRET_KEY is wrong. Fix and re-run just the seed: ./scripts/deployment/run-task.sh backend/scripts/seed_stripe_products.py

3.4 Verify the box is alive (before DNS routes traffic)

# On the server:
curl -f http://localhost:8000/health           # liveness
curl -f http://localhost:8000/health/ready     # readiness — DB + Redis + RabbitMQ + storage all green

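# From your laptop, hitting the IP directly. --resolve pins api.sapari.io to the IP so the
# TLS handshake sends the right SNI (a bare-IP URL sends none and Caddy has no cert for it):
curl --resolve api.sapari.io:443:<hetzner-ip> https://api.sapari.io/health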
  • Liveness: 200
  • Readiness: 200 with all subsystems green
  • If Caddy hasn't obtained the cert, check docker logs caddy — DNS-01 needs valid CLOUDFLARE_API_TOKEN and api.sapari.io resolving to this IP

3.5 Verify externally once DNS propagates

# After a couple minutes:
curl https://api.sapari.io/health
curl https://api.sapari.io/health/ready
  • Both return 200 with valid certs (no -k needed)

Phase 4 — Cloudflare Worker + Pages + DNS

The Worker handles /api/* proxying to api.sapari.io and /media/v1/* for R2 media. Pages serves the frontend at app.sapari.io and the landing at sapari.io.

4.1 Frontend build env

  • In frontend/, create or update .env.production (committed file is fine; no secrets):
      • VITE_API_BASE_URL=https://app.sapari.io (Worker proxies /api/* to backend)
      • VITE_SENTRY_DSN=<from 1.8> if using Sentry
  • Push to main (or your prod branch) — Cloudflare Pages builds + deploys automatically

4.2 Frontend custom domain

  • CF Pages dashboard → frontend project → Custom domains → add app.sapari.io
  • CF auto-creates the CNAME; verify it propagates

4.3 Landing page

  • Same flow for landing/ Pages project — custom domain sapari.io (apex). Pages handles apex via CNAME flattening
  • Set PUBLIC_API_ORIGIN=https://api.sapari.io in the landing Pages project env vars so the newsletter signup form POSTs to the correct origin (empty value renders an inline "Misconfigured" error on the signup button)

4.4 Cloudflare Worker — secret + deploy

# From your laptop, in worker/ directory:
cd worker
npx wrangler secret put MEDIA_TOKEN_SECRET_V1 --env production
# Paste the SAME value you put in backend .env as MEDIA_TOKEN_SECRET — byte-identical

npm run deploy:production
# If wrangler reports "No deploy targets":
npx wrangler versions deploy --env production
  • Verify fingerprints match. Backend logs: docker logs sapari-backend | grep media_token should print media_token: active=v1 registry=[v1:<8-char-hex>]. Then npx wrangler tail --env production and load a clip in the browser — same fingerprint should appear

4.5 Cloudflare Worker — route patterns (dashboard, NOT wrangler.toml)

Worker [[routes]] in wrangler.toml is intentionally empty — Pages binding can't coexist with route patterns there (CF returns error 10144 if both are declared).

  • CF Dashboard → Workers & Pages → sapari-proxy-production → Settings → Domains & Routes → Add (order matters — first match wins):
      • app.sapari.io/media/v1/* ← add this first
      • app.sapari.io/api/* ← then this

If /api/* comes before /media/v1/*, media requests like /media/v1/<jwt> match the API rule first and get proxied to the backend. Symptom: 404 on every clip play. No error message at deploy — purely order-of-rules.

Without route patterns at all, requests bypass the Worker and hit Pages' SPA fallback (returns index.html for /api/* — symptom is "API calls return HTML").

4.6 Cloudflare Access policies

  • Production app (app.sapari.io): NO Access policy. Public
  • internal-docs.sapari.io: same Access app as staging (GitHub team benavlabs/Sapari)
  • If/when adding dozzle-prod.sapari.io and beszel-prod.sapari.io: gate via the same Access app

4.7 Stripe webhook reachability test

  • Stripe Dashboard → Webhooks → your endpoint → "Send test webhook" → pick customer.subscription.updated
  • Check docker logs sapari-backend | grep webhook — should show signature-verified receipt

Phase 5 — Cutover & Smoke Test

Live moment. ~2 hours including verification.

5.1 Pre-flight (do once, before the cutover hour)

  • All Phases 1-4 boxes ticked
  • https://api.sapari.io/health/ready returns 200 from outside the network
  • https://app.sapari.io loads the frontend (with Worker routes, /api/* proxies)
  • Stripe live webhook test passes
  • Postmark send-test passes
  • TTL on api.sapari.io is 60s (set in 2.2)

5.2 Smoke test the running prod app (before announcing)

Run through every critical user journey while the app is live but unannounced. Use a real (non-admin) user account or create one fresh.

Auth + onboarding

  • Signup → email verification → login → logout → login again
  • OAuth: Google login (if configured), GitHub login (if configured)

Editor cold paths

  • Create two projects, switch between them: editor doesn't blank, no wrong-project mutation if you spam delete-edit
  • Cold load /projects/:uuid: editor skeleton renders, not generic spinner
  • Cold load /projects/new: list skeleton, NOT editor skeleton

Upload paths

  • Upload < 25 MiB clip (single-PUT path) → presign + PUT, confirm endpoint succeeds
  • Upload 100 MB+ clip (multipart path) → multipart initiate + parallel parts + complete; faster than single-PUT
  • Multipart cancel + resume: cancel mid-upload, drop same file in again → resumes from next missing part (network tab: /multipart/parts listing precedes new PUTs)
  • Upload 11 GB file → frontend rejects with friendly "10 GB max" copy BEFORE any bytes upload
  • Upload 95-min video → frontend rejects with "90 minutes max" copy
  • Upload a short video in an unusual MOV container whose metadata the browser can't probe → upload proceeds (fail-soft); backend ffprobe rejects with a backend error if too long

Pipelines

  • Trigger analysis run, watch SSE events arrive (analysis_progress, analysis_complete)
  • Render export, watch progress, download result, play it
  • Verify "Queued" UX when render starts and worker is busy (manually queue a 2nd render to test the queue-grow path)

Billing

  • Subscribe to Creator tier with a real card, verify webhook fires, verify entitlements grant
  • Cancel subscription mid-flight: verify cancellation feedback flow + retention offers

Assets

  • Upload image + video assets, verify they render in asset library

Newsletter

  • Submit landing newsletter form with a fresh email → confirmation email arrives within ~30 s
  • Click confirm link → lands on /newsletter/confirmed?status=confirmed (English default; pt/es subscribers land on /pt/newsletter/confirmed?... or /es/newsletter/confirmed?... once the localized Astro routes are populated; default-locale paths are unprefixed per landing/astro.config.mjs)
  • Click unsubscribe link in any received email → lands on /newsletter/unsubscribed?status=unsubscribed (same locale-prefix convention as confirm above)

SSE

  • DevTools network: confirm only ONE EventSource at /api/v1/events/user-stream
  • Force-disconnect SSE (DevTools offline throttle): polling fallback engages within ~5s, both notifications and assets caches invalidate; reconnect → polling stops

Mobile

  • Log in on a real phone, do the wizard end-to-end, landscape review works

Health endpoints from monitoring tools

  • /health + /health/ready both green

5.3 If smoke test fails

  • Trace via Logfire — find failing span, fix forward if possible
  • If unfixable in <30 min: rollback. From server: ./scripts/deployment/rollback.sh <previous-sha>
  • If schema is the issue and rollback.sh aborts: Neon time-travel restore to revert schema, then rollback the image

5.4 Announce

Once smoke test is green:

  • Bump api.sapari.io TTL back to Auto (no longer need fast-rollback window)
  • Announce on whichever channel (Twitter, mailing list, Discord, etc.) — drip, not blast (Postmark sender reputation)
  • Tail Logfire for the first hour — anomalies show up here first

Phase 6 — Post-Launch Tuning + Hardening

Do these in days/weeks after launch. None block go-live; each compounds reliability.

6.1 Tuning observation (week 1)

Watch for signals that would change the resource-sizing decisions:

  • Neon scale-to-zero impact: Logfire db.* span p99 spikes correlated with idle gaps. If real signal shows up, set min CU=1 in Neon dashboard (~$76/mo always-on). Otherwise save it
  • Web worker memory pressure: Beszel web container memory under sustained 100+ concurrent users. If hitting 80%+ of 6g limit, bump WEB_MEMORY=8g in .env.production
  • Caddy CPU under login burst: TLS handshake spikes saturating 0.25 cpu. Bump CADDY_CPUS=0.5 if seen
  • Redis CPU under pubsub fan-out: 0.5 cpu sufficient at target load. Bump if Beszel shows sustained >70%
  • Render queue depth: visible signal that single-worker render is bottleneck. Either raise UX expectations ("est. 30 min wait") or add a second render worker container (vertical replica)
  • Postmark deliverability: log into dashboard, watch reputation score in week 1 — should climb from neutral
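
For the render-queue signal, a rough wait estimate helps decide between adjusting the UX copy and adding a worker. The sketch below assumes a FIFO queue and identical job lengths, and reuses the figures from the Architectural Ceilings section (0.5% of 1000 users, 30-60 min per render). The function name is hypothetical.

```python
def wait_before_start_minutes(position: int, workers: int, render_minutes: float) -> float:
    """FIFO queue, identical job lengths: the job at 0-based `position`
    starts once (position // workers) full renders have finished."""
    return (position // workers) * render_minutes


users, render_fraction = 1000, 0.005
queued = round(users * render_fraction)  # jobs submitted in the same hour
last = queued - 1                        # 0-based position of the last job

one_worker = (wait_before_start_minutes(last, 1, 30),
              wait_before_start_minutes(last, 1, 60))
two_workers = (wait_before_start_minutes(last, 2, 30),
               wait_before_start_minutes(last, 2, 60))
print(queued, one_worker, two_workers)
```

Under these assumptions a second render worker halves the last user's wait, from 2-4 hours to 1-2 hours, before their own render even begins.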

6.2 Monitoring on production (week 1)

  • Add Beszel + Dozzle to production (already in docker-compose.prod.yml; just need agent KEY/TOKEN bootstrap from the hub UI)
  • Gate beszel-prod.sapari.io and dozzle-prod.sapari.io behind Cloudflare Access (reuse GitHub team policy)
  • Wire Discord webhook with prod-specific channel — alerts for CPU >80%, memory >80%, disk >90%, container restarts
  • Set up Logfire alert on API error rate >1%
  • Set up Logfire alert on task failure rate >5%
  • Add queue-depth observability — Logfire span or admin endpoint reporting per-broker RabbitMQ queue depth (from architectural-ceiling #5). Can be a simple /admin/queues endpoint reading rabbitmqctl list_queues
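
A minimal sketch of the parsing half of such a queue-depth endpoint, assuming the tab-separated output of rabbitmqctl list_queues name messages. The function names are hypothetical, and the sketch assumes the command runs somewhere the broker's CLI is reachable (for example inside the RabbitMQ container); wiring it into /admin/queues is left out.

```python
import subprocess


def parse_queue_depths(output: str) -> dict[str, int]:
    """Parse tab-separated `name<TAB>messages` rows into {queue: depth}."""
    depths: dict[str, int] = {}
    for line in output.splitlines():
        parts = line.rsplit("\t", 1)
        # Skip banner and header rows: they have no tab or a non-numeric count.
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue
        depths[parts[0].strip()] = int(parts[1].strip())
    return depths


def fetch_queue_depths() -> dict[str, int]:
    """Run rabbitmqctl (assumed to be on PATH next to the broker) and parse it."""
    result = subprocess.run(
        ["rabbitmqctl", "--quiet", "list_queues", "name", "messages"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_queue_depths(result.stdout)
```

Keeping the parser separate from the subprocess call makes it testable without a running broker.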

6.3 Origin firewall + Cloudflare orange-cloud cutover

Two changes that land together to lock origin traffic to Cloudflare's edge.

Origin firewall: scripts/deployment/sapari-docker-firewall.sh + .service, installed during setup-server.sh. The unit locks :443 in Docker's DOCKER-USER iptables chain to Cloudflare's published v4 + v6 IP ranges. UFW alone doesn't cover Docker-published ports: Docker DNATs them in PREROUTING, so the traffic traverses the FORWARD chain and never reaches UFW's INPUT rules. The unit is a oneshot ordered After=docker.service; it fetches CF's IP lists at boot, scopes rules to the WAN interface, and re-applies idempotently. Verify with:

iptables -L DOCKER-USER -n -v --line-numbers
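# Expect ~15 v4 CF CIDR ACCEPTs above one final DROP for dpt:443
ip6tables -L DOCKER-USER -n -v --line-numbers
# Expect ~7 v6 CF CIDR ACCEPTs above the DROP (iptables -L lists IPv4 rules only)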
systemctl status sapari-docker-firewall
# Expect "Active: active (exited)" + "enabled"
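
The rule ordering the unit is expected to produce (CF ACCEPTs above one DROP) can be sketched as follows. This is an illustration only, not the contents of sapari-docker-firewall.sh: the function name, the eth0 interface, and the choice to build commands with -I DOCKER-USER 1 are assumptions. Inserting the DROP first and each ACCEPT at position 1 afterwards leaves every ACCEPT above the DROP.

```python
import ipaddress


def build_docker_user_rules(cidrs: list[str], wan_iface: str, port: int = 443) -> list[list[str]]:
    """Return iptables/ip6tables commands, in execution order, for the DOCKER-USER chain."""
    match = ["-i", wan_iface, "-p", "tcp", "--dport", str(port)]
    # Insert the DROP first so that later position-1 inserts land above it.
    commands = [
        [binary, "-I", "DOCKER-USER", "1", *match, "-j", "DROP"]
        for binary in ("iptables", "ip6tables")
    ]
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
        binary = "iptables" if network.version == 4 else "ip6tables"
        commands.append(
            [binary, "-I", "DOCKER-USER", "1", *match, "-s", str(network), "-j", "ACCEPT"]
        )
    return commands


if __name__ == "__main__":
    # Two sample ranges; the real list comes from Cloudflare's published IP lists.
    for command in build_docker_user_rules(["173.245.48.0/20", "2400:cb00::/32"], "eth0"):
        print(" ".join(command))
```

A real unit would also flush its previous rules before re-applying, which is what makes the re-apply idempotent; that step is omitted here.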

Cloudflare orange-cloud cutover — once the origin firewall is verified active:

  • Flip api.sapari.io DNS from grey (proxy OFF) to orange (proxy ON). DNS will then resolve to CF edge IPs (104.21.x.x / 172.67.x.x). Required for the origin firewall to work without breaking browser → API traffic (which now arrives only via CF)
  • Disable HTTP/3 (QUIC) zone-wide via CF Speed → Optimization. Preventive: orange-cloud + HTTP/3 + SSE produces ERR_QUIC_PROTOCOL_ERROR. Verify: curl -sI https://api.sapari.io/health shows HTTP/2, no Alt-Svc: h3 header
  • Caddy SSE compression must stay off. caddy/Caddyfile deliberately omits encode gzip from the API + Dozzle blocks because gzip buffers 15-byte SSE keepalive frames until the buffer fills, never flushing. CF edge negotiates compression with the browser anyway, so removing Caddy-side compression is a no-op for bytes-on-wire and a fix for streaming. Same trap exists in nginx, Apache, any reverse proxy that batches before encoding

Verification: curl https://<hetzner-public-ip> from a non-CF source should hang and time out (DROP rule); curl https://api.sapari.io/health through CF should still succeed.
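
The gzip buffering trap can be reproduced with the standard library alone. The sketch below is not Caddy's code path; it only shows that a deflate stream holds a small frame back until something forces a flush. The 15-byte keepalive payload is an assumed example frame.

```python
import zlib

# A 15-byte SSE comment frame, used here as a stand-in for the keepalive.
KEEPALIVE = b": keepalive\r\n\r\n"


def bytes_visible_downstream(frame: bytes, flush: bool) -> bytes:
    """Return what a reader can decode after the encoder has been fed one frame."""
    # wbits=31 selects the gzip container on both sides.
    encoder = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, 31)
    decoder = zlib.decompressobj(31)
    wire = encoder.compress(frame)
    if flush:
        # A sync flush pushes the buffered data out without ending the stream.
        wire += encoder.flush(zlib.Z_SYNC_FLUSH)
    return decoder.decompress(wire)


if __name__ == "__main__":
    print(bytes_visible_downstream(KEEPALIVE, flush=False))
    print(bytes_visible_downstream(KEEPALIVE, flush=True))
```

Without the flush the reader decodes nothing, which is the hang described above; a proxy that never flushes per frame behaves the same way for as long as the frames stay small.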

6.4 Performance baseline (post traffic accumulation)

Need ~2-3 days of real traffic before meaningful:

  • CF Workers Analytics baseline: capture categorized 4xx/5xx rates from prod traffic
  • R2 load test: validates Worker edge-cache behavior under real load
  • Re-do performance audit with real Logfire span data: replaces static-code audit with measured p95/p99

6.5 Bundle audit (parallel work)

  • Frontend bundle audit: rollup-plugin-visualizer baseline + lazy splits. Chip at the main chunk; ~3 days of work

6.6 Annual rotations

  • Calendar reminder: rotate MEDIA_TOKEN_SECRET annually. Procedure in media-token-rotation.md
  • Calendar reminder: rotate SECRET_KEY annually (forces all sessions to re-auth — schedule for low-traffic window)

Operational Gotchas — Lessons from Staging

Concentrated reference of every "this bit us last time." Skim once before Phase 1; come back to the relevant section if something goes sideways.

Hetzner & the host

  • Backups: enable at provisioning time, not after. ~20% surcharge for 7-day rolling snapshots covers Redis state, RabbitMQ queues, Caddy certs, .env. Toggle at order time; flipping later leaves a no-backup window
  • Memory limits should be conservative — prefer task failure over daemon crash. Render worker's 6 GB limit is intentional. A render that needs >6 GB fails with a clean FFmpegResourceError, refunds credit, notifies user. Generous limit + daemon swap = worse UX and operator nightmare
  • Zero swap is Hetzner default; 4 GB swap prevents swapless OOM under render spikes (Phase 2.4)
  • NTP load-bearing for: (1) Caddy DNS-01 ACME signature verification (fails opaquely if skew >5 min); (2) media-token JWT TTL verification (skew >5 s breaks playback)

Tailscale

  • Reuse the staging OAuth client — tagged tag:ci, works for both environments
  • The OAuth client is fragile — if deleted from Tailscale admin, all CI deploys break. Document its existence in your ops runbook
  • SSH_HOST in GitHub Secrets is the tailnet IP (100.x.y.z), not public IPv4

Caddy

  • F1 — bind-mount inode lock (NOT YET FIXED). git reset --hard in deploy.sh replaces caddy/Caddyfile via tempfile rename, changing inode. Running container's bind-mount points at old inode. docker compose restart caddy does NOT fix it; only docker compose -f docker-compose.prod.yml up -d --force-recreate caddy does. Burn into muscle memory: any Caddyfile change → --force-recreate caddy
  • DNS-01 over HTTP-01 deliberate — no port 80, no scrambling around HTTP-01 timing
  • Healthcheck endpoint is :2020, not :2019. :2019 is admin API (disabled). Match :2020 if you ever hand-write a probe
  • X-Forwarded-For must derive from cf-connecting-ip, not raw header — CF strips original. Caddyfile translates back so backend logs see real IPs (rate-limiting, fraud, audit logs depend on it)
  • First cert acquisition takes ~30-60 s. No cert after 2 min → check docker logs caddy for ACME errors (usually CLOUDFLARE_API_TOKEN permission or DNS not propagated)
  • Do NOT add encode gzip to the API block. SSE keepalive frames never fill the gzip buffer → connection hangs. CF edge handles compression with the browser end-to-end; Caddy-side compression is redundant AND breaks streaming

Cloudflare

  • Free Universal SSL covers depth-1 subdomains only. dozzle-staging.sapari.io works free; dozzle.staging.sapari.io (depth-2) requires Advanced Cert Manager (paid). Keep ops subdomains depth-1
  • Grey vs orange cloud is consequential. Pre-firewall, backend domains (api.*) MUST be grey (Caddy terminates TLS, double-proxy breaks the cert flow). Post-firewall + Phase 6.3, api.* flips to orange to lock origin to CF's edge
  • CF API token scope creep: scope to Zone:DNS:Edit on sapari.io zone only. Token scope can't be narrowed retroactively
  • Custom domain attachment for the Worker is in dashboard, NOT wrangler.toml. Declaring [[routes]] + Pages binding = error 10144
  • HTTP/3 + SSE behind orange-cloud → ERR_QUIC_PROTOCOL_ERROR. Disable HTTP/3 zone-wide before flipping api.* orange

R2 + Worker

  • Buckets must pre-exist before Worker deploys. Wrangler doesn't create buckets; binds to existing. Missing bucket = cryptic "no such binding" runtime error
  • MEDIA_TOKEN_SECRET byte-identity is non-negotiable. Backend HS256-signs, Worker HS256-verifies. One byte different → 100% playback 401s. Verification protocol in Phase 4.4
  • Worker route order is first-match-wins. /media/v1/* before /api/* (Phase 4.5)
  • wrangler deploy may report "No deploy targets" — normal. With workers_dev = false and no [[routes]], use npx wrangler versions deploy --env <env>
  • Versioning toggle is per-bucket, manual, in dashboard. Once a file is deleted in a non-versioned bucket, no recovery path
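
Why byte-identity matters for MEDIA_TOKEN_SECRET can be seen in a stdlib-only HS256 sketch. This is not the backend's or the Worker's implementation; the helper names and the claims are hypothetical. It only shows that a single extra byte in the key (a trailing newline from a copy-paste is a common source) changes the signature.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a compact JWT signed with HMAC-SHA256."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode())
    payload = _b64url(json.dumps(claims, separators=(",", ":")).encode())
    signing_input = f"{header}.{payload}"
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, secret: bytes) -> bool:
    """Recompute the signature with `secret` and compare in constant time."""
    signing_input, _, signature = token.rpartition(".")
    expected = _b64url(hmac.new(secret, signing_input.encode(), hashlib.sha256).digest())
    return hmac.compare_digest(signature, expected)


if __name__ == "__main__":
    secret = b"example-secret"  # placeholder, not a real secret
    token = sign_hs256({"sub": "clip-123"}, secret)
    print(verify_hs256(token, secret))
    print(verify_hs256(token, secret + b"\n"))
```

The second check fails, which is the "100% playback 401s" case: every token is rejected, not just some.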

Backend env vars

  • Production security validator hard-fails on: weak SECRET_KEY (<32 chars), POSTGRES_PASSWORD=postgres, CREATE_TABLES_ON_STARTUP=true. Soft-warns on: Redis without password, CORS *, DEBUG=true, docs in prod, weak admin creds, sessions >120 min
  • OAUTH_REDIRECT_BASE_URL: root only, no path, no trailing slash. Code appends /api/v1/auth/oauth/callback/<provider>. Adding a path doubles it; trailing slash breaks some providers' callback registration
  • MEDIA_PROXY_BASE_URL: must match user-facing domain (https://app.sapari.io), not backend domain. Wrong value → clips try to play from api.sapari.io (no Worker route there) and 404
  • CORS_ORIGINS must include the landing origin. Newsletter signup form on sapari.io POSTs to api.sapari.io/api/v1/newsletter/subscribe; if sapari.io isn't in CORS_ORIGINS, the browser blocks the request and the form renders "Misconfigured"
  • Four Redis DBs share one container by design (cache=0, rate-limiter=1, sessions=2, taskiq result=3). Don't consolidate
  • RabbitMQ user MUST NOT be guest/guest. Override via RABBITMQ_DEFAULT_USER/RABBITMQ_DEFAULT_PASS in compose env, plus TASKIQ_RABBITMQ_USER/_PASSWORD in .env. Generate with openssl rand -hex 32
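
The OAUTH_REDIRECT_BASE_URL rule (root only, no path, no trailing slash) is easy to check before deploying. The sketch below is a hypothetical pre-flight check, not the backend's validator; only the callback path is taken from the text above.

```python
from urllib.parse import urlsplit

# Path the backend is described as appending to the base URL.
CALLBACK_PATH = "/api/v1/auth/oauth/callback/{provider}"


def validate_redirect_base(base_url: str) -> str:
    """Accept only an https origin with no path, query, or fragment."""
    parts = urlsplit(base_url)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"expected an https origin, got {base_url!r}")
    # A trailing slash shows up as path == "/", so it is rejected here too.
    if parts.path or parts.query or parts.fragment:
        raise ValueError(f"root only, no path or trailing slash: {base_url!r}")
    return base_url


def callback_url(base_url: str, provider: str) -> str:
    """Build the callback URL that gets registered with the provider."""
    return validate_redirect_base(base_url) + CALLBACK_PATH.format(provider=provider)


if __name__ == "__main__":
    print(callback_url("https://app.sapari.io", "google"))
```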

Database & migrations

  • Neon Launch tier is required for prod, not Free
  • Direct endpoint over pooled — convention #16. Pooled endpoint uses PgBouncer in transaction mode, breaks asyncpg's prepared-statement cache (3-4× round-trips per query). Take direct, accept rare connection blip
  • Two-deploy rule for destructive migrations. Drop column? Rename? Ship the code change first (so the running code no longer depends on the old column or name), let it soak, ship the migration second. Doing both in one deploy = rollback to prior image incompatible with new schema → stuck
  • CONFIRM_PRODUCTION_MIGRATION=yes is the Alembic env.py prod gate. Deploy scripts pass it automatically. Manual migrations on the server need it set explicitly
  • Alembic head baked into image labels. rollback.sh reads LABEL sapari.alembic_head from target image, compares against live DB. Mismatch → abort
  • Neon time-travel = 6-hour restore window (Free/Launch tier). Branches → Restore → pick a timestamp. Applies in ~30 s
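
The DATABASE_URL rules in this runbook (asyncpg scheme, ssl=require rather than sslmode=require, no channel_binding, direct rather than pooled endpoint) can be linted before the first deploy. A minimal sketch with a hypothetical function name; it assumes Neon's pooled hostnames carry a -pooler marker, which should be confirmed against the connection string shown in the Neon dashboard.

```python
from urllib.parse import parse_qs, urlsplit


def lint_database_url(url: str) -> list[str]:
    """Return a list of problems; an empty list means the URL passes these checks."""
    problems: list[str] = []
    parts = urlsplit(url)
    query = parse_qs(parts.query)

    if parts.scheme != "postgresql+asyncpg":
        problems.append("scheme should be postgresql+asyncpg")
    if "sslmode" in query:
        problems.append("use ?ssl=require, not sslmode=require (psycopg2 syntax)")
    elif query.get("ssl") != ["require"]:
        problems.append("missing ?ssl=require")
    if "channel_binding" in query:
        problems.append("drop channel_binding; asyncpg negotiates SCRAM itself")
    # Assumption: pooled Neon hosts contain "-pooler" in the hostname.
    if "-pooler" in (parts.hostname or ""):
        problems.append("pooled endpoint detected; use the direct endpoint")
    return problems


if __name__ == "__main__":
    bad = "postgresql+asyncpg://u:p@ep-x-pooler.example.test/db?sslmode=require&channel_binding=require"
    for problem in lint_database_url(bad):
        print("-", problem)
```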

TaskIQ + RabbitMQ

  • rabbitmq_delayed_message_exchange plugin is REQUIRED. Without it, SmartRetryMiddleware's exponential backoff is silently ignored — failed downloads retry immediately, hit same error, get dropped. Plugin enabled via rabbitmq/Dockerfile
  • Broker (RabbitMQ) and result backend (Redis) deliberately separate. Redis crash → results lost, queue survives, tasks retry. Consolidated would lose both on Redis incident
  • Priority queue mapping: viral=3, creator=2, hobby=1, free=0. Set via .kicker().with_labels(priority=N).kiq(...). New tier means updating both code enum AND queue priority levels in compose
  • No eager tasks import in module __init__.py (Convention #17). Causes a circular import via infrastructure.taskiq → workers/shared/context → modules.email.service → modules.email.__init__ → tasks → infrastructure.taskiq mid-load. Workers crash-loop silently under taskiq's process manager (containers report Up, RestartCount=0, but no work progresses). CI grep enforces
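
The eager-import rule lends itself to a mechanical check. The sketch below is a hypothetical Python equivalent of such a CI guard, not the project's actual grep: it flags lines in a module __init__.py that import a tasks module at import time.

```python
import re

# Matches: `from .tasks import x`, `from pkg.tasks import x`,
# `from . import tasks`, and `import pkg.tasks`.
EAGER_TASKS_IMPORT = re.compile(
    r"\s*(?:from\s+[\w.]*\btasks\s+import\b"
    r"|from\s+\.+\s+import\s+.*\btasks\b"
    r"|import\s+[\w.]*\btasks\b)"
)


def find_eager_task_imports(source: str) -> list[int]:
    """Return 1-based line numbers of eager tasks imports in `source`."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if EAGER_TASKS_IMPORT.match(line)
    ]


if __name__ == "__main__":
    init_py = "from .service import EmailService\nfrom .tasks import send_email\n"
    print(find_eager_task_imports(init_py))
```

In CI this would run over every modules/*/__init__.py and fail the build on a non-empty result; that file layout is an assumption based on the import chain quoted above.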

Stripe

  • Test-mode keys on prod = silent failure. Live-mode keys on staging = real charges
  • Webhook signing secret per-endpoint, not per-account. Rotating endpoint generates new secret; old stops working
  • Webhook signing secret mismatch is silent. Backend returns 401; Stripe keeps retrying with backoff (up to 3 days in live mode), but every attempt fails the same way. User paid; sees no entitlements. Symptom appears 5-30 min post-charge
  • Idempotency keys on Stripe API write calls — backend uses these. Add new write paths with idempotency keys
  • Tier 3 + Tier 4 products are auto-seeded on first deploy. Don't pre-create
  • _to_dict() boundary helper in webhooks.py converts Stripe's StripeObject to a plain dict at the webhook entry point. Required because StripeObject is dict-like but not a dict — downstream type-checked code (FastCRUD, Pydantic models) raises on direct pass-through

Postmark

  • DKIM + SPF + Return-Path: all three or none. Verify all green before deploying
  • Sender reputation builds slowly. Don't blast 1000+ users on day 1 — drip
  • Separate Postmark server token per environment. Sharing muddles deliverability stream
  • Email broker has no SmartRetry. A Postmark / RabbitMQ outage during POST /newsletter/subscribe leaves the subscriber row in PENDING. The _recover_pending_newsletter_subscribers cron sweep is the recovery path (re-queues confirmation emails for rows >30 min old, max 3 attempts)
  • CAN-SPAM postal address + entity name in base.html + base.txt footers are legally required. The 5-case test_template_footer.py pins the copy
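
The selection rule behind the pending-subscriber sweep (rows older than 30 minutes, at most 3 attempts) can be sketched as a pure function. The Subscriber fields and the function name are hypothetical; the real cron works against database rows and is not shown in this runbook.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Subscriber:
    # Hypothetical row shape, for illustration only.
    email: str
    status: str
    created_at: datetime
    attempts: int


def rows_to_requeue(rows: list[Subscriber], now: datetime,
                    min_age: timedelta = timedelta(minutes=30),
                    max_attempts: int = 3) -> list[Subscriber]:
    """Pick PENDING rows that are old enough and still have attempts left."""
    return [
        row for row in rows
        if row.status == "PENDING"
        and now - row.created_at > min_age
        and row.attempts < max_attempts
    ]


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    rows = [
        Subscriber("old@example.com", "PENDING", now - timedelta(minutes=45), 1),
        Subscriber("new@example.com", "PENDING", now - timedelta(minutes=5), 0),
        Subscriber("spent@example.com", "PENDING", now - timedelta(hours=2), 3),
    ]
    print([row.email for row in rows_to_requeue(rows, now)])
```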

Observability

  • SQLAlchemy instrumentation: ON. Redis: OFF. SQLAlchemy spans solve "why is this endpoint slow" (high signal). Redis ops sub-millisecond, uniform (high volume, low signal). Per-worker kill-switches exist
  • Span taxonomy = 6 categories: pipeline.parent / step.* / taskiq.* / service.* / ext.* / cron. Hand-named spans break dashboard filters
  • Per-worker service.name distinguishes workers in queries. Web=sapari-api; workers=sapari-{email,analysis,render,download,proxy,asset-edit}; scheduler=sapari-scheduler
  • Beszel shoutrrr Discord URL format — token first, then ID, in discord:// URI form
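
The shoutrrr URL ordering is the reverse of the Discord webhook URL, where the ID comes first. A small hypothetical helper that converts one into the other, assuming the usual https://discord.com/api/webhooks/<id>/<token> shape:

```python
from urllib.parse import urlsplit


def to_shoutrrr_discord_url(webhook_url: str) -> str:
    """Convert a Discord webhook URL into shoutrrr's discord://token@id form."""
    segments = urlsplit(webhook_url).path.strip("/").split("/")
    # Expected path: /api/webhooks/<webhook id>/<token>
    if len(segments) != 4 or segments[:2] != ["api", "webhooks"]:
        raise ValueError(f"not a Discord webhook URL: {webhook_url!r}")
    webhook_id, token = segments[2], segments[3]
    return f"discord://{token}@{webhook_id}"


if __name__ == "__main__":
    # Placeholder ID and token, not real credentials.
    print(to_shoutrrr_discord_url("https://discord.com/api/webhooks/123456/abcDEF"))
```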

CI/CD

  • Image tag strategy: floating production tag for normal deploys, SHA tag for rollbacks. Always pin SHA in rollback.yml inputs
  • Public GHCR images today; flip to private requires docker login
  • Migrations run BEFORE container restart in deploy.sh. Migration failure aborts deploy; old code keeps running on old schema. Don't reorder
  • git clone on the server, not rsync-only. Operators can SSH in, edit docker-compose.prod.yml, restart without CI. When CI is down, the server's git repo is your unblock

DNS + cutover

  • Lower TTL on api.sapari.io to 60s before cutover (Phase 2.2). Bump back to Auto once the smoke test is green (Phase 5.4)
  • CF Pages CNAME flattening is automatic at apex. sapari.io shows as A record, not CNAME. CF feature, not bug
  • Brief downtime per deploy is accepted. No blue/green, no Swarm. Plain docker compose up -d → ~10 s API restart → frontend shows maintenance screen → React Query retries on resume

Rollback Plan

Three layers.

Code rollback (most common):

ssh deploy@<hetzner-ip>
cd ~/sapari
./scripts/deployment/rollback.sh <previous-sha>
Aborts if the target image's Alembic head differs from the live DB. Pass --ignore-migration-warning if the schema is backwards-compatible.
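
The gate that rollback.sh applies can be sketched as a comparison of two revision IDs. This is an illustration under assumptions, not the script itself: the function names are hypothetical, and only the label name and the override flag come from this runbook. Reading the label uses docker image inspect with a Go template.

```python
import subprocess

ALEMBIC_HEAD_LABEL = "sapari.alembic_head"  # label name as given in this runbook


def image_alembic_head(image: str) -> str:
    """Read the Alembic head baked into an image's labels via docker."""
    template = '{{ index .Config.Labels "%s" }}' % ALEMBIC_HEAD_LABEL
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", template, image],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def rollback_decision(image_head: str, db_head: str,
                      ignore_migration_warning: bool = False) -> str:
    """Return "proceed", "proceed-with-warning", or "abort"."""
    if image_head == db_head:
        return "proceed"
    # Heads differ: the target image was built against another schema revision.
    return "proceed-with-warning" if ignore_migration_warning else "abort"
```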

Schema rollback (Neon time travel):

  • Free/Launch tier: 6-hour restore window
  • Neon dashboard → Branches → Restore to a point in time; applies in ~30 seconds
  • Use this if alembic downgrade -1 isn't safe

DNS-level rollback (last resort):

  • TTL was 60s during cutover; point the record back to the old IP via a CF DNS edit
  • Reserve this for "the new server is fundamentally broken". It is almost never the right answer if Caddy/health checks are green but app behavior is wrong


Environment Variable Reference

Full .env template for production. Bold = enforced by security validator (app won't start without it set correctly).

# === App ===
ENVIRONMENT=production
DEBUG=false
SECRET_KEY=<openssl rand -base64 64>             # 32+ chars
FRONTEND_URL=https://app.sapari.io
API_PUBLIC_URL=https://api.sapari.io             # newsletter email link base
LANDING_URL=https://sapari.io                    # newsletter redirect target
CONTACT_EMAIL=hello@sapari.io
LOG_LEVEL=INFO

# === Image registry ===
GHCR_OWNER=benavlabs
IMAGE_TAG=production

# === Database (Neon Launch tier, direct endpoint) ===
DATABASE_URL=postgresql+asyncpg://<user>:<password>@<direct-endpoint>/<db>?ssl=require
# Note: use ?ssl=require — NOT ?sslmode=require (psycopg2 syntax). Drop any
# &channel_binding=require Neon's UI may copy in — asyncpg auto-negotiates SCRAM.
CREATE_TABLES_ON_STARTUP=false
POSTGRES_POOL_SIZE=20                            # per-process
POSTGRES_MAX_OVERFLOW=10                         # per-process

# === Redis ===
CACHE_BACKEND=redis
CACHE_REDIS_HOST=redis
CACHE_REDIS_PASSWORD=<random>
RATE_LIMITER_BACKEND=redis
RATE_LIMITER_REDIS_HOST=redis
RATE_LIMITER_REDIS_DB=1
RATE_LIMITER_REDIS_PASSWORD=<random>
SESSION_BACKEND=redis
SESSION_REDIS_HOST=redis
SESSION_REDIS_DB=2
SESSION_REDIS_PASSWORD=<random>
SESSION_SECURE_COOKIES=true
SESSION_TIMEOUT_MINUTES=480

# === RabbitMQ ===
TASKIQ_BROKER_TYPE=rabbitmq
TASKIQ_RABBITMQ_HOST=rabbitmq
TASKIQ_RABBITMQ_USER=sapari                      # NOT guest
TASKIQ_RABBITMQ_PASSWORD=<random>                # NOT guest
TASKIQ_REDIS_HOST=redis
TASKIQ_REDIS_DB=3

# === Storage (R2) ===
STORAGE_ENDPOINT=https://<account-id>.r2.cloudflarestorage.com
STORAGE_ACCESS_KEY_ID=<from R2 dashboard>
STORAGE_SECRET_ACCESS_KEY=<from R2 dashboard>
STORAGE_BUCKET_RAW=sapari-raw
STORAGE_BUCKET_EXPORTS=sapari-exports
STORAGE_BUCKET_ASSETS=sapari-assets
STORAGE_MAX_UPLOAD_SIZE_MB=10240                 # 10 GB

# === Video duration cap ===
MAX_VIDEO_DURATION_MINUTES=90                    # aligns with render timeout

# === Media proxy (CF Worker) ===
MEDIA_TOKEN_SECRET=<openssl rand -base64 32>     # MUST match Worker secret byte-for-byte
MEDIA_TOKEN_KID=v1
MEDIA_TOKEN_TTL_SECONDS=300
MEDIA_PROXY_BASE_URL=https://app.sapari.io

# === CORS ===
# Must include landing origin or newsletter signup POST is CORS-blocked.
CORS_ORIGINS=https://app.sapari.io,https://sapari.io
CORS_ALLOW_CREDENTIALS=true

# === Caddy / TLS ===
CLOUDFLARE_API_TOKEN=<scoped Zone:DNS:Edit on sapari.io>

# === OAuth ===
OAUTH_REDIRECT_BASE_URL=https://app.sapari.io
OAUTH_GOOGLE_CLIENT_ID=<from Google Cloud Console>
OAUTH_GOOGLE_CLIENT_SECRET=<from Google Cloud Console>
OAUTH_GITHUB_CLIENT_ID=<from GitHub OAuth App>
OAUTH_GITHUB_CLIENT_SECRET=<from GitHub OAuth App>

# === Stripe (live mode) ===
STRIPE_SECRET_KEY=sk_live_...
STRIPE_PUBLISHABLE_KEY=pk_live_...
STRIPE_WEBHOOK_SECRET=whsec_...
STRIPE_TEST_MODE=false

# === Email (Postmark) ===
POSTMARK_SERVER_TOKEN=<from Postmark>
EMAIL_SENDER_ADDRESS=hello@sapari.io
EMAIL_SENDER_NAME=Vitoria from Sapari
EMAIL_TEST_MODE=false

# === AI ===
OPENAI_API_KEY=sk-...
DEEPSEEK_API_KEY=...

# === Admin ===
ADMIN_USERNAME=<not 'admin'>
ADMIN_EMAIL=<address on sapari.io>
ADMIN_PASSWORD=<12+ chars, strong>
ADMIN_EMAIL_DOMAIN=sapari.io

# === Observability ===
LOGFIRE_TOKEN=<from logfire.pydantic.dev>
LOGFIRE_ENVIRONMENT=production
LOGFIRE_SERVICE_NAME=sapari-api

# === Feature flags / safety ===
PRODUCTION_SECURITY_VALIDATION_ENABLED=true
ENABLE_DOCS_IN_PRODUCTION=false

# === Resource limits (parameterized in docker-compose.prod.yml) ===
# Defaults in compose match staging (4 vCPU / 16 GB). These overrides are for prod (CPX62).

WEB_WORKERS=4
WEB_MEMORY=6g
WEB_CPUS=4.0
WEB_MEMORY_RESERVATION=1g

RENDER_MEMORY=6g
RENDER_CPUS=4.0
RENDER_FFMPEG_THREADS=4
RENDER_MEMORY_RESERVATION=2g

PROXY_MEMORY=3g
PROXY_CPUS=3.0
PROXY_FFMPEG_THREADS=3

DOWNLOAD_MEMORY=2g
DOWNLOAD_CPUS=2.0

ANALYSIS_MEMORY=2g
ANALYSIS_CPUS=1.5
ANALYSIS_TASKIQ_CONCURRENCY=4

REDIS_MAXMEMORY=500mb
REDIS_MEMORY=768m
REDIS_CPUS=0.5

RABBITMQ_MEMORY=1g

CADDY_MEMORY=256m
CADDY_CPUS=0.25

See also

For deeper detail on any phase: