Production Launch Runbook
End-to-end procedure for taking Sapari from "staging works" to "production live" on a self-managed Hetzner CPX62 (16 vCPU / 32 GB) sized for up to 1000 concurrent users.
This document is the launch-day spine: pace yourself through Phases 1-6 in order, and jump into a dedicated runbook (linked at the bottom) when a step needs more depth. Time budget: ~6-8 hours of focused work spread across a day or two; Phase 5 (cutover) takes ~2 hours including smoke testing.
For ongoing operations (steady-state deploys, rollbacks, monitoring) see deployment.md. This runbook only covers the first-time launch and the operational gotchas that go with it.
Topology
                   ┌────────────────────────────┐
                   │ Cloudflare DNS (sapari.io) │
                   └──────────────┬─────────────┘
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        │                         │                         │
        ▼                         ▼                         ▼
    sapari.io               app.sapari.io             api.sapari.io
(Pages: landing)          (Pages: frontend)           (grey cloud)
                                  │                         │
                                  │ /api/*, /media/v1/*     │
                                  ▼                         ▼
                         ┌──────────────────┐      ┌──────────────────┐
                         │ CF Worker        │      │ Hetzner CPX62    │
                         │ (sapari-proxy-   │      │ 16 vCPU / 32 GB  │
                         │  production)     │──────│ Caddy → web      │
                         └────────┬─────────┘      │ + 6 TaskIQ       │
                                  │                │   workers        │
                                  │ R2 bindings    │ + scheduler      │
                                  ▼                │ + Redis          │
                           ┌──────────────┐        │ + RabbitMQ       │
                           │ R2 buckets   │        └────────┬─────────┘
                           │ raw/exports/ │                 │
                           │ assets       │                 │ DATABASE_URL
                           └──────────────┘                 ▼
                                                   ┌──────────────────┐
                                                   │ Neon Postgres    │
                                                   │ (Launch tier)    │
                                                   └──────────────────┘
External integrations: Stripe (webhooks → api.sapari.io), Postmark (email), OpenAI + DeepSeek (AI), Logfire (observability), Sentry (frontend, optional), Discord (Beszel alerts, optional), Tailscale (CI/CD SSH).
Capacity Sizing — Target
Goal: keep the app decently usable for up to 1000 concurrent users on a single CPX62 box.
Per-user resource footprint (verified in infrastructure/events/subscriber.py:266-312):
- 1 long-lived SSE response held by one uvicorn worker
- 2 Redis pubsub connections (notification channel + asset channel, fan-in pattern)
- +1 more Redis pubsub conn when in the editor (subscribe_to_project)
Aggregate at 1000 concurrent users:
- ~3000-4000 file descriptors held by the web container
- ~2300 Redis pubsub connections
- ~50-100 simultaneous active request senders out of 1000 (rest idle on SSE)
- Realistic concurrent DB queries: ~30-50
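The aggregate figures follow from the per-user footprint by simple arithmetic. A minimal sketch, assuming roughly 30% of users have a project open in the editor (that share is inferred from the ~2300 figure, not stated in this runbook); the function names are illustrative only:

```shell
# Redis pubsub connections: 2 per user, plus 1 for each user in the editor.
# editor_pct is an ASSUMED share of users with a project open.
pubsub_conns() {
  local users=$1 editor_pct=$2
  echo $(( users * 2 + users * editor_pct / 100 ))
}

# File descriptors in the web container: 1 SSE response plus 2 or 3 Redis
# connections per user, printed as a low-high range.
fd_range() {
  local users=$1
  echo "$(( users * 3 ))-$(( users * 4 ))"
}

pubsub_conns 1000 30
fd_range 1000
```

Re-running with a different user count or editor share shows how quickly the Redis connection count moves relative to the limits configured in Phase 3.1.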
Box budget:
| Component | CPU (vCPU) | Memory (GB) |
|---|---|---|
| Web (FastAPI, 4 workers) | 4.0 | 6.0 |
| 6 TaskIQ workers + scheduler | 11.6 | 14.5 |
| Caddy + Redis + RabbitMQ | 1.25 | 2.0 |
| Dozzle + Beszel hub + agent | 0.45 | 0.4 |
| Allocated | ~17.3 | ~22.9 |
| CPX62 | 16 | 32 |
| Headroom (incl. OS/page cache/FFmpeg surge) | (-1.3 oversubscription, fine on async I/O) | 9.1 (28%) |
CPU oversubscription is fine on shared-vCPU CPX62 because Sapari is async-I/O-bound (asyncpg, httpx, Redis) — workers spend most cycles waiting on network. The CPU caps are ceilings, not reservations.
Architectural Ceilings — Acknowledged for v1
Limits inherent to the single-box architecture. Tuning does NOT solve them; they need product/UX decisions or post-v1 horizontal scaling. Listed up front so they don't get lost:
- Render throughput is single-threaded. One render worker, one FFmpeg job at a time. A 60-min source render takes ~30-60 min wallclock. At 1000 concurrent users, even 0.5% triggering a render in the same hour = 5 queued, last user waits 2+ hours. v1 mitigation: explicit "Queued" UX (already in export-progress UI — verify copy in smoke test). Post-v1 fix: render worker replicas.
- Proxy throughput is single-threaded (same shape, lower demand — only triggers on codec mismatch on upload).
- External API rate limits:
  - OpenAI Whisper: ~50 RPM on Tier 1. Verify Tier 2+ for 1000-user target.
  - DeepSeek: verify plan limits.
  - Postmark: realistic email volume at 1000 users is 2-5k/month (auth + billing events; render/analysis are SSE, not email). Starter ($15/mo, 10k cap) covers steady state; 50k tier ($50/mo) buys launch-burst headroom.
- SSE reconnect storms. Web container restart drops 1000 SSE connections simultaneously; frontend exponential backoff (1s, 2s, 4s, 8s, 16s, cap 30s) helps but resets on onopen — second-wave thunder is possible. Mitigation (jitter on first retry) is post-launch.
- RabbitMQ queue depth visibility. No default operator dashboard for queue depth. Mgmt UI at :15672 shows it but isn't exposed. Add queue-depth Logfire span or admin endpoint post-launch.
- Single-box failure mode. OOM, FFmpeg subprocess crash, RabbitMQ memory pressure — any one takes down some-or-all users. v1 accepts; manual recovery in §Operational Gotchas.
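The backoff schedule quoted in the SSE item (1s, 2s, 4s, 8s, 16s, cap 30s) and the jitter idea can be sketched in shell. This only illustrates the arithmetic; the real reconnect logic lives in the frontend, and both function names are hypothetical:

```shell
# Delay in seconds before reconnect attempt N (0-based): 1, 2, 4, 8, 16, then 30.
backoff_delay() {
  local attempt=$1 cap=30 delay
  # Clamp the exponent so the shift stays small for very large attempt counts.
  if [ "$attempt" -gt 5 ]; then attempt=5; fi
  delay=$(( 1 << attempt ))
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

# "Full jitter" variant: a random whole number of seconds in [1, backoff_delay].
jittered_delay() {
  local base
  base=$(backoff_delay "$1")
  awk -v max="$base" 'BEGIN { srand(); print int(rand() * max) + 1 }'
}
```

With jitter, 1000 clients that lose their connection at the same instant spread their retries across the window instead of all reconnecting on the same second.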
Prerequisites — Accounts You Need
Sign up for these before starting Phase 1. None require coordination with another step.
- Hetzner Cloud — for the production server (CPX62 or larger)
- Cloudflare — DNS, Pages, Workers, R2, Access
- Neon — Launch tier ($0.106/CU-hour; ~$76/mo if you set min CU=1, default is scale-to-zero=$0 idle)
- Stripe — live mode activated (KYC complete, payouts configured)
- Postmark — sender domain verification ready; Starter plan ($15/mo, 10k emails) sufficient for 1000 users in steady state
- OpenAI — billing-enabled account; Tier 2+ recommended for 1000-user Whisper RPM
- DeepSeek — billing-enabled account
- Logfire — free tier is fine to start
- GitHub — deploy key + production environment secrets
- Tailscale — OAuth client configured (same one as staging works for prod)
- Optional: Discord — webhook URL for Beszel alerts
- Optional: Sentry — frontend project for error tracking
Phase 1 — Provision External Services
Order doesn't matter; everything is independent. Aim to finish in one sitting so secrets are fresh.
1.1 Neon Postgres (Launch tier)
- Create new Neon project named sapari-production
- Region: pick the same one as the Hetzner box (Ashburn → us-east-1, Hillsboro → us-west-2, Falkenstein → eu-central-1)
- Subscribe to Launch tier ($0.106/CU-hour). Free tier is too restrictive for prod
- Scale-to-zero: keep enabled (default 5 min). pool_pre_ping=True (session.py:14-19) + pool_recycle=300 catch stale-after-pause connections before any query runs. Watch for Logfire db.* p99 spikes post-launch; if seen, set min CU=1 ($76/mo always-on)
- Copy the direct endpoint connection string (NOT pooler) — convention #16
- Convert to async form: postgresql+asyncpg://...
- Save as DATABASE_URL for Phase 3
Connection budget: at min CU=1, max_connections=419. Cluster ceiling at WEB_WORKERS=4 is web 4×30 + workers 6×30 + scheduler 30 = 330. Comfortable 21% headroom. Auto-scales to 16 CU under burst.
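The connection budget can be re-derived whenever WEB_WORKERS or the pool settings change. A rough sketch, assuming each process opens one pool of POSTGRES_POOL_SIZE + POSTGRES_MAX_OVERFLOW = 30 connections (the values set in Phase 3.1); the helper names are hypothetical:

```shell
# Worst-case connections if every process fills its pool and overflow.
pool_ceiling() {
  local web_workers=$1 task_workers=$2 schedulers=$3 per_process=$4
  echo $(( (web_workers + task_workers + schedulers) * per_process ))
}

# Integer percentage of max_connections left unused at that ceiling.
headroom_pct() {
  local max_connections=$1 used=$2
  echo $(( (max_connections - used) * 100 / max_connections ))
}

ceiling=$(pool_ceiling 4 6 1 30)
echo "ceiling=$ceiling headroom=$(headroom_pct 419 "$ceiling")%"
```

Raising WEB_WORKERS to 8 in this sketch pushes the ceiling to 450, above the 419 limit at min CU=1, which is why the worker count and pool size should be changed together.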
1.2 Cloudflare R2 (3 buckets)
- Create buckets: sapari-raw, sapari-exports, sapari-assets
- Enable versioning on all three before any data lands. R2 doesn't version by default; an accidental delete or overwrite is gone forever. Toggle at the bucket level in the dashboard. CANNOT be retroactively enabled to recover already-deleted files
- Create an API token scoped to Object Read & Write on all three buckets only — narrower than account-wide. Token scope is one-time at creation; can't be narrowed retroactively
- Save: STORAGE_ACCESS_KEY_ID, STORAGE_SECRET_ACCESS_KEY, STORAGE_ENDPOINT (the https://<account-id>.r2.cloudflarestorage.com form)
1.3 Cloudflare Worker secret (placeholder)
- MEDIA_TOKEN_SECRET=$(openssl rand -base64 32) — save it; this exact value goes both in the backend .env and as a Worker secret. Byte-identical is load-bearing
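Because a stray newline or space is enough to break the byte-identical requirement, it can help to compare short digests of the two copies rather than eyeballing the secret. A minimal sketch using sha256sum from GNU coreutils; secret_fp is a hypothetical helper and is not claimed to match the fingerprint the backend logs:

```shell
# Print the first 8 hex characters of the SHA-256 of the exact bytes given.
secret_fp() {
  # printf '%s' avoids the trailing newline that echo would append.
  printf '%s' "$1" | sha256sum | cut -c1-8
}

# Example (run wherever each copy of the value is available):
#   secret_fp "$MEDIA_TOKEN_SECRET"
```

If the digest computed from the server's .env value differs from the one computed from the value pasted into the Worker secret prompt, the two copies are not byte-identical.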
1.4 Stripe (live mode)
- Toggle Stripe Dashboard to live mode
- Copy sk_live_... and pk_live_... → STRIPE_SECRET_KEY, STRIPE_PUBLISHABLE_KEY. Test-mode keys (sk_test_) on production fail every transaction silently; live-mode keys on staging will charge real cards. Verify mode visually before copying
- Create webhook endpoint at https://api.sapari.io/api/v1/webhooks/stripe — direct to backend (api.*, NOT app.*). URL won't resolve yet; create it anyway. The CF Worker proxy is for browser API calls; webhooks should hit Caddy → backend directly
- Subscribe to events: customer.subscription.updated, customer.subscription.deleted, invoice.payment_failed, charge.refunded, checkout.session.completed
- Copy webhook signing secret (different from API keys; one per endpoint) → STRIPE_WEBHOOK_SECRET. A wrong webhook secret silently breaks payment processing — backend returns 401, Stripe retries a few times then gives up, subscriptions don't activate. Symptom: "user paid but doesn't see credits." Test reachability post-deploy via Stripe Dashboard's "Send test webhook" button (Phase 4.7)
- Set STRIPE_TEST_MODE=false
- Tier 3 (Creator) and Tier 4 (Viral) products + prices are auto-seeded by seed_stripe_products.py on first deploy; do not pre-create
1.5 Postmark
DNS propagation can take hours — start early in Phase 1 so it's done by Phase 5.
- Add sapari.io as a sender domain
- Add all three DNS records to Cloudflare. Postmark Dashboard surfaces exact values:
  - DKIM (<selector>._domainkey.sapari.io TXT) — signs outbound mail
  - SPF (TXT on apex) — authorizes Postmark to send on your behalf
  - Return-Path (CNAME) — bounce-handling subdomain
- Wait until all three show green in Postmark dashboard before proceeding. Skipping = mail in spam or bounced
- Plan tier: Starter ($15/mo, 10k emails) covers 1000 users in steady state (auth + billing events only — render/analysis/asset events are SSE, not email). Realistic volume ~2-5k emails/month. Upgrade to 50k tier ($50/mo) if you want launch-burst headroom
- Use a separate Postmark server token per environment (one for staging, one for production). Sharing muddles deliverability stream
- Save the production server token → POSTMARK_SERVER_TOKEN
- Sender reputation warning: even with green DKIM/SPF/Return-Path, Postmark starts new domains with low reputation. ISPs throttle. Don't blast 1000+ users on day 1 — drip launch announcements
1.6 OpenAI + DeepSeek
- Create API key at platform.openai.com → OPENAI_API_KEY. Tier 2+ recommended for 1000-user target: Tier 1 caps Whisper at ~50 RPM; sustained burst at scale could 429. OpenAI auto-tiers up with usage history; if you've been on Tier 1, request Tier 2 explicitly via support before launch
- Create API key at platform.deepseek.com → DEEPSEEK_API_KEY. Verify plan limits cover ~80 analyses/hour throughput target
- Set hard spending caps in both dashboards (recommended: $200/mo OpenAI, $50/mo DeepSeek to start; tune after real usage)
1.7 Logfire
- Create project named sapari-production (or share with staging using environment tag)
- Get write token → LOGFIRE_TOKEN
- If sharing with staging: set LOGFIRE_ENVIRONMENT=production in prod .env so spans are tagged
1.8 Optional but recommended
- Sentry — frontend project, copy DSN. Add to frontend/.env.production as VITE_SENTRY_DSN. Frontend's perfMarks.ts already adds breadcrumbs under category perf
- Discord webhook — create in your Discord server. Beszel uses shoutrrr format, NOT raw HTTPS URL. Beszel silently swallows malformed URLs:
Raw Discord URL: https://discord.com/api/webhooks/<webhook-id>/<token>
Shoutrrr format: discord://<token>@<webhook-id>
Note token first, then ID — opposite of the URL form. Verify alerts arrive (stress -c 4 -t 60 triggers CPU>80% alert)
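Since the token and ID swap places between the two forms, a small converter avoids hand-editing mistakes. A sketch using only shell parameter expansion; discord_to_shoutrrr is a hypothetical helper, and the webhook values in the example are placeholders:

```shell
# Convert https://discord.com/api/webhooks/<webhook-id>/<token>
# into discord://<token>@<webhook-id>. Returns non-zero on unexpected input.
discord_to_shoutrrr() {
  local url=$1 rest id token
  rest=${url#https://discord.com/api/webhooks/}
  # If nothing was stripped, the URL did not start with the expected prefix.
  if [ "$rest" = "$url" ]; then return 1; fi
  id=${rest%%/*}
  token=${rest#*/}
  # A missing slash leaves token equal to rest; treat that as malformed.
  if [ -z "$id" ] || [ -z "$token" ] || [ "$token" = "$rest" ]; then return 1; fi
  echo "discord://${token}@${id}"
}

discord_to_shoutrrr "https://discord.com/api/webhooks/123456/abcDEF"
# prints: discord://abcDEF@123456
```

Failing loudly on a malformed URL is deliberate here, given that Beszel itself accepts bad values without complaint.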
Phase 2 — Hetzner CPX62 + Tailscale + Caddy Setup
Allow ~90 min start to finish; longer if first time.
2.1 Pick the box
- Size: CPX62 — 16 shared vCPU, 32 GB RAM, 640 GB SSD, 20 TB transfer, $59.49/mo. Sized to support 1000 concurrent users with the production tuning. CCX33 (8 vCPU dedicated, 32 GB) is a viable swap if you prefer dedicated CPU at higher cost
- OS: Ubuntu 24.04 LTS — setup-server.sh checks for ssh.service (24.04+) vs sshd.service (older); 24.04 is the tested baseline
- Region: match the Neon region from Phase 1.1. The DB-to-server hop is on the critical path for every API request
- SSH key: add yours to the Hetzner project (cloud-init drops it into root@authorized_keys automatically)
- Firewall: attach the existing firewall-tailscale Hetzner Cloud Firewall at server creation if you have one. Defense in depth on top of host-level UFW (which setup-server.sh configures separately). Rule set should cover 443/tcp + 443/udp (HTTP/3) from anywhere, 22/tcp from Tailscale CGNAT (100.64.0.0/10), and IPv6 sources (::/0) — not just IPv4
- Enable automated backups at order time — Hetzner adds ~20% to monthly cost (~$12/mo for CPX62) for 7-day rolling snapshots of the entire disk. Toggle during provisioning; flipping it on after leaves an initial unprotected window. Dramatically cheaper than a custom backup pipeline; covers Redis, RabbitMQ, Caddy certs, .env in one operation
- Note the IPv4 — this becomes api.sapari.io's A record
2.2 Add the DNS record now (low TTL, grey cloud)
Do this before setup-server.sh so DNS has time to propagate by the time Caddy needs it for ACME DNS-01.
- Cloudflare DNS for sapari.io → add A record:
  - Name: api
  - IPv4: <hetzner-ip>
  - Proxy: OFF (grey cloud) — Caddy terminates TLS itself; double-proxying through CF orange-cloud breaks the cert flow at this point. Once stable, flipping to orange-cloud is a post-launch hardening step (see Phase 6.3)
  - TTL: 60 seconds — fast rollback option during cutover. Bump back to Auto after the smoke test (Phase 5.4)
2.3 Run setup-server.sh
setup-server.sh is idempotent (safe to re-run) but takes a required --my-ip flag — whitelists only that IP for SSH on port 22. Wrong IP = locked out (Hetzner has console rescue if needed).
# Find your operator IP first:
curl -s ifconfig.me
# Then on the server, as root:
ssh root@<hetzner-ip>
git clone https://github.com/benavlabs/sapari.git /opt/sapari-bootstrap
cd /opt/sapari-bootstrap
./scripts/deployment/setup-server.sh --my-ip <your-operator-ip> --hostname sapari-prod
What it does (verify each):
| Step | What |
|---|---|
| Hostname | Sets to sapari-prod; updates /etc/hosts |
| deploy user | Created with sudo + docker group, passwordless sudo (intentional for CD; safe because SSH is key-only and tailnet-gated) |
| SSH hardening | Disables root login, disables password auth, requires pubkey |
| UFW firewall | Default deny inbound; allows 443/tcp from anywhere; 22/tcp only from --my-ip. No port 80 — Caddy uses DNS-01 ACME |
| unattended-upgrades + fail2ban | Auto-security-patches; SSH brute-force throttling |
| Docker | Official get.docker.com install |
| GitHub deploy key | Generates Ed25519 keypair at /home/deploy/.ssh/github_deploy_key; pre-seeds known_hosts for github.com. Prints the public key at the end — manually add to repo's Deploy Keys |
| Origin firewall service | Installs sapari-docker-firewall.service (oneshot systemd unit, After=docker.service) that locks the Docker DOCKER-USER chain :443 to Cloudflare's published egress IPs. See Phase 6.3 |
What it does NOT do (manual, in 2.4):
- Install or join Tailscale
- Add swap
- Verify NTP
- Configure GHCR auth (not needed — images are public)

Then register the deploy key the script printed:
- Copy printed Ed25519 public key
- GitHub repo → Settings → Deploy keys → Add deploy key → paste, name it sapari-prod, leave "Allow write access" unchecked → Add
- Verify deploy user can clone: ssh deploy@<hetzner-ip> "ssh -T git@github.com" should print "Hi! ..."
2.4 Manual hardening — swap, NTP, Tailscale
Swap (prevents FFmpeg OOM):
Hetzner cloud images ship with zero swap. Render worker FFmpeg + uploaded video can spike past per-container limits during a large render → swapless OOM takes down the whole box.
# As root or with sudo:
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
swapon --show # verify
- 4 GB swap created and persisted across reboots
NTP (load-bearing for JWT verification):
Media token JWTs have a 5-minute TTL. Clock skew of even ~10s between server and CF Worker breaks playback ("token expired" 401s on freshly-minted URLs). systemd-timesyncd is on by default in Ubuntu 24.04, but verify:
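A minimal way to run that check, assuming the stock timedatectl output format on Ubuntu 24.04 (the same line the 2.7 pre-flight greps for); clock_synced is an illustrative helper name:

```shell
# Reads `timedatectl` output on stdin and succeeds only if the
# "System clock synchronized" line reports yes.
clock_synced() {
  grep -q 'System clock synchronized: yes'
}

# On the server:
#   timedatectl | clock_synced && echo "NTP OK" || echo "clock NOT synced"
```

If the check fails, `systemctl status systemd-timesyncd` is the first place to look.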
- Clock is NTP-synced
Tailscale (CI access path):
Tailscale is pre-installed on Hetzner Ubuntu but not running. Bring up manually:
Then in Tailscale admin console:
- Machines → find sapari-prod → Edit ACL tags → add tag:server
- Verify ACL grants tag:ci → tag:server (or allow-all). The OAuth client used by GitHub Actions is reusable from staging
- Note the tailnet IP (100.x.y.z) — this is what GitHub Actions SSHs to, not the public IPv4. Save as SSH_HOST GitHub secret for the production environment
2.5 GitHub Actions secrets for the production environment
GitHub repo → Settings → Environments → New environment named production. Add:
- SSH_HOST — tailnet IP from 2.4 (not public IPv4)
- SSH_KEY — private half of a fresh Ed25519 keypair generated locally; install public half on server with ssh-copy-id deploy@<tailnet-ip> (over Tailscale). Keep this key separate from your operator key — CI-only, revocable independently
- TS_OAUTH_CLIENT_ID and TS_OAUTH_SECRET — same as staging environment
Set deploy gate (typed YES confirmation, required reviewer, deploy window) on the production environment's protection rules if desired.
2.6 Caddy — what to expect, and the F1 gotcha
Caddy is one of the containers first-deploy.sh will start in Phase 3. A few things to know:
- Config: caddy/Caddyfile (committed). Reverse-proxies api.sapari.io → web:8000, sets X-Forwarded-For from Cloudflare's cf-connecting-ip so backend logs see real client IPs
- TLS via Cloudflare DNS-01: needs CLOUDFLARE_API_TOKEN in .env (Phase 3.1) with Zone:DNS:Edit permission on sapari.io only. First deploy: cert acquisition takes ~30-60 seconds; check docker logs caddy if no successful cert event
- F1 (known gotcha — not yet fixed in deploy.sh): docker-compose.prod.yml bind-mounts ./caddy/Caddyfile. When deploy.sh runs git reset --hard, git replaces the file via tempfile rename — inode changes, container's bind mount points at stale inode. Caddy doesn't see new config until the container is force-recreated.
Workaround: any time you change the Caddyfile, after deploy completes, run on the server: docker compose -f docker-compose.prod.yml up -d --force-recreate caddy. Burn into muscle memory.
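The inode behaviour behind F1 can be reproduced locally without Docker. The sketch below works in a throwaway temp directory and compares an in-place write with a write-then-rename; it illustrates the filesystem mechanism only, and the file names are stand-ins:

```shell
# Print the inode number of a file.
inode_of() { ls -i "$1" | awk '{print $1}'; }

demo_dir=$(mktemp -d)
printf 'v1\n' > "$demo_dir/Caddyfile"
before=$(inode_of "$demo_dir/Caddyfile")

# In-place write: the same inode is truncated and rewritten, so a
# single-file bind mount would keep seeing the current content.
printf 'v2\n' > "$demo_dir/Caddyfile"
after_write=$(inode_of "$demo_dir/Caddyfile")

# Write-then-rename: the path now points at a different inode, while a
# bind mount taken earlier still references the old one.
printf 'v3\n' > "$demo_dir/Caddyfile.tmp"
mv "$demo_dir/Caddyfile.tmp" "$demo_dir/Caddyfile"
after_rename=$(inode_of "$demo_dir/Caddyfile")

echo "before=$before after_write=$after_write after_rename=$after_rename"
rm -rf "$demo_dir"
```

The first two numbers match and the third differs, which is why recreating the container (and with it the bind mount) is what picks up the new file.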
2.7 Pre-flight before Phase 3
- dig api.sapari.io +short from anywhere returns the Hetzner IPv4
- ssh deploy@<tailnet-ip> works (over Tailscale)
- ssh deploy@<tailnet-ip> "docker --version" prints a version (deploy in docker group)
- ssh deploy@<tailnet-ip> "swapon --show" shows the 4 GB swapfile
- ssh deploy@<tailnet-ip> "timedatectl | grep synchronized" shows yes
- GitHub repo Deploy keys list contains sapari-prod Ed25519 key
- GitHub production environment has all four secrets set
Phase 3 — First Deploy
This is where the prod app first runs. Plan ~1 hour with debugging margin.
3.1 Clone repo + write .env.production
ssh deploy@<tailnet-ip> # via Tailscale
git clone https://github.com/benavlabs/sapari.git ~/sapari
cd ~/sapari
cp backend/.env.production.example .env
- Open .env and fill in every required value. Use the Environment Variable Reference at the bottom as the checklist
Critical values the security validator enforces (app refuses to start otherwise):
- SECRET_KEY — 32+ chars; python -c "import secrets; print(secrets.token_urlsafe(64))"
- POSTGRES_PASSWORD — must NOT be postgres (or pass full DATABASE_URL instead)
- CREATE_TABLES_ON_STARTUP=false
- ENVIRONMENT=production
- DEBUG=false
- STRIPE_TEST_MODE=false
- SESSION_SECURE_COOKIES=true
- ADMIN_USERNAME — NOT admin
- ADMIN_PASSWORD — 12+ chars, not in weak password list
- TASKIQ_RABBITMQ_USER and _PASSWORD — NOT guest/guest
Differs from staging — double-check:
- OAUTH_REDIRECT_BASE_URL=https://app.sapari.io
- MEDIA_PROXY_BASE_URL=https://app.sapari.io (no trailing slash)
- FRONTEND_URL=https://app.sapari.io
- API_PUBLIC_URL=https://api.sapari.io (newsletter confirm/unsubscribe email links)
- LANDING_URL=https://sapari.io (newsletter confirm/unsubscribe redirect targets)
- CORS_ORIGINS=https://app.sapari.io,https://sapari.io (must include landing or newsletter signup POST is CORS-blocked)
- LOGFIRE_ENVIRONMENT=production
- All Stripe keys live (sk_live_, pk_live_, whsec_ from live webhook)
Tuning values to set explicitly for prod (these turn the parameterization on — without them, compose uses staging defaults):
- WEB_WORKERS=4, WEB_MEMORY=6g, WEB_CPUS=4.0, WEB_MEMORY_RESERVATION=1g
- RENDER_MEMORY=6g, RENDER_CPUS=4.0, RENDER_FFMPEG_THREADS=4, RENDER_MEMORY_RESERVATION=2g
- PROXY_MEMORY=3g, PROXY_CPUS=3.0, PROXY_FFMPEG_THREADS=3
- DOWNLOAD_MEMORY=2g, DOWNLOAD_CPUS=2.0
- ANALYSIS_MEMORY=2g, ANALYSIS_CPUS=1.5, ANALYSIS_TASKIQ_CONCURRENCY=4
- REDIS_MAXMEMORY=500mb, REDIS_MEMORY=768m, REDIS_CPUS=0.5
- RABBITMQ_MEMORY=1g
- CADDY_MEMORY=256m, CADDY_CPUS=0.25
- POSTGRES_POOL_SIZE=20, POSTGRES_MAX_OVERFLOW=10
- STORAGE_MAX_UPLOAD_SIZE_MB=10240
- MAX_VIDEO_DURATION_MINUTES=90
3.2 Verify GHCR image pull
GHCR images are public today — no docker login needed.
- Pull succeeds. If you ever flip the repo to private: PAT with read:packages, then echo $TOKEN | docker login ghcr.io -u <user> --password-stdin
3.3 Run first-deploy.sh
Order: pull image → migrate (alembic upgrade head) → seed (tiers, admin user, Stripe products) → start all services → health check.
- First-time seed creates: 4 tiers (free/hobby/creator/viral), the admin user, Stripe products for Creator + Viral
- Watch for migration errors — they abort the deploy
- If seed_stripe_products.py errors → STRIPE_SECRET_KEY is wrong. Fix and re-run just the seed: ./scripts/deployment/run-task.sh backend/scripts/seed_stripe_products.py
3.4 Verify the box is alive (before DNS routes traffic)
# On the server:
curl -f http://localhost:8000/health # liveness
curl -f http://localhost:8000/health/ready # readiness — DB + Redis + RabbitMQ + storage all green
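# From your laptop, pinning the hostname to the IP so SNI and Host both say api.sapari.io
# (a bare https://<hetzner-ip> request sends no SNI, and the cert is for api.sapari.io):
curl --resolve api.sapari.io:443:<hetzner-ip> https://api.sapari.io/health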
- Liveness: 200
- Readiness: 200 with all subsystems green
- If Caddy hasn't obtained the cert, check docker logs caddy — DNS-01 needs valid CLOUDFLARE_API_TOKEN and api.sapari.io resolving to this IP
3.5 Verify externally once DNS propagates
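The external check amounts to two requests against the public hostname. A hedged sketch, assuming the same /health and /health/ready paths used in 3.4; check_200 and the CURL override are illustrative helpers, not part of the repo:

```shell
# Print only the HTTP status code for a URL. CURL can be pointed at a
# stub for dry runs; it defaults to the real curl. No -k: an invalid
# certificate must count as a failure here.
http_status() {
  ${CURL:-curl} -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

check_200() {
  local code
  code=$(http_status "$1") || code=000
  if [ "$code" = "200" ]; then
    echo "OK $1"
  else
    echo "FAIL $1 (HTTP $code)"
    return 1
  fi
}

# From your laptop:
#   check_200 https://api.sapari.io/health
#   check_200 https://api.sapari.io/health/ready
```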
- Both return 200 with valid certs (no -k needed)
Phase 4 — Cloudflare Worker + Pages + DNS
The Worker handles /api/* proxying to api.sapari.io and /media/v1/* for R2 media. Pages serves the frontend at app.sapari.io and the landing at sapari.io.
4.1 Frontend build env
- In frontend/, create or update .env.production (committed file is fine; no secrets):
  - VITE_API_BASE_URL=https://app.sapari.io (Worker proxies /api/* to backend)
  - VITE_SENTRY_DSN=<from 1.8> if using Sentry
- Push to main (or your prod branch) — Cloudflare Pages builds + deploys automatically
4.2 Frontend custom domain
- CF Pages dashboard → frontend project → Custom domains → add app.sapari.io
- CF auto-creates the CNAME; verify it propagates
4.3 Landing page
- Same flow for landing/ Pages project — custom domain sapari.io (apex). Pages handles apex via CNAME flattening
- Set PUBLIC_API_ORIGIN=https://api.sapari.io in the landing Pages project env vars so the newsletter signup form POSTs to the correct origin (empty value renders an inline "Misconfigured" error on the signup button)
4.4 Cloudflare Worker — secret + deploy
# From your laptop, in worker/ directory:
cd worker
npx wrangler secret put MEDIA_TOKEN_SECRET_V1 --env production
# Paste the SAME value you put in backend .env as MEDIA_TOKEN_SECRET — byte-identical
npm run deploy:production
# If wrangler reports "No deploy targets":
npx wrangler versions deploy --env production
- Verify fingerprints match. Backend logs: docker logs sapari-backend | grep media_token should print media_token: active=v1 registry=[v1:<8-char-hex>]. Then npx wrangler tail --env production and load a clip in the browser — same fingerprint should appear
4.5 Cloudflare Worker — route patterns (dashboard, NOT wrangler.toml)
Worker [[routes]] in wrangler.toml is intentionally empty — Pages binding can't coexist with route patterns there (CF returns error 10144 if both are declared).
- CF Dashboard → Workers & Pages → sapari-proxy-production → Settings → Domains & Routes → Add (order matters — first match wins):
  - app.sapari.io/media/v1/* ← add this first
  - app.sapari.io/api/* ← then this
If /api/* comes before /media/v1/*, media requests like /media/v1/<jwt> match the API rule first and get proxied to the backend. Symptom: 404 on every clip play. No error message at deploy — purely order-of-rules.
Without route patterns at all, requests bypass the Worker and hit Pages' SPA fallback (returns index.html for /api/* — symptom is "API calls return HTML").
4.6 Cloudflare Access policies
- Production app (app.sapari.io): NO Access policy. Public
- internal-docs.sapari.io: same Access app as staging (GitHub team benavlabs/Sapari)
- If/when adding dozzle-prod.sapari.io and beszel-prod.sapari.io: gate via the same Access app
4.7 Stripe webhook reachability test
- Stripe Dashboard → Webhooks → your endpoint → "Send test webhook" → pick customer.subscription.updated
- Check docker logs sapari-backend | grep webhook — should show signature-verified receipt
Phase 5 — Cutover & Smoke Test
Live moment. ~2 hours including verification.
5.1 Pre-flight (do once, before the cutover hour)
- All Phases 1-4 boxes ticked
- https://api.sapari.io/health/ready returns 200 from outside the network
- https://app.sapari.io loads the frontend (with Worker routes, /api/* proxies)
- Stripe live webhook test passes
- Postmark send-test passes
- TTL on api.sapari.io is 60s (set in 2.2)
5.2 Smoke test the running prod app (before announcing)
Run through every critical user journey while the app is live but unannounced. Use a real (non-admin) user account or create one fresh.
Auth + onboarding
- [ ] Signup → email verification → login → logout → login again
- [ ] OAuth: Google login (if configured), GitHub login (if configured)
Editor cold paths
- [ ] Create two projects, switch between them — editor doesn't blank, no wrong-project mutation if you spam delete-edit
- [ ] Cold load /projects/:uuid: editor skeleton renders, not generic spinner
- [ ] Cold load /projects/new: list skeleton, NOT editor skeleton
Upload paths
- [ ] Upload < 25 MiB clip (single-PUT path) → presign + PUT, confirm endpoint succeeds
- [ ] Upload 100 MB+ clip (multipart path) → multipart initiate + parallel parts + complete; faster than single-PUT
- [ ] Multipart cancel + resume: cancel mid-upload, drop same file in again → resumes from next missing part (network tab: /multipart/parts listing precedes new PUTs)
- [ ] Upload 11 GB file → frontend rejects with friendly "10 GB max" copy BEFORE any bytes upload
- [ ] Upload 95-min video → frontend rejects with "90 minutes max" copy
- [ ] Upload short video with weird MOV container that browser can't probe metadata → upload proceeds (fail-soft), backend ffprobe rejects with backend error if too long
Pipelines
- [ ] Trigger analysis run, watch SSE events arrive (analysis_progress, analysis_complete)
- [ ] Render export, watch progress, download result, play it
- [ ] Verify "Queued" UX when render starts and worker is busy (manually queue a 2nd render to test the queue-grow path)
Billing
- [ ] Subscribe to Creator tier with a real card, verify webhook fires, verify entitlements grant
- [ ] Cancel subscription mid-flight — verify cancellation feedback flow + retention offers
Assets
- [ ] Upload image + video assets, verify they render in asset library
Newsletter
- [ ] Submit landing newsletter form with a fresh email → confirmation email arrives within ~30 s
- [ ] Click confirm link → lands on /newsletter/confirmed?status=confirmed (English default; pt/es subscribers land on /pt/newsletter/confirmed?... or /es/newsletter/confirmed?... once the localized Astro routes are populated — default-locale paths are unprefixed per landing/astro.config.mjs)
- [ ] Click unsubscribe link in any received email → lands on /newsletter/unsubscribed?status=unsubscribed (same locale-prefix convention as confirm above)
SSE
- [ ] DevTools network: confirm only ONE EventSource at /api/v1/events/user-stream
- [ ] Force-disconnect SSE (DevTools offline throttle): polling fallback engages within ~5s, both notifications and assets caches invalidate; reconnect → polling stops
Mobile
- [ ] Log in on a real phone, do the wizard end-to-end, landscape review works
Health endpoints from monitoring tools
- [ ] /health + /health/ready both green
5.3 If smoke test fails
- Trace via Logfire — find failing span, fix forward if possible
- If unfixable in <30 min: rollback. From server: ./scripts/deployment/rollback.sh <previous-sha>
- If schema is the issue and rollback.sh aborts: Neon time-travel restore to revert schema, then rollback the image
5.4 Announce
Once smoke test is green:
- Bump api.sapari.io TTL back to Auto (no longer need fast-rollback window)
- Announce on whichever channel (Twitter, mailing list, Discord, etc.) — drip, not blast (Postmark sender reputation)
- Tail Logfire for the first hour — anomalies show up here first
Phase 6 — Post-Launch Tuning + Hardening
Do these in days/weeks after launch. None block go-live; each compounds reliability.
6.1 Tuning observation (week 1)
Watch for signals that would change the resource-sizing decisions:
- Neon scale-to-zero impact: Logfire db.* span p99 spikes correlated with idle gaps. If real signal shows up, set min CU=1 in Neon dashboard (~$76/mo always-on). Otherwise save it
- Web worker memory pressure: Beszel web container memory under sustained 100+ concurrent users. If hitting 80%+ of 6g limit, bump WEB_MEMORY=8g in .env.production
- Caddy CPU under login burst: TLS handshake spikes saturating 0.25 cpu. Bump CADDY_CPUS=0.5 if seen
- Redis CPU under pubsub fan-out: 0.5 cpu sufficient at target load. Bump if Beszel shows sustained >70%
- Render queue depth: visible signal that single-worker render is bottleneck. Either raise UX expectations ("est. 30 min wait") or add a second render worker container (vertical replica)
- Postmark deliverability: log into dashboard, watch reputation score in week 1 — should climb from neutral
6.2 Monitoring on production (week 1)
- Add Beszel + Dozzle to production (already in docker-compose.prod.yml; just need agent KEY/TOKEN bootstrap from the hub UI)
- Gate beszel-prod.sapari.io and dozzle-prod.sapari.io behind Cloudflare Access (reuse GitHub team policy)
- Wire Discord webhook with prod-specific channel — alerts for CPU >80%, memory >80%, disk >90%, container restarts
- Set up Logfire alert on API error rate >1%
- Set up Logfire alert on task failure rate >5%
- Add queue-depth observability — Logfire span or admin endpoint reporting per-broker RabbitMQ queue depth (from architectural-ceiling #5). Can be a simple /admin/queues endpoint reading rabbitmqctl list_queues
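Until that endpoint exists, queue depth can be read by hand from rabbitmqctl. A rough sketch that totals the messages column; queue_depth_total is a hypothetical helper, and the compose service name in the usage comment is an assumption about docker-compose.prod.yml:

```shell
# Sum the last column of "name<TAB>messages" rows read from stdin, as
# printed by `rabbitmqctl -q list_queues name messages`. Rows whose last
# field is not a number (for example a header row) are skipped.
queue_depth_total() {
  awk '$NF ~ /^[0-9]+$/ { total += $NF } END { print total + 0 }'
}

# On the server (service name "rabbitmq" is assumed):
#   docker compose -f docker-compose.prod.yml exec rabbitmq \
#     rabbitmqctl -q list_queues name messages | queue_depth_total
```

A total that keeps growing while the render worker is busy is the signal that the single-worker ceiling is being hit.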
6.3 Origin firewall + Cloudflare orange-cloud cutover
Two changes that land together to lock origin traffic to Cloudflare's edge.
Origin firewall — scripts/deployment/sapari-docker-firewall.sh + .service install during setup-server.sh. The unit locks the Docker DOCKER-USER iptables chain :443 to Cloudflare's published v4 + v6 egress lists (UFW alone doesn't cover Docker-NAT'd ports because Docker manages its own iptables rules that run before UFW's INPUT chain). The unit is oneshot After=docker.service, fetches CF's IP lists at boot, scopes rules to the WAN interface, re-applies idempotently. Verify with:
iptables -L DOCKER-USER -n -v --line-numbers
# Expect ~15 v4 + 7 v6 CF CIDR ACCEPTs above one final DROP for dpt:443
systemctl status sapari-docker-firewall
# Expect "Active: active (exited)" + "enabled"
Cloudflare orange-cloud cutover — once the origin firewall is verified active:
- Flip api.sapari.io DNS from grey (proxy OFF) to orange (proxy ON). DNS will resolve to CF egress IPs (104.21.x.x/172.67.x.x). Required for the origin firewall to work without breaking browser → API traffic (which now arrives only via CF)
- Disable HTTP/3 (QUIC) zone-wide via CF Speed → Optimization. Preventive: orange-cloud + HTTP/3 + SSE produces ERR_QUIC_PROTOCOL_ERROR. Verify: curl -sI https://api.sapari.io/health shows HTTP/2, no Alt-Svc: h3 header
- Caddy SSE compression must stay off. caddy/Caddyfile deliberately omits encode gzip from the API + Dozzle blocks because gzip buffers 15-byte SSE keepalive frames until the buffer fills, never flushing. CF edge negotiates compression with the browser anyway, so removing Caddy-side compression is a no-op for bytes-on-wire and a fix for streaming. Same trap exists in nginx, Apache, any reverse proxy that batches before encoding
Verification: curl https://<hetzner-public-ip> from a non-CF source should hang and time out (DROP rule); curl https://api.sapari.io/health through CF should still succeed.
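The gzip buffering problem described above can be reproduced with Python's zlib alone. This is a sketch of the general mechanism, not of Caddy's encoder: the keepalive frame is an illustrative SSE comment line (an assumption), and a receiver-side decompressor stands in for the browser.

```python
import zlib

frame = b": keepalive\n\n"  # illustrative SSE comment frame (assumption)

compressor = zlib.compressobj(wbits=31)      # wbits=31 selects the gzip container
decompressor = zlib.decompressobj(wbits=31)

# Without a flush the compressor emits at most the gzip header;
# the frame itself stays in its internal buffer.
buffered = compressor.compress(frame)
received_before_flush = decompressor.decompress(buffered)

# A sync flush forces the pending bytes onto the wire.
flushed = compressor.flush(zlib.Z_SYNC_FLUSH)
received_after_flush = decompressor.decompress(flushed)

print(received_before_flush)  # nothing decodable has arrived yet
print(received_after_flush)   # the full frame arrives only after the flush
```

A proxy that compresses without flushing after each small write leaves the client in the first state indefinitely, which is why the runbook keeps compression off for streaming routes.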
6.4 Performance baseline (post traffic accumulation)¶
These items need ~2-3 days of real traffic before the numbers are meaningful:
- CF Workers Analytics baseline: capture categorized 4xx/5xx rates from prod traffic
- R2 load test: validates Worker edge-cache behavior under real load
- Re-do performance audit with real Logfire span data: replaces static-code audit with measured p95/p99
6.5 Bundle audit (parallel work)¶
- Frontend bundle audit: rollup-plugin-visualizer baseline + lazy splits. Chip away at the main chunk; ~3 days of work
6.6 Annual rotations¶
- Calendar reminder: rotate MEDIA_TOKEN_SECRET annually. Procedure in media-token-rotation.md
- Calendar reminder: rotate SECRET_KEY annually (forces all sessions to re-auth, so schedule for a low-traffic window)
Operational Gotchas — Lessons from Staging¶
Concentrated reference of every "this bit us last time." Skim once before Phase 1; come back to the relevant section if something goes sideways.
Hetzner & the host¶
- Backups: enable at provisioning time, not after. ~20% surcharge for 7-day rolling snapshots covers Redis state, RabbitMQ queues, Caddy certs, .env. Toggle at order time; flipping later leaves a no-backup window
- Memory limits should be conservative: prefer task failure over daemon crash. Render worker's 6 GB limit is intentional. A render that needs >6 GB fails with a clean FFmpegResourceError, refunds credit, notifies user. Generous limit + daemon swap = worse UX and operator nightmare
- Zero swap is Hetzner default; 4 GB swap prevents swapless OOM under render spikes (Phase 2.4)
- NTP is load-bearing for: (1) Caddy DNS-01 ACME signature verification (fails opaquely if skew >5 min); (2) media-token JWT TTL verification (skew >5 s breaks playback)
Tailscale¶
- Reuse the staging OAuth client: tagged tag:ci, works for both environments
- The OAuth client is fragile: if deleted from Tailscale admin, all CI deploys break. Document its existence in your ops runbook
- SSH_HOST in GitHub Secrets is the tailnet IP (100.x.y.z), not public IPv4
Caddy¶
- F1: bind-mount inode lock (NOT YET FIXED). git reset --hard in deploy.sh replaces caddy/Caddyfile via tempfile rename, changing inode. Running container's bind-mount points at old inode. docker compose restart caddy does NOT fix it; only docker compose -f docker-compose.prod.yml up -d --force-recreate caddy does. Burn into muscle memory: any Caddyfile change → --force-recreate caddy
- DNS-01 over HTTP-01 is deliberate: no port 80, no scrambling around HTTP-01 timing
- Healthcheck endpoint is :2020, not :2019. :2019 is admin API (disabled). Match :2020 if you ever hand-write a probe
- X-Forwarded-For must derive from cf-connecting-ip, not raw header, because CF strips original. Caddyfile translates back so backend logs see real IPs (rate-limiting, fraud, audit logs depend on it)
- First cert acquisition takes ~30-60 s. No cert after 2 min → check docker logs caddy for ACME errors (usually CLOUDFLARE_API_TOKEN permission or DNS not propagated)
- Do NOT add encode gzip to the API block. SSE keepalive frames never fill the gzip buffer → connection hangs. CF edge handles compression with the browser end-to-end; Caddy-side compression is redundant AND breaks streaming
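The F1 inode behaviour is easy to reproduce on a POSIX system. In the sketch below an already-open file handle stands in for the container's single-file bind mount (an analogy, not the Docker mechanism itself), and the file names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "Caddyfile")
    with open(target, "w") as f:
        f.write("old config\n")
    inode_before = os.stat(target).st_ino

    # An already-open handle stands in for the running container's view.
    stale_handle = open(target)

    # Replace via tempfile + rename, the pattern the runbook attributes
    # to git reset --hard.
    replacement = os.path.join(workdir, "Caddyfile.tmp")
    with open(replacement, "w") as f:
        f.write("new config\n")
    os.replace(replacement, target)

    inode_after = os.stat(target).st_ino
    stale_content = stale_handle.read()
    stale_handle.close()

print(inode_before != inode_after)  # the path now points at a different inode
print(repr(stale_content))          # the old handle still sees the old bytes
```

Anything bound to the old inode keeps seeing the old content until it is re-bound, which is consistent with a restart not helping while a recreate does.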
Cloudflare¶
- Free Universal SSL covers depth-1 subdomains only. dozzle-staging.sapari.io works free; dozzle.staging.sapari.io (depth-2) requires Advanced Cert Manager (paid). Keep ops subdomains depth-1
- Grey vs orange cloud is consequential. Pre-firewall, backend domains (api.*) MUST be grey (Caddy terminates TLS, double-proxy breaks the cert flow). Post-firewall + Phase 6.3, api.* flips to orange to lock origin to CF's edge
- CF API token scope creep: scope to Zone:DNS:Edit on sapari.io zone only. Token scope can't be narrowed retroactively
- Custom domain attachment for the Worker is in dashboard, NOT wrangler.toml. Declaring [[routes]] + Pages binding = error 10144
- HTTP/3 + SSE behind orange-cloud → ERR_QUIC_PROTOCOL_ERROR. Disable HTTP/3 zone-wide before flipping api.* orange
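The depth-1 rule for Universal SSL can be checked mechanically before creating a DNS record. A small sketch, where subdomain_depth is a hypothetical helper and the zone default is taken from this runbook:

```python
def subdomain_depth(hostname: str, zone: str = "sapari.io") -> int:
    """Count the labels in hostname that sit to the left of the zone apex."""
    host = hostname.rstrip(".").lower()
    apex = zone.rstrip(".").lower()
    if host == apex:
        return 0
    if not host.endswith("." + apex):
        raise ValueError(f"{hostname!r} is not inside zone {zone!r}")
    prefix = host[: -(len(apex) + 1)]
    return prefix.count(".") + 1


for name in ("dozzle-staging.sapari.io", "dozzle.staging.sapari.io"):
    depth = subdomain_depth(name)
    # Per the runbook, anything deeper than 1 needs a paid certificate.
    print(name, depth, "covered" if depth <= 1 else "needs Advanced Cert Manager")
```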
R2 + Worker¶
- Buckets must pre-exist before Worker deploys. Wrangler doesn't create buckets; binds to existing. Missing bucket = cryptic "no such binding" runtime error
- MEDIA_TOKEN_SECRET byte-identity is non-negotiable. Backend HS256-signs, Worker HS256-verifies. One byte different → 100% playback 401s. Verification protocol in Phase 4.4
- Worker route order is first-match-wins. /media/v1/* before /api/* (Phase 4.5)
- wrangler deploy may report "No deploy targets", which is normal. With workers_dev = false and no [[routes]], use npx wrangler versions deploy --env <env>
- Versioning toggle is per-bucket, manual, in dashboard. Once a file is deleted in a non-versioned bucket, no recovery path
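Why a single differing byte in MEDIA_TOKEN_SECRET fails every token: HS256 is HMAC-SHA256 over the JWT signing input, so the signature depends on every byte of the key. A sketch with the standard library, using placeholder secrets and a placeholder signing input (all assumptions); the trailing newline mimics a secret pasted through a tool that appends one.

```python
import base64
import hashlib
import hmac


def b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_hs256(signing_input: bytes, secret: bytes) -> str:
    return b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())


def verify_hs256(signing_input: bytes, signature: str, secret: bytes) -> bool:
    # Constant-time comparison of the expected and presented signatures.
    return hmac.compare_digest(sign_hs256(signing_input, secret), signature)


backend_secret = b"example-secret"      # placeholder value
worker_secret = b"example-secret\n"     # same secret plus one stray newline
signing_input = b"header.payload"       # stands in for b64url(header).b64url(payload)

signature = sign_hs256(signing_input, backend_secret)
print(verify_hs256(signing_input, signature, backend_secret))  # identical bytes
print(verify_hs256(signing_input, signature, worker_secret))   # one byte off
```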
Backend env vars¶
- Production security validator hard-fails on: weak SECRET_KEY (<32 chars), POSTGRES_PASSWORD=postgres, CREATE_TABLES_ON_STARTUP=true. Soft-warns on: Redis without password, CORS *, DEBUG=true, docs in prod, weak admin creds, sessions >120 min
- OAUTH_REDIRECT_BASE_URL: root only, no path, no trailing slash. Code appends /api/v1/auth/oauth/callback/<provider>. Adding a path doubles it; trailing slash breaks some providers' callback registration
- MEDIA_PROXY_BASE_URL: must match user-facing domain (https://app.sapari.io), not backend domain. Wrong value → clips try to play from api.sapari.io (no Worker route there) and 404
- CORS_ORIGINS must include the landing origin. Newsletter signup form on sapari.io POSTs to api.sapari.io/api/v1/newsletter/subscribe; if sapari.io isn't in CORS_ORIGINS, the browser blocks the request and the form renders "Misconfigured"
- Four Redis DBs share one container by design (cache=0, rate-limiter=1, sessions=2, taskiq result=3). Don't consolidate
- RabbitMQ user MUST NOT be guest/guest. Override via RABBITMQ_DEFAULT_USER / RABBITMQ_DEFAULT_PASS in compose env, plus TASKIQ_RABBITMQ_USER / _PASSWORD in .env. Generate with openssl rand -hex 32
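The OAUTH_REDIRECT_BASE_URL rules lend themselves to a pre-flight check. The sketch below is hypothetical (check_redirect_base and callback_url are not the backend's functions); only the callback suffix comes from this runbook.

```python
from urllib.parse import urlsplit

# Suffix the runbook says the backend appends to the base URL.
CALLBACK_SUFFIX = "/api/v1/auth/oauth/callback/{provider}"


def check_redirect_base(base_url: str) -> list[str]:
    """Return the rule violations for a candidate OAUTH_REDIRECT_BASE_URL."""
    problems = []
    if base_url.endswith("/"):
        problems.append("trailing slash")
    if urlsplit(base_url).path.strip("/"):
        problems.append("contains a path")
    return problems


def callback_url(base_url: str, provider: str) -> str:
    """Naive concatenation, to show what a bad base value turns into."""
    return base_url + CALLBACK_SUFFIX.format(provider=provider)


print(check_redirect_base("https://app.sapari.io"))
print(check_redirect_base("https://app.sapari.io/api/v1"))
print(callback_url("https://app.sapari.io/api/v1", "google"))
```

The last line shows the doubled /api/v1 prefix that results when a path sneaks into the base value.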
Database & migrations¶
- Neon Launch tier is required for prod, not Free
- Direct endpoint over pooled — convention #16. Pooled endpoint uses PgBouncer in transaction mode, breaks asyncpg's prepared-statement cache (3-4× round-trips per query). Take direct, accept rare connection blip
- Two-deploy rule for destructive migrations. Drop column? Rename? Ship code change first (reads neither old nor new), let it soak, ship migration second. Doing both in one deploy = rollback to prior image incompatible with new schema → stuck
- CONFIRM_PRODUCTION_MIGRATION=yes is the env-py prod gate. Deploy scripts pass automatically. Manual migrations on server need it explicitly
- Alembic head baked into image labels. rollback.sh reads LABEL sapari.alembic_head from target image, compares against live DB. Mismatch → abort
- Neon time-travel = 6-hour restore window (free tier). Branch → Restore → pick a timestamp. Applies in ~30 s
TaskIQ + RabbitMQ¶
- rabbitmq_delayed_message_exchange plugin is REQUIRED. Without it, SmartRetryMiddleware's exponential backoff is silently ignored: failed downloads retry immediately, hit same error, get dropped. Plugin enabled via rabbitmq/Dockerfile
- Broker (RabbitMQ) and result backend (Redis) deliberately separate. Redis crash → results lost, queue survives, tasks retry. Consolidated would lose both on Redis incident
- Priority queue mapping: viral=3, creator=2, hobby=1, free=0. Set via .kicker().with_labels(priority=N).kiq(...). New tier means updating both code enum AND queue priority levels in compose
- No eager tasks import in module __init__.py (Convention #17). Causes a circular import via infrastructure.taskiq → workers/shared/context → modules.email.service → modules.email.__init__ → tasks → infrastructure.taskiq mid-load. Workers crash-loop silently under taskiq's process manager (containers report Up, RestartCount=0, but no work progresses). CI grep enforces
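One way to keep the tier-to-priority mapping from drifting is to make an unmapped tier fail loudly instead of defaulting. A sketch under that assumption; TIER_PRIORITY mirrors the values listed above, while priority_for is a hypothetical helper rather than the project's code.

```python
# Values taken from the runbook's priority queue mapping.
TIER_PRIORITY = {"viral": 3, "creator": 2, "hobby": 1, "free": 0}


def priority_for(tier: str) -> int:
    """Return the queue priority for a tier, refusing unknown tiers."""
    try:
        return TIER_PRIORITY[tier]
    except KeyError:
        # A new tier must be added here and in the compose queue settings.
        raise ValueError(f"no queue priority defined for tier {tier!r}") from None


print(priority_for("creator"))
```

The returned value would then be passed as N in the .kicker().with_labels(priority=N).kiq(...) call mentioned above.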
Stripe¶
- Test-mode keys on prod = silent failure. Live-mode keys on staging = real charges
- Webhook signing secret per-endpoint, not per-account. Rotating endpoint generates new secret; old stops working
- Webhook signing secret mismatch is silent. Backend returns 401, Stripe retries a few times then gives up. User paid; sees no entitlements. Symptom appears 5-30 min post-charge
- Idempotency keys on Stripe API write calls — backend uses these. Add new write paths with idempotency keys
- Tier 3 + Tier 4 products are auto-seeded on first deploy. Don't pre-create
- _to_dict() boundary helper in webhooks.py converts Stripe's StripeObject to a plain dict at the webhook entry point. Required because StripeObject is dict-like but not a dict, so downstream type-checked code (FastCRUD, Pydantic models) raises on direct pass-through
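To see why a per-endpoint signing secret mismatch rejects every event, here is a simplified sketch of Stripe's v1 signature scheme: HMAC-SHA256 over "<timestamp>.<payload>", compared against the v1 field of the Stripe-Signature header. It is for illustration only; it ignores timestamp tolerance and multiple v1 entries, and the secrets and payload are placeholders. Production code would normally rely on the Stripe library's stripe.Webhook.construct_event.

```python
import hashlib
import hmac


def stripe_v1_signature(payload: bytes, timestamp: int, secret: str) -> str:
    """Compute the hex v1 signature for a raw webhook body."""
    signed_payload = f"{timestamp}.".encode() + payload
    return hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()


def verify_signature(payload: bytes, header: str, secret: str) -> bool:
    """Check a Stripe-Signature style header (t=...,v1=...) against a secret."""
    fields = dict(item.split("=", 1) for item in header.split(","))
    expected = stripe_v1_signature(payload, int(fields["t"]), secret)
    return hmac.compare_digest(expected, fields.get("v1", ""))


payload = b'{"type": "checkout.session.completed"}'   # placeholder body
old_secret = "whsec_old_placeholder"                  # secret before endpoint rotation
new_secret = "whsec_new_placeholder"                  # secret Stripe now signs with
timestamp = 1700000000

header = f"t={timestamp},v1={stripe_v1_signature(payload, timestamp, new_secret)}"
print(verify_signature(payload, header, new_secret))  # backend updated
print(verify_signature(payload, header, old_secret))  # backend still on the old secret
```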
Postmark¶
- DKIM + SPF + Return-Path: all three or none. Verify all green before deploying
- Sender reputation builds slowly. Don't blast 1000+ users on day 1 — drip
- Separate Postmark server token per environment. Sharing muddles deliverability stream
- Email broker has no SmartRetry. A Postmark / RabbitMQ outage during POST /newsletter/subscribe leaves the subscriber row in PENDING. The _recover_pending_newsletter_subscribers cron sweep is the recovery path (re-queues confirmation emails for rows >30 min old, max 3 attempts)
- CAN-SPAM postal address + entity name in base.html + base.txt footers are legally required. The 5-case test_template_footer.py pins the copy
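The selection rule behind the pending-subscriber sweep (older than 30 minutes, fewer than 3 attempts) can be sketched as a pure function. The Subscriber fields and select_for_requeue are hypothetical names for illustration; the real sweep presumably queries the database instead of filtering a list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(minutes=30)   # rows must be older than this
MAX_ATTEMPTS = 3                  # stop re-queueing after this many tries


@dataclass
class Subscriber:
    email: str
    status: str
    created_at: datetime
    attempts: int


def select_for_requeue(rows: list[Subscriber], now: datetime) -> list[Subscriber]:
    """Pick PENDING rows that are old enough and still have attempts left."""
    return [
        row
        for row in rows
        if row.status == "PENDING"
        and now - row.created_at > MIN_AGE
        and row.attempts < MAX_ATTEMPTS
    ]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    Subscriber("a@example.com", "PENDING", now - timedelta(minutes=45), 0),
    Subscriber("b@example.com", "PENDING", now - timedelta(minutes=10), 0),
    Subscriber("c@example.com", "PENDING", now - timedelta(hours=2), 3),
    Subscriber("d@example.com", "CONFIRMED", now - timedelta(hours=2), 1),
]
print([row.email for row in select_for_requeue(rows, now)])
```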
Observability¶
- SQLAlchemy instrumentation: ON. Redis: OFF. SQLAlchemy spans solve "why is this endpoint slow" (high signal). Redis ops sub-millisecond, uniform (high volume, low signal). Per-worker kill-switches exist
- Span taxonomy = 6 categories: pipeline.parent / step.* / taskiq.* / service.* / ext.* / cron. Hand-named spans break dashboard filters
- Per-worker service.name distinguishes workers in queries. Web=sapari-api; workers=sapari-{email,analysis,render,download,proxy,asset-edit}; scheduler=sapari-scheduler
- Beszel shoutrrr Discord URL format: token first, then ID, in discord:// URI form
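The token-first ordering is easy to get backwards because a Discord webhook URL lists the ID first. A small conversion sketch, assuming the usual https://discord.com/api/webhooks/<id>/<token> layout; the function name and the sample values are hypothetical.

```python
from urllib.parse import urlsplit


def discord_webhook_to_shoutrrr(webhook_url: str) -> str:
    """Convert a Discord webhook URL into shoutrrr's discord://token@id form."""
    segments = [s for s in urlsplit(webhook_url).path.split("/") if s]
    # Expected path: /api/webhooks/<id>/<token>
    if len(segments) != 4 or segments[:2] != ["api", "webhooks"]:
        raise ValueError("not a Discord webhook URL")
    webhook_id, token = segments[2], segments[3]
    return f"discord://{token}@{webhook_id}"


print(discord_webhook_to_shoutrrr("https://discord.com/api/webhooks/123456/abcDEF"))
```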
CI/CD¶
- Image tag strategy: floating production tag for normal deploys, SHA tag for rollbacks. Always pin SHA in rollback.yml inputs
- Public GHCR images today; flip to private requires docker login
- Migrations run BEFORE container restart in deploy.sh. Migration failure aborts deploy; old code keeps running on old schema. Don't reorder
- git clone on the server, not rsync-only. Operators can SSH in, edit docker-compose.prod.yml, restart without CI. When CI is down, the server's git repo is your unblock
DNS + cutover¶
- Lower TTL on api.sapari.io to 60s before cutover (Phase 2.2). Bump to Auto in Phase 6 once stable
- CF Pages CNAME flattening is automatic at apex. sapari.io shows as A record, not CNAME. CF feature, not bug
- Brief downtime per deploy is accepted. No blue/green, no Swarm. Plain docker compose up -d → ~10 s API restart → frontend shows maintenance screen → React Query retries on resume
Rollback Plan¶
Three layers.
Code rollback (most common):
Aborts if new image's Alembic head differs from live DB. Pass --ignore-migration-warning if schema is backwards-compatible.
Schema rollback (Neon time travel):
- Free/Launch tier: 6-hour restore window
- Neon dashboard → Branches → Restore to a point in time — applies in ~30 seconds
- Use this if alembic downgrade -1 isn't safe
DNS-level rollback (last resort):
- TTL was 60s during cutover; point the record back to the old IP via CF DNS edit
- For "the new server is fundamentally broken" only. Almost never the right answer if Caddy/health checks are green but app behavior is wrong
Environment Variable Reference¶
Full .env template for production. Bold = enforced by security validator (app won't start without it set correctly).
# === App ===
ENVIRONMENT=production
DEBUG=false
SECRET_KEY=<openssl rand -base64 64> # 32+ chars
FRONTEND_URL=https://app.sapari.io
API_PUBLIC_URL=https://api.sapari.io # newsletter email link base
LANDING_URL=https://sapari.io # newsletter redirect target
CONTACT_EMAIL=hello@sapari.io
LOG_LEVEL=INFO
# === Image registry ===
GHCR_OWNER=benavlabs
IMAGE_TAG=production
# === Database (Neon Launch tier, direct endpoint) ===
DATABASE_URL=postgresql+asyncpg://<user>:<password>@<direct-endpoint>/<db>?ssl=require
# Note: use ?ssl=require — NOT ?sslmode=require (psycopg2 syntax). Drop any
# &channel_binding=require Neon's UI may copy in — asyncpg auto-negotiates SCRAM.
CREATE_TABLES_ON_STARTUP=false
POSTGRES_POOL_SIZE=20 # per-process
POSTGRES_MAX_OVERFLOW=10 # per-process
# === Redis ===
CACHE_BACKEND=redis
CACHE_REDIS_HOST=redis
CACHE_REDIS_PASSWORD=<random>
RATE_LIMITER_BACKEND=redis
RATE_LIMITER_REDIS_HOST=redis
RATE_LIMITER_REDIS_DB=1
RATE_LIMITER_REDIS_PASSWORD=<random>
SESSION_BACKEND=redis
SESSION_REDIS_HOST=redis
SESSION_REDIS_DB=2
SESSION_REDIS_PASSWORD=<random>
SESSION_SECURE_COOKIES=true
SESSION_TIMEOUT_MINUTES=480
# === RabbitMQ ===
TASKIQ_BROKER_TYPE=rabbitmq
TASKIQ_RABBITMQ_HOST=rabbitmq
TASKIQ_RABBITMQ_USER=sapari # NOT guest
TASKIQ_RABBITMQ_PASSWORD=<random> # NOT guest
TASKIQ_REDIS_HOST=redis
TASKIQ_REDIS_DB=3
# === Storage (R2) ===
STORAGE_ENDPOINT=https://<account-id>.r2.cloudflarestorage.com
STORAGE_ACCESS_KEY_ID=<from R2 dashboard>
STORAGE_SECRET_ACCESS_KEY=<from R2 dashboard>
STORAGE_BUCKET_RAW=sapari-raw
STORAGE_BUCKET_EXPORTS=sapari-exports
STORAGE_BUCKET_ASSETS=sapari-assets
STORAGE_MAX_UPLOAD_SIZE_MB=10240 # 10 GB
# === Video duration cap ===
MAX_VIDEO_DURATION_MINUTES=90 # aligns with render timeout
# === Media proxy (CF Worker) ===
MEDIA_TOKEN_SECRET=<openssl rand -base64 32> # MUST match Worker secret byte-for-byte
MEDIA_TOKEN_KID=v1
MEDIA_TOKEN_TTL_SECONDS=300
MEDIA_PROXY_BASE_URL=https://app.sapari.io
# === CORS ===
# Must include landing origin or newsletter signup POST is CORS-blocked.
CORS_ORIGINS=https://app.sapari.io,https://sapari.io
CORS_ALLOW_CREDENTIALS=true
# === Caddy / TLS ===
CLOUDFLARE_API_TOKEN=<scoped Zone:DNS:Edit on sapari.io>
# === OAuth ===
OAUTH_REDIRECT_BASE_URL=https://app.sapari.io
OAUTH_GOOGLE_CLIENT_ID=<from Google Cloud Console>
OAUTH_GOOGLE_CLIENT_SECRET=<from Google Cloud Console>
OAUTH_GITHUB_CLIENT_ID=<from GitHub OAuth App>
OAUTH_GITHUB_CLIENT_SECRET=<from GitHub OAuth App>
# === Stripe (live mode) ===
STRIPE_SECRET_KEY=sk_live_...
STRIPE_PUBLISHABLE_KEY=pk_live_...
STRIPE_WEBHOOK_SECRET=whsec_...
STRIPE_TEST_MODE=false
# === Email (Postmark) ===
POSTMARK_SERVER_TOKEN=<from Postmark>
EMAIL_SENDER_ADDRESS=hello@sapari.io
EMAIL_SENDER_NAME=Vitoria from Sapari
EMAIL_TEST_MODE=false
# === AI ===
OPENAI_API_KEY=sk-...
DEEPSEEK_API_KEY=...
# === Admin ===
ADMIN_USERNAME=<not 'admin'>
ADMIN_EMAIL=<address on sapari.io>
ADMIN_PASSWORD=<12+ chars, strong>
ADMIN_EMAIL_DOMAIN=sapari.io
# === Observability ===
LOGFIRE_TOKEN=<from logfire.pydantic.dev>
LOGFIRE_ENVIRONMENT=production
LOGFIRE_SERVICE_NAME=sapari-api
# === Feature flags / safety ===
PRODUCTION_SECURITY_VALIDATION_ENABLED=true
ENABLE_DOCS_IN_PRODUCTION=false
# === Resource limits (parameterized in docker-compose.prod.yml) ===
# Defaults in compose match staging (4 vCPU / 16 GB). These overrides are for prod (CPX62).
WEB_WORKERS=4
WEB_MEMORY=6g
WEB_CPUS=4.0
WEB_MEMORY_RESERVATION=1g
RENDER_MEMORY=6g
RENDER_CPUS=4.0
RENDER_FFMPEG_THREADS=4
RENDER_MEMORY_RESERVATION=2g
PROXY_MEMORY=3g
PROXY_CPUS=3.0
PROXY_FFMPEG_THREADS=3
DOWNLOAD_MEMORY=2g
DOWNLOAD_CPUS=2.0
ANALYSIS_MEMORY=2g
ANALYSIS_CPUS=1.5
ANALYSIS_TASKIQ_CONCURRENCY=4
REDIS_MAXMEMORY=500mb
REDIS_MEMORY=768m
REDIS_CPUS=0.5
RABBITMQ_MEMORY=1g
CADDY_MEMORY=256m
CADDY_CPUS=0.25
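A quick sanity check on the memory overrides above: summing the listed limits shows how much of the box they commit. This is a sketch that covers only the services with a *_MEMORY line in this template (others may carry limits defined elsewhere), and it treats the 32 GB host as 32 GiB, which is an approximation.

```python
# Memory limits copied from the template above (Docker-style m / g suffixes).
MEMORY_LIMITS = {
    "web": "6g",
    "render": "6g",
    "proxy": "3g",
    "download": "2g",
    "analysis": "2g",
    "redis": "768m",
    "rabbitmq": "1g",
    "caddy": "256m",
}

UNIT_MIB = {"m": 1, "g": 1024}
HOST_MIB = 32 * 1024  # assumption: CPX62's 32 GB treated as 32 GiB


def to_mib(value: str) -> int:
    """Convert a value such as '768m' or '6g' to MiB."""
    return int(value[:-1]) * UNIT_MIB[value[-1].lower()]


total_mib = sum(to_mib(v) for v in MEMORY_LIMITS.values())
print(total_mib)             # 21504 MiB, i.e. 21 GiB committed by these limits
print(HOST_MIB - total_mib)  # 11264 MiB left for the OS and unlisted services
```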
See also¶
For deeper detail on any phase:
- deployment.md: deploy strategies, image tags, scaling, rollback mechanics
- migrations.md: Alembic workflow, production safety, downgrades, two-deploy rule
- external-services.md: provisioning runbook for Neon, R2, Stripe, Postmark
- cloudflare-workers.md: Worker deploy, secrets, route-pattern gotcha, troubleshooting matrix
- monitoring.md: Logfire instrumentation, span taxonomy, alert thresholds, Beszel/Dozzle
- scripts.md: what each script in scripts/deployment/ does
- media-token-rotation.md: annual rotation procedure for MEDIA_TOKEN_SECRET