Production Launch Runbook

End-to-end procedure for taking Sapari from "staging works" to "production live" on a self-managed Hetzner CPX62 (16 vCPU / 32 GB) sized for up to 1000 concurrent users.

This document is the launch-day spine: pace yourself through Phases 1-6 in order, and jump into a dedicated runbook (linked at the bottom) when a step needs more depth. Time budget: ~6-8 hours of focused work spread across a day or two; Phase 5 (cutover) takes ~2 hours including smoke testing.

For ongoing operations (steady-state deploys, rollbacks, monitoring) see deployment.md. This runbook only covers the first-time launch and the operational gotchas that go with it.

Topology

                    ┌──────────────────────────┐
                    │ Cloudflare DNS (sapari.io)│
                    └────────────┬─────────────┘
                                 │
          ┌──────────────────────┼──────────────────────────┐
          │                      │                          │
          ▼                      ▼                          ▼
    sapari.io          app.sapari.io                api.sapari.io
    (Pages: landing)   (Pages: frontend)            (grey cloud)
                            │                              │
                            │  /api/*, /media/v1/*         │
                            ▼                              ▼
                  ┌──────────────────┐            ┌──────────────────┐
                  │ CF Worker        │            │ Hetzner CPX62    │
                  │ (sapari-proxy-   │            │ 16 vCPU / 32 GB  │
                  │  production)     │────────────│ Caddy → web      │
                  └────────┬─────────┘            │ + 6 TaskIQ       │
                           │                      │   workers        │
                           │ R2 bindings          │ + scheduler      │
                           ▼                      │ + Redis          │
                    ┌──────────────┐              │ + RabbitMQ       │
                    │ R2 buckets   │              └────────┬─────────┘
                    │ raw/exports/ │                       │
                    │  assets      │                       │ DATABASE_URL
                    └──────────────┘                       ▼
                                                  ┌──────────────────┐
                                                  │ Neon Postgres    │
                                                  │ (Launch tier)    │
                                                  └──────────────────┘

External integrations: Stripe (webhooks → api.sapari.io), Postmark (email), OpenAI + DeepSeek (AI), Logfire (observability), Sentry (frontend, optional), Discord (Beszel alerts, optional), Tailscale (CI/CD SSH).

Capacity Sizing — Target

Goal: keep the app decently usable for up to 1000 concurrent users on a single CPX62 box.

Per-user resource footprint (verified in infrastructure/events/subscriber.py:266-312):

- 1 long-lived SSE response held by one uvicorn worker
- 2 Redis pubsub connections (notification channel + asset channel, fan-in pattern)
- +1 more Redis pubsub conn when in the editor (subscribe_to_project)

Aggregate at 1000 concurrent users:

- ~3000-4000 file descriptors held by the web container
- ~2300 Redis pubsub connections
- ~50-100 simultaneous active request senders out of 1000 (rest idle on SSE)
- Realistic concurrent DB queries: ~30-50
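The ~2300 pubsub figure is consistent with the per-user footprint if roughly 30% of users have the editor open at once. That share is an assumption picked here to reproduce the number, not a measured value. A minimal sketch of the arithmetic:

```sh
# pubsub_conns USERS EDITOR_PCT
# Every connected user holds 2 pubsub connections; users in the editor hold 1 more.
# EDITOR_PCT is an assumed share of users with the editor open (not measured).
pubsub_conns() {
  users=$1
  editor_pct=$2
  echo $(( users * 2 + users * editor_pct / 100 ))
}

pubsub_conns 1000 30   # prints 2300
```

Re-running it with a higher editor share shows how quickly the Redis connection count approaches 3000.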

Box budget:

| Component | CPU (vCPU) | Memory (GB) |
|---|---|---|
| Web (FastAPI, 4 workers) | 4.0 | 6.0 |
| 6 TaskIQ workers + scheduler | 11.6 | 14.5 |
| Caddy + Redis + RabbitMQ | 1.25 | 2.0 |
| Dozzle + Beszel hub + agent | 0.45 | 0.4 |
| Allocated | ~17.3 | ~22.9 |
| CPX62 | 16 | 32 |
| Headroom (incl. OS/page cache/FFmpeg surge) | -1.3 (oversubscription, fine on async I/O) | 9.1 (28%) |

CPU oversubscription is fine on shared-vCPU CPX62 because Sapari is async-I/O-bound (asyncpg, httpx, Redis) — workers spend most cycles waiting on network. The CPU caps are ceilings, not reservations.
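The box budget totals can be re-derived whenever a cap changes. The sketch below sums the per-group CPU and memory caps from the table; the function name and the input layout (one "cpu mem" pair per line) are illustrative choices, not part of the repo.

```sh
# budget_summary BOX_VCPU BOX_GB
# stdin: one "cpu mem" pair per line (container-group caps from the table above).
budget_summary() {
  awk -v box_cpu="$1" -v box_mem="$2" '
    { cpu += $1; mem += $2 }
    END {
      printf "cpu=%.1f cpu_headroom=%.1f mem=%.1f mem_headroom=%.1f (%.0f%%)\n",
        cpu, box_cpu - cpu, mem, box_mem - mem, (box_mem - mem) / box_mem * 100
    }'
}

budget_summary 16 32 <<'EOF'
4.0 6.0
11.6 14.5
1.25 2.0
0.45 0.4
EOF
```

A negative cpu_headroom is the oversubscription discussed above; a shrinking memory headroom is the number to watch, since memory caps cannot be oversubscribed safely.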

Architectural Ceilings — Acknowledged for v1

Limits inherent to the single-box architecture. Tuning does NOT solve them; they need product/UX decisions or post-v1 horizontal scaling. Listed up front so they don't get lost:

Render throughput is single-threaded. One render worker, one FFmpeg job at a time. A 60-min source render takes ~30-60 min wallclock. At 1000 concurrent users, even 0.5% triggering a render in the same hour = 5 queued, last user waits 2+ hours. v1 mitigation: explicit "Queued" UX (already in export-progress UI — verify copy in smoke test). Post-v1 fix: render worker replicas.
Proxy throughput is single-threaded (same shape, lower demand — only triggers on codec mismatch on upload).
External API rate limits:
OpenAI Whisper: ~50 RPM on Tier 1. Verify Tier 2+ for 1000-user target.
DeepSeek: verify plan limits.
Postmark: realistic email volume at 1000 users is 2-5k/month (auth + billing events; render/analysis are SSE, not email). Starter ($15/mo, 10k cap) covers steady state; 50k tier ($50/mo) buys launch-burst headroom.
SSE reconnect storms. Web container restart drops 1000 SSE connections simultaneously; frontend exponential backoff (1s, 2s, 4s, 8s, 16s, cap 30s) helps but resets on onopen — second-wave thunder is possible. Mitigation (jitter on first retry) is post-launch.
RabbitMQ queue depth visibility. No default operator dashboard for queue depth. Mgmt UI at :15672 shows it but isn't exposed. Add queue-depth Logfire span or admin endpoint post-launch.
Single-box failure mode. OOM, FFmpeg subprocess crash, RabbitMQ memory pressure — any one takes down some-or-all users. v1 accepts; manual recovery in §Operational Gotchas.
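The render-throughput wait quoted above can be estimated with simple arithmetic. This sketch assumes renders run strictly one at a time and that every job takes the same wallclock time; both are simplifications.

```sh
# render_queue_wait QUEUED MINUTES_PER_JOB
# Minutes the last queued job waits before it even starts,
# with a single render worker processing jobs serially.
render_queue_wait() {
  queued=$1
  per_job=$2
  echo $(( (queued - 1) * per_job ))
}

render_queue_wait 5 30   # prints 120
render_queue_wait 5 60   # prints 240
```

With 5 queued renders (0.5% of 1000 users) the last one starts after 2 to 4 hours, which is why the "Queued" copy matters more than any tuning knob here.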

Prerequisites — Accounts You Need

Sign up for these before starting Phase 1. None require coordination with another step.

Phase 1 — Provision External Services

Order doesn't matter; everything is independent. Aim to finish in one sitting so secrets are fresh.

1.1 Neon Postgres (Launch tier)

Create new Neon project named sapari-production
Region: pick the same one as the Hetzner box (Ashburn → us-east-1, Hillsboro → us-west-2, Falkenstein → eu-central-1)
Subscribe to Launch tier ($0.106/CU-hour). Free tier is too restrictive for prod
Scale-to-zero: keep enabled (default 5 min). pool_pre_ping=True (session.py:14-19) + pool_recycle=300 catch stale-after-pause connections before any query runs. Watch for Logfire db.* p99 spikes post-launch; if seen, set min CU=1 ($76/mo always-on)
Copy the direct endpoint connection string (NOT pooler) — convention #16
Convert to async form: postgresql+asyncpg://...
Save as DATABASE_URL for Phase 3
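The always-on price quoted for min CU=1 follows from the Launch-tier rate. A minimal sketch, assuming a 720-hour (30-day) month; actual billing depends on the real hours in the month and on autoscaling above the minimum.

```sh
# neon_monthly_cost MIN_CU RATE_PER_CU_HOUR HOURS_PER_MONTH
# Cost of keeping MIN_CU compute units running for the whole month.
neon_monthly_cost() {
  awk -v cu="$1" -v rate="$2" -v hours="$3" \
    'BEGIN { printf "%.2f\n", cu * rate * hours }'
}

neon_monthly_cost 1 0.106 720   # prints 76.32
```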

Connection budget: at min CU=1, max_connections=419. Cluster ceiling at WEB_WORKERS=4 is web 4×30 + workers 6×30 + scheduler 30 = 330. Comfortable 21% headroom. Auto-scales to 16 CU under burst.
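The connection ceiling can be recomputed when WEB_WORKERS or the pool settings change. The sketch assumes each process can open at most POSTGRES_POOL_SIZE + POSTGRES_MAX_OVERFLOW connections (20 + 10 = 30 with the Phase 3.1 values) and that the scheduler counts as one extra process; the function names are illustrative.

```sh
# conn_ceiling WEB_WORKERS TASK_WORKERS POOL_SIZE MAX_OVERFLOW
# Worst-case open connections: (web + task workers + 1 scheduler) * per-process pool cap.
conn_ceiling() {
  echo $(( ($1 + $2 + 1) * ($3 + $4) ))
}

# conn_headroom_pct CEILING MAX_CONNECTIONS
# Integer percent of max_connections left unused (rounded down).
conn_headroom_pct() {
  echo $(( ($2 - $1) * 100 / $2 ))
}

conn_ceiling 4 6 20 10       # prints 330
conn_headroom_pct 330 419    # prints 21
```

Raising WEB_WORKERS to 8 would push the ceiling to 450, above the 419 limit at min CU=1, so pool sizes would need to shrink first.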

1.2 Cloudflare R2 (3 buckets)

Create buckets: sapari-raw, sapari-exports, sapari-assets
Enable versioning on all three before any data lands. R2 doesn't version by default; an accidental delete or overwrite is gone forever. Toggle at the bucket level in the dashboard. CANNOT be retroactively enabled to recover already-deleted files
Create an API token scoped to Object Read & Write on all three buckets only — narrower than account-wide. Token scope is one-time at creation; can't be narrowed retroactively
Save: STORAGE_ACCESS_KEY_ID, STORAGE_SECRET_ACCESS_KEY, STORAGE_ENDPOINT (the https://<account-id>.r2.cloudflarestorage.com form)

1.3 Cloudflare Worker secret (placeholder)

MEDIA_TOKEN_SECRET=$(openssl rand -base64 32) — save it; this exact value goes both in the backend .env and as a Worker secret. Byte-identical is load-bearing
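Because a stray newline or space is enough to break the byte-identical requirement, it can help to compare short digests of the two copies instead of eyeballing the secret. The helper below is a hypothetical convenience; its digest is not claimed to match the fingerprint format the backend logs.

```sh
# secret_digest VALUE
# Prints the first 8 hex chars of SHA-256 over the exact bytes of VALUE.
# printf '%s' avoids the trailing newline that echo would append.
# Assumes GNU coreutils sha256sum (on macOS, substitute: shasum -a 256).
secret_digest() {
  printf '%s' "$1" | sha256sum | cut -c1-8
}

# Usage: run against the value in .env and against the value you are about to
# paste into the Worker secret; the two digests must be equal.
#   secret_digest "$MEDIA_TOKEN_SECRET"
```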

1.4 Stripe (live mode)

Toggle Stripe Dashboard to live mode
Copy sk_live_... and pk_live_... → STRIPE_SECRET_KEY, STRIPE_PUBLISHABLE_KEY. Test-mode keys (sk_test_) on production fail every transaction silently; live-mode keys on staging will charge real cards. Verify mode visually before copying
Create webhook endpoint at https://api.sapari.io/api/v1/webhooks/stripe — direct to backend (api.*, NOT app.*). URL won't resolve yet; create it anyway. The CF Worker proxy is for browser API calls; webhooks should hit Caddy → backend directly
Subscribe to events: customer.subscription.updated, customer.subscription.deleted, invoice.payment_failed, charge.refunded, checkout.session.completed
Copy webhook signing secret (different from API keys; one per endpoint) → STRIPE_WEBHOOK_SECRET. A wrong webhook secret silently breaks payment processing — backend returns 401, Stripe retries a few times then gives up, subscriptions don't activate. Symptom: "user paid but doesn't see credits." Test reachability post-deploy via Stripe Dashboard's "Send test webhook" button (Phase 4.7)
Set STRIPE_TEST_MODE=false
Tier 3 (Creator) and Tier 4 (Viral) products + prices are auto-seeded by seed_stripe_products.py on first deploy; do not pre-create
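Since test and live keys differ only by prefix, a small guard can catch a wrong-mode key before it reaches .env. This is a sketch of a hypothetical helper; it inspects the prefix only and does not call Stripe.

```sh
# stripe_key_mode KEY
# Classifies a Stripe secret or publishable key by its prefix.
stripe_key_mode() {
  case "$1" in
    sk_live_*|pk_live_*) echo live ;;
    sk_test_*|pk_test_*) echo test ;;
    *) echo unknown ;;
  esac
}

# Usage before writing the production .env:
#   [ "$(stripe_key_mode "$STRIPE_SECRET_KEY")" = "live" ] || echo "wrong mode" >&2
```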

1.5 Postmark

DNS propagation can take hours — start early in Phase 1 so it's done by Phase 5.

Add sapari.io as a sender domain
Add all three DNS records to Cloudflare. Postmark Dashboard surfaces exact values:
DKIM (<selector>._domainkey.sapari.io TXT) — signs outbound mail
SPF (TXT on apex) — authorizes Postmark to send on your behalf
Return-Path (CNAME) — bounce-handling subdomain
Wait until all three show green in Postmark dashboard before proceeding. Skipping = mail in spam or bounced
Plan tier: Starter ($15/mo, 10k emails) covers 1000 users in steady state (auth + billing events only — render/analysis/asset events are SSE, not email). Realistic volume ~2-5k emails/month. Upgrade to 50k tier ($50/mo) if you want launch-burst headroom
Use a separate Postmark server token per environment (one for staging, one for production). Sharing muddles deliverability stream
Save the production server token → POSTMARK_SERVER_TOKEN
Sender reputation warning: even with green DKIM/SPF/Return-Path, Postmark starts new domains with low reputation. ISPs throttle. Don't blast 1000+ users on day 1 — drip launch announcements

1.6 OpenAI + DeepSeek

Create API key at platform.openai.com → OPENAI_API_KEY. Tier 2+ recommended for 1000-user target: Tier 1 caps Whisper at ~50 RPM; sustained burst at scale could 429. OpenAI auto-tiers up with usage history; if you've been on Tier 1, request Tier 2 explicitly via support before launch
Create API key at platform.deepseek.com → DEEPSEEK_API_KEY. Verify plan limits cover ~80 analyses/hour throughput target
Set hard spending caps in both dashboards (recommended: $200/mo OpenAI, $50/mo DeepSeek to start; tune after real usage)

1.7 Logfire

Create project named sapari-production (or share with staging using environment tag)
Get write token → LOGFIRE_TOKEN
If sharing with staging: set LOGFIRE_ENVIRONMENT=production in prod .env so spans are tagged

1.8 Optional but recommended

Sentry — frontend project, copy DSN. Add to frontend/.env.production as VITE_SENTRY_DSN. Frontend's perfMarks.ts already adds breadcrumbs under category perf
Discord webhook — create in your Discord server. Beszel uses shoutrrr format, NOT raw HTTPS URL. Beszel silently swallows malformed URLs:
```
Raw Discord URL:     https://discord.com/api/webhooks/<webhook-id>/<token>
Shoutrrr format:     discord://<token>@<webhook-id>
```
Note token first, then ID — opposite of the URL form. Verify alerts arrive (stress -c 4 -t 60 triggers CPU>80% alert)
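The URL-to-shoutrrr conversion is mechanical, so scripting it avoids swapping the two fields by hand. A minimal sketch using only shell parameter expansion; the function name is illustrative.

```sh
# discord_to_shoutrrr WEBHOOK_URL
# Converts https://discord.com/api/webhooks/<webhook-id>/<token>
# into discord://<token>@<webhook-id>.
discord_to_shoutrrr() {
  url=$1
  rest=${url#https://discord.com/api/webhooks/}
  # If the prefix did not match, or there is no id/token separator, bail out.
  case "$rest" in
    "$url"|*/) echo "not a Discord webhook URL: $url" >&2; return 1 ;;
    */*) ;;
    *) echo "not a Discord webhook URL: $url" >&2; return 1 ;;
  esac
  id=${rest%%/*}
  token=${rest#*/}
  printf 'discord://%s@%s\n' "$token" "$id"
}

discord_to_shoutrrr "https://discord.com/api/webhooks/123456/abcDEF"
# prints discord://abcDEF@123456
```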

Phase 2 — Hetzner CPX62 + Tailscale + Caddy Setup

Allow ~90 min start to finish; longer if first time.

2.1 Pick the box

Size: CPX62 — 16 shared vCPU, 32 GB RAM, 640 GB SSD, 20 TB transfer, $59.49/mo. Sized to support 1000 concurrent users with the production tuning. CCX33 (8 vCPU dedicated, 32 GB) is a viable swap if you prefer dedicated CPU at higher cost
OS: Ubuntu 24.04 LTS — setup-server.sh checks for ssh.service (24.04+) vs sshd.service (older); 24.04 is the tested baseline
Region: match the Neon region from Phase 1.1. The DB-to-server hop is on the critical path for every API request
SSH key: add yours to the Hetzner project (cloud-init drops it into root@ authorized_keys automatically)
Firewall: attach the existing firewall-tailscale Hetzner Cloud Firewall at server creation if you have one. Defense in depth on top of host-level UFW (which setup-server.sh configures separately). Rule set should cover 443/tcp + 443/udp (HTTP/3) from anywhere, 22/tcp from Tailscale CGNAT (100.64.0.0/10), and IPv6 sources (::/0) — not just IPv4
Enable automated backups at order time — Hetzner adds ~20% to monthly cost (~$12/mo for CPX62) for 7-day rolling snapshots of the entire disk. Toggle during provisioning; flipping it on after leaves an initial unprotected window. Dramatically cheaper than a custom backup pipeline; covers Redis, RabbitMQ, Caddy certs, .env in one operation
Note the IPv4 — this becomes api.sapari.io's A record

2.2 Add the DNS record now (low TTL, grey cloud)

Do this before setup-server.sh so DNS has time to propagate by the time Caddy needs it for ACME DNS-01.

Cloudflare DNS for sapari.io → add A record:
Name: api
IPv4: <hetzner-ip>
Proxy: OFF (grey cloud) — Caddy terminates TLS itself; double-proxying through CF orange-cloud breaks the cert flow at this point. Once stable, flipping to orange-cloud is a post-launch hardening step (see Phase 6.3)
TTL: 60 seconds — fast rollback option during cutover. Bump to Auto in Phase 6

2.3 Run setup-server.sh

setup-server.sh is idempotent (safe to re-run) but takes a required --my-ip flag — whitelists only that IP for SSH on port 22. Wrong IP = locked out (Hetzner has console rescue if needed).

# Find your operator IP first:
curl -s ifconfig.me

# Then on the server, as root:
ssh root@<hetzner-ip>
git clone https://github.com/benavlabs/sapari.git /opt/sapari-bootstrap
cd /opt/sapari-bootstrap
./scripts/deployment/setup-server.sh --my-ip <your-operator-ip> --hostname sapari-prod

What it does (verify each):

| Step | What |
|---|---|
| Hostname | Sets to `sapari-prod`; updates `/etc/hosts` |
| `deploy` user | Created with `sudo` + `docker` group, passwordless sudo (intentional for CD; safe because SSH is key-only and tailnet-gated) |
| SSH hardening | Disables root login, disables password auth, requires pubkey |
| UFW firewall | Default deny inbound; allows `443/tcp` from anywhere; `22/tcp` only from `--my-ip`. No port 80, because Caddy uses DNS-01 ACME |
| `unattended-upgrades` + `fail2ban` | Auto-security-patches; SSH brute-force throttling |
| Docker | Official `get.docker.com` install |
| GitHub deploy key | Generates Ed25519 keypair at `/home/deploy/.ssh/github_deploy_key`; pre-seeds `known_hosts` for github.com. Prints the public key at the end; manually add it to the repo's Deploy Keys |
| Origin firewall service | Installs `sapari-docker-firewall.service` (oneshot systemd unit, `After=docker.service`) that locks the Docker `DOCKER-USER` chain :443 to Cloudflare's published egress IPs. See Phase 6.3 |

What it does NOT do (manual, in 2.4):

- Install or join Tailscale
- Add swap
- Verify NTP
- Configure GHCR auth (not needed; images are public)

Copy printed Ed25519 public key
GitHub repo → Settings → Deploy keys → Add deploy key → paste, name it sapari-prod, leave "Allow write access" unchecked → Add
Verify deploy user can clone: ssh deploy@<hetzner-ip> "ssh -T git@github.com" should print "Hi ! ..."

2.4 Manual hardening — swap, NTP, Tailscale

Swap (prevents FFmpeg OOM):

Hetzner cloud images ship with zero swap. Render worker FFmpeg + uploaded video can spike past per-container limits during a large render → swapless OOM takes down the whole box.

# As root or with sudo:
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
swapon --show  # verify

4 GB swap created and persisted across reboots

NTP (load-bearing for JWT verification):

Media token JWTs have a 5-minute TTL. Clock skew of even ~10s between server and CF Worker breaks playback ("token expired" 401s on freshly-minted URLs). systemd-timesyncd is on by default in Ubuntu 24.04, but verify:

timedatectl
# Want: "System clock synchronized: yes" + "NTP service: active"

Clock is NTP-synced
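For a scripted pre-flight, the two timedatectl lines can be checked mechanically. This sketch parses the text output shown above; the function name is illustrative, and it assumes the English-locale wording of timedatectl.

```sh
# clock_is_synced
# Reads `timedatectl` output on stdin; succeeds only if the clock is
# synchronized AND the NTP service is active.
clock_is_synced() {
  status=$(cat)
  printf '%s\n' "$status" | grep -q 'System clock synchronized: yes' &&
    printf '%s\n' "$status" | grep -q 'NTP service: active'
}

# Usage on the server:
#   timedatectl | clock_is_synced && echo "clock OK" || echo "clock NOT synced"
```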

Tailscale (CI access path):

Tailscale is pre-installed on Hetzner Ubuntu but not running. Bring up manually:

sudo tailscale up --hostname=sapari-prod
# Follow URL, log in, authorize the node

Then in Tailscale admin console:

Machines → find sapari-prod → Edit ACL tags → add tag:server
Verify ACL grants tag:ci → tag:server (or allow-all). The OAuth client used by GitHub Actions is reusable from staging
Note the tailnet IP (100.x.y.z) — this is what GitHub Actions SSHs to, not the public IPv4. Save as SSH_HOST GitHub secret for the production environment

2.5 GitHub Actions secrets for the `production` environment

GitHub repo → Settings → Environments → New environment named production. Add:

SSH_HOST — tailnet IP from 2.4 (not public IPv4)
SSH_KEY — private half of a fresh Ed25519 keypair generated locally; install public half on server with ssh-copy-id deploy@<tailnet-ip> (over Tailscale). Keep this key separate from your operator key — CI-only, revocable independently
TS_OAUTH_CLIENT_ID and TS_OAUTH_SECRET — same as staging environment

Set deploy gate (typed YES confirmation, required reviewer, deploy window) on the production environment's protection rules if desired.

2.6 Caddy — what to expect, and the F1 gotcha

Caddy is one of the containers first-deploy.sh will start in Phase 3. A few things to know:

Config: caddy/Caddyfile (committed). Reverse-proxies api.sapari.io → web:8000, sets X-Forwarded-For from Cloudflare's cf-connecting-ip so backend logs see real client IPs
TLS via Cloudflare DNS-01: needs CLOUDFLARE_API_TOKEN in .env (Phase 3.1) with Zone:DNS:Edit permission on sapari.io only. First deploy: cert acquisition takes ~30-60 seconds; check docker logs caddy if no successful cert event
F1 (known gotcha — not yet fixed in deploy.sh): docker-compose.prod.yml bind-mounts ./caddy/Caddyfile. When deploy.sh runs git reset --hard, git replaces the file via tempfile rename — inode changes, container's bind mount points at stale inode. Caddy doesn't see new config until the container is force-recreated.

Workaround: any time you change the Caddyfile, after deploy completes, run on the server: docker compose -f docker-compose.prod.yml up -d --force-recreate caddy. Burn into muscle memory.
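The inode behaviour behind F1 can be reproduced locally without Docker. The sketch below uses a scratch directory and a stand-in file named Caddyfile; it only demonstrates that replace-by-rename yields a new inode while an in-place write keeps the old one, which is the property a single-file bind mount is sensitive to.

```sh
# inode_of FILE: prints the inode number of FILE.
inode_of() { ls -i "$1" | awk '{ print $1 }'; }

demo_dir=$(mktemp -d)
echo "v1" > "$demo_dir/Caddyfile"
before=$(inode_of "$demo_dir/Caddyfile")

# Replace via tempfile + rename (the pattern described for git reset --hard).
echo "v2" > "$demo_dir/Caddyfile.tmp"
mv "$demo_dir/Caddyfile.tmp" "$demo_dir/Caddyfile"
after_rename=$(inode_of "$demo_dir/Caddyfile")

# Overwrite in place: same path, same inode.
echo "v3" > "$demo_dir/Caddyfile"
after_inplace=$(inode_of "$demo_dir/Caddyfile")

rm -rf "$demo_dir"

[ "$before" != "$after_rename" ] && echo "rename: inode changed"
[ "$after_rename" = "$after_inplace" ] && echo "in-place write: inode kept"
```

A container that bind-mounted the file before the rename keeps pointing at the old inode, hence the force-recreate.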

2.7 Pre-flight before Phase 3

dig api.sapari.io +short from anywhere returns the Hetzner IPv4
ssh deploy@<tailnet-ip> works (over Tailscale)
ssh deploy@<tailnet-ip> "docker --version" prints a version (deploy in docker group)
ssh deploy@<tailnet-ip> "swapon --show" shows the 4 GB swapfile
ssh deploy@<tailnet-ip> "timedatectl | grep synchronized" shows yes
GitHub repo Deploy keys list contains sapari-prod Ed25519 key
GitHub production environment has all four secrets set

Phase 3 — First Deploy

This is where the prod app first runs. Plan ~1 hour with debugging margin.

3.1 Clone repo + write `.env.production`

ssh deploy@<tailnet-ip>  # via Tailscale
git clone https://github.com/benavlabs/sapari.git ~/sapari
cd ~/sapari
cp backend/.env.production.example .env

Open .env and fill in every required value. Use the Environment Variable Reference at the bottom as the checklist

Critical values the security validator enforces (app refuses to start otherwise):

- SECRET_KEY: 32+ chars; python -c "import secrets; print(secrets.token_urlsafe(64))"
- POSTGRES_PASSWORD: must NOT be postgres (or pass full DATABASE_URL instead)
- CREATE_TABLES_ON_STARTUP=false
- ENVIRONMENT=production
- DEBUG=false
- STRIPE_TEST_MODE=false
- SESSION_SECURE_COOKIES=true
- ADMIN_USERNAME: NOT admin
- ADMIN_PASSWORD: 12+ chars, not in weak password list
- TASKIQ_RABBITMQ_USER and _PASSWORD: NOT guest/guest

Differs from staging (double-check):

- OAUTH_REDIRECT_BASE_URL=https://app.sapari.io
- MEDIA_PROXY_BASE_URL=https://app.sapari.io (no trailing slash)
- FRONTEND_URL=https://app.sapari.io
- API_PUBLIC_URL=https://api.sapari.io (newsletter confirm/unsubscribe email links)
- LANDING_URL=https://sapari.io (newsletter confirm/unsubscribe redirect targets)
- CORS_ORIGINS=https://app.sapari.io,https://sapari.io (must include landing or newsletter signup POST is CORS-blocked)
- LOGFIRE_ENVIRONMENT=production
- All Stripe keys live (sk_live_, pk_live_, whsec_ from live webhook)

Tuning values to set explicitly for prod (these turn the parameterization on; without them, compose uses staging defaults):

- WEB_WORKERS=4, WEB_MEMORY=6g, WEB_CPUS=4.0, WEB_MEMORY_RESERVATION=1g
- RENDER_MEMORY=6g, RENDER_CPUS=4.0, RENDER_FFMPEG_THREADS=4, RENDER_MEMORY_RESERVATION=2g
- PROXY_MEMORY=3g, PROXY_CPUS=3.0, PROXY_FFMPEG_THREADS=3
- DOWNLOAD_MEMORY=2g, DOWNLOAD_CPUS=2.0
- ANALYSIS_MEMORY=2g, ANALYSIS_CPUS=1.5, ANALYSIS_TASKIQ_CONCURRENCY=4
- REDIS_MAXMEMORY=500mb, REDIS_MEMORY=768m, REDIS_CPUS=0.5
- RABBITMQ_MEMORY=1g
- CADDY_MEMORY=256m, CADDY_CPUS=0.25
- POSTGRES_POOL_SIZE=20, POSTGRES_MAX_OVERFLOW=10
- STORAGE_MAX_UPLOAD_SIZE_MB=10240
- MAX_VIDEO_DURATION_MINUTES=90
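Because unset tuning variables silently fall back to staging defaults, a quick scan of .env for missing or empty keys is a cheap guard before the first deploy. The helper below is a hypothetical sketch, not a script from the repo; it only checks that each key has a non-empty value, not that the value is sensible.

```sh
# missing_env_keys ENV_FILE KEY...
# Prints every KEY that has no non-empty "KEY=value" line in ENV_FILE.
# Commented-out lines ("# KEY=...") do not count as set.
missing_env_keys() {
  env_file=$1
  shift
  for key in "$@"; do
    grep -q "^${key}=." "$env_file" || echo "$key"
  done
}

# Usage on the server, from ~/sapari:
#   missing_env_keys .env WEB_WORKERS WEB_MEMORY WEB_CPUS RENDER_MEMORY \
#     POSTGRES_POOL_SIZE POSTGRES_MAX_OVERFLOW
# No output means every listed key is set.
```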

3.2 Verify GHCR image pull

GHCR images are public today — no docker login needed.

docker pull ghcr.io/benavlabs/sapari-backend:production

Pull succeeds. If you ever flip the repo to private: PAT with read:packages, then echo $TOKEN | docker login ghcr.io -u <user> --password-stdin

3.3 Run first-deploy.sh

./scripts/deployment/first-deploy.sh

Order: pull image → migrate (alembic upgrade head) → seed (tiers, admin user, Stripe products) → start all services → health check.

First-time seed creates: 4 tiers (free/hobby/creator/viral), the admin user, Stripe products for Creator + Viral
Watch for migration errors — they abort the deploy
If seed_stripe_products.py errors → STRIPE_SECRET_KEY is wrong. Fix and re-run just the seed: ./scripts/deployment/run-task.sh backend/scripts/seed_stripe_products.py

3.4 Verify the box is alive (before DNS routes traffic)

# On the server:
curl -f http://localhost:8000/health           # liveness
curl -f http://localhost:8000/health/ready     # readiness — DB + Redis + RabbitMQ + storage all green

# From your laptop, hitting IP directly (Caddy will reject; cert is for api.sapari.io):
curl -k https://<hetzner-ip>/health -H "Host: api.sapari.io"

Liveness: 200
Readiness: 200 with all subsystems green
If Caddy hasn't obtained the cert, check docker logs caddy — DNS-01 needs valid CLOUDFLARE_API_TOKEN and api.sapari.io resolving to this IP

3.5 Verify externally once DNS propagates

# After a couple minutes:
curl https://api.sapari.io/health
curl https://api.sapari.io/health/ready

Both return 200 with valid certs (no -k needed)

Phase 4 — Cloudflare Worker + Pages + DNS

The Worker handles /api/* proxying to api.sapari.io and /media/v1/* for R2 media. Pages serves the frontend at app.sapari.io and the landing at sapari.io.

4.1 Frontend build env

In frontend/, create or update .env.production (committed file is fine; no secrets):
VITE_API_BASE_URL=https://app.sapari.io (Worker proxies /api/* to backend)
VITE_SENTRY_DSN=<from 1.8> if using Sentry
Push to main (or your prod branch) — Cloudflare Pages builds + deploys automatically

4.2 Frontend custom domain

CF Pages dashboard → frontend project → Custom domains → add app.sapari.io
CF auto-creates the CNAME; verify it propagates

4.3 Landing page

Same flow for landing/ Pages project — custom domain sapari.io (apex). Pages handles apex via CNAME flattening
Set PUBLIC_API_ORIGIN=https://api.sapari.io in the landing Pages project env vars so the newsletter signup form POSTs to the correct origin (empty value renders an inline "Misconfigured" error on the signup button)

4.4 Cloudflare Worker — secret + deploy

# From your laptop, in worker/ directory:
cd worker
npx wrangler secret put MEDIA_TOKEN_SECRET_V1 --env production
# Paste the SAME value you put in backend .env as MEDIA_TOKEN_SECRET — byte-identical

npm run deploy:production
# If wrangler reports "No deploy targets":
npx wrangler versions deploy --env production

Verify fingerprints match. Backend logs: docker logs sapari-backend | grep media_token should print media_token: active=v1 registry=[v1:<8-char-hex>]. Then npx wrangler tail --env production and load a clip in the browser — same fingerprint should appear

4.5 Cloudflare Worker — route patterns (dashboard, NOT wrangler.toml)

Worker [[routes]] in wrangler.toml is intentionally empty — Pages binding can't coexist with route patterns there (CF returns error 10144 if both are declared).

CF Dashboard → Workers & Pages → sapari-proxy-production → Settings → Domains & Routes → Add (order matters — first match wins):
app.sapari.io/media/v1/* ← add this first
app.sapari.io/api/* ← then this

If /api/* comes before /media/v1/*, media requests like /media/v1/<jwt> match the API rule first and get proxied to the backend. Symptom: 404 on every clip play. No error message at deploy — purely order-of-rules.

Without route patterns at all, requests bypass the Worker and hit Pages' SPA fallback (returns index.html for /api/* — symptom is "API calls return HTML").

4.6 Cloudflare Access policies

Production app (app.sapari.io): NO Access policy. Public
internal-docs.sapari.io: same Access app as staging (GitHub team benavlabs/Sapari)
If/when adding dozzle-prod.sapari.io and beszel-prod.sapari.io: gate via the same Access app

4.7 Stripe webhook reachability test

Stripe Dashboard → Webhooks → your endpoint → "Send test webhook" → pick customer.subscription.updated
Check docker logs sapari-backend | grep webhook — should show signature-verified receipt

Phase 5 — Cutover & Smoke Test

Live moment. ~2 hours including verification.

5.1 Pre-flight (do once, before the cutover hour)

All boxes in Phases 1-4 ticked
https://api.sapari.io/health/ready returns 200 from outside the network
https://app.sapari.io loads the frontend (with Worker routes, /api/* proxies)
Stripe live webhook test passes
Postmark send-test passes
TTL on api.sapari.io is 60s (set in 2.2)

5.2 Smoke test the running prod app (before announcing)

Run through every critical user journey while the app is live but unannounced. Use a real (non-admin) user account or create one fresh.

Auth + onboarding

- [ ] Signup → email verification → login → logout → login again
- [ ] OAuth: Google login (if configured), GitHub login (if configured)

Editor cold paths

- [ ] Create two projects, switch between them: editor doesn't blank, no wrong-project mutation if you spam delete-edit
- [ ] Cold load /projects/:uuid: editor skeleton renders, not generic spinner
- [ ] Cold load /projects/new: list skeleton, NOT editor skeleton

Upload paths

- [ ] Upload < 25 MiB clip (single-PUT path) → presign + PUT, confirm endpoint succeeds
- [ ] Upload 100 MB+ clip (multipart path) → multipart initiate + parallel parts + complete; faster than single-PUT
- [ ] Multipart cancel + resume: cancel mid-upload, drop same file in again → resumes from next missing part (network tab: /multipart/parts listing precedes new PUTs)
- [ ] Upload 11 GB file → frontend rejects with friendly "10 GB max" copy BEFORE any bytes upload
- [ ] Upload 95-min video → frontend rejects with "90 minutes max" copy
- [ ] Upload short video with weird MOV container that browser can't probe metadata → upload proceeds (fail-soft), backend ffprobe rejects with backend error if too long

Pipelines

- [ ] Trigger analysis run, watch SSE events arrive (analysis_progress, analysis_complete)
- [ ] Render export, watch progress, download result, play it
- [ ] Verify "Queued" UX when render starts and worker is busy (manually queue a 2nd render to test the queue-grow path)

Billing

- [ ] Subscribe to Creator tier with a real card, verify webhook fires, verify entitlements grant
- [ ] Cancel subscription mid-flight: verify cancellation feedback flow + retention offers

Assets

- [ ] Upload image + video assets, verify they render in asset library

Newsletter

- [ ] Submit landing newsletter form with a fresh email → confirmation email arrives within ~30 s
- [ ] Click confirm link → lands on /newsletter/confirmed?status=confirmed (English default; pt/es subscribers land on /pt/newsletter/confirmed?... or /es/newsletter/confirmed?... once the localized Astro routes are populated; default-locale paths are unprefixed per landing/astro.config.mjs)
- [ ] Click unsubscribe link in any received email → lands on /newsletter/unsubscribed?status=unsubscribed (same locale-prefix convention as confirm above)

SSE

- [ ] DevTools network: confirm only ONE EventSource at /api/v1/events/user-stream
- [ ] Force-disconnect SSE (DevTools offline throttle): polling fallback engages within ~5s, both notifications and assets caches invalidate; reconnect → polling stops

Mobile

- [ ] Log in on a real phone, do the wizard end-to-end, landscape review works

Health endpoints from monitoring tools

- [ ] /health + /health/ready both green

5.3 If smoke test fails

Trace via Logfire — find failing span, fix forward if possible
If unfixable in <30 min: rollback. From server: ./scripts/deployment/rollback.sh <previous-sha>
If schema is the issue and rollback.sh aborts: Neon time-travel restore to revert schema, then rollback the image

5.4 Announce

Once smoke test is green:

Bump api.sapari.io TTL back to Auto (no longer need fast-rollback window)
Announce on whichever channel (Twitter, mailing list, Discord, etc.) — drip, not blast (Postmark sender reputation)
Tail Logfire for the first hour — anomalies show up here first

Phase 6 — Post-Launch Tuning + Hardening

Do these in the days and weeks after launch. None block go-live; each compounds reliability.

6.1 Tuning observation (week 1)

Watch for signals that would change the resource-sizing decisions:

Neon scale-to-zero impact: Logfire db.* span p99 spikes correlated with idle gaps. If real signal shows up, set min CU=1 in Neon dashboard (~$76/mo always-on). Otherwise save it
Web worker memory pressure: Beszel web container memory under sustained 100+ concurrent users. If hitting 80%+ of 6g limit, bump WEB_MEMORY=8g in .env.production
Caddy CPU under login burst: TLS handshake spikes saturating 0.25 cpu. Bump CADDY_CPUS=0.5 if seen
Redis CPU under pubsub fan-out: 0.5 cpu sufficient at target load. Bump if Beszel shows sustained >70%
Render queue depth: visible signal that single-worker render is bottleneck. Either raise UX expectations ("est. 30 min wait") or add a second render worker container (vertical replica)
Postmark deliverability: log into dashboard, watch reputation score in week 1 — should climb from neutral
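The percentage thresholds above translate into absolute numbers that are easier to compare against a Beszel reading. A minimal sketch of the comparison, assuming the 6g container limit means 6144 MiB; the function name is illustrative.

```sh
# over_threshold USED_MIB LIMIT_MIB PCT
# Prints "yes" if USED_MIB is at or above PCT percent of LIMIT_MIB, else "no".
over_threshold() {
  if [ $(( $1 * 100 )) -ge $(( $2 * $3 )) ]; then
    echo yes
  else
    echo no
  fi
}

over_threshold 5000 6144 80   # prints yes (80% of 6144 MiB is about 4915 MiB)
over_threshold 4000 6144 80   # prints no
```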

6.2 Monitoring on production (week 1)

Add Beszel + Dozzle to production (already in docker-compose.prod.yml; just need agent KEY/TOKEN bootstrap from the hub UI)
Gate beszel-prod.sapari.io and dozzle-prod.sapari.io behind Cloudflare Access (reuse GitHub team policy)
Wire Discord webhook with prod-specific channel — alerts for CPU >80%, memory >80%, disk >90%, container restarts
Set up Logfire alert on API error rate >1%
Set up Logfire alert on task failure rate >5%
Add queue-depth observability — Logfire span or admin endpoint reporting per-broker RabbitMQ queue depth (from architectural-ceiling #5). Can be a simple /admin/queues endpoint reading rabbitmqctl list_queues

6.3 Origin firewall + Cloudflare orange-cloud cutover

Two changes that land together to lock origin traffic to Cloudflare's edge.

Origin firewall: scripts/deployment/sapari-docker-firewall.sh and its .service unit are installed by setup-server.sh. The unit locks port 443 in the Docker DOCKER-USER iptables chain to Cloudflare's published v4 + v6 egress lists (UFW alone doesn't cover Docker-NAT'd ports because Docker manages its own iptables rules that run before UFW's INPUT chain). It is a oneshot unit ordered After=docker.service; it fetches CF's IP lists at boot, scopes rules to the WAN interface, and re-applies idempotently. Verify with:

iptables -L DOCKER-USER -n -v --line-numbers
# Expect ~15 v4 + 7 v6 CF CIDR ACCEPTs above one final DROP for dpt:443
systemctl status sapari-docker-firewall
# Expect "Active: active (exited)" + "enabled"

Cloudflare orange-cloud cutover — once the origin firewall is verified active:

Flip api.sapari.io DNS from grey (proxy OFF) to orange (proxy ON). DNS will then resolve to CF anycast edge IPs (104.21.x.x / 172.67.x.x) instead of the origin. Required for the origin firewall to work without breaking browser → API traffic (which now arrives only via CF)
Disable HTTP/3 (QUIC) zone-wide via CF Speed → Optimization. Preventive: orange-cloud + HTTP/3 + SSE produces ERR_QUIC_PROTOCOL_ERROR. Verify: curl -sI https://api.sapari.io/health shows HTTP/2, no Alt-Svc: h3 header
Caddy SSE compression must stay off. caddy/Caddyfile deliberately omits encode gzip from the API + Dozzle blocks because gzip buffers 15-byte SSE keepalive frames until the buffer fills, never flushing. CF edge negotiates compression with the browser anyway, so removing Caddy-side compression is a no-op for bytes-on-wire and a fix for streaming. Same trap exists in nginx, Apache, any reverse proxy that batches before encoding

Verification: curl https://<hetzner-public-ip> from a non-CF source should hang and time out (DROP rule); curl https://api.sapari.io/health through CF should still succeed.
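
To confirm the flip took effect, the resolved address can be checked against Cloudflare's ranges. The sketch below assumes a two-range excerpt that covers the 104.21.x.x / 172.67.x.x addresses mentioned above; the full list is published by Cloudflare and changes over time, so treat the constant as illustrative only.

```python
import ipaddress

# Assumption: an excerpt of Cloudflare's published IPv4 ranges, not the full list.
CF_V4_EXCERPT = ("104.16.0.0/13", "172.64.0.0/13")


def is_cloudflare_ip(ip: str, cidrs=CF_V4_EXCERPT) -> bool:
    """Return True when `ip` falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


print(is_cloudflare_ip("104.21.5.9"))   # True
print(is_cloudflare_ip("203.0.113.7"))  # False (documentation range, stands in for an origin IP)
```

Feeding it the output of socket.gethostbyname("api.sapari.io") after the DNS change gives a quick yes/no on whether clients are being sent to the edge rather than the origin.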

6.4 Performance baseline (post traffic accumulation)¶

Need ~2-3 days of real traffic before meaningful:

CF Workers Analytics baseline: capture categorized 4xx/5xx rates from prod traffic
R2 load test: validates Worker edge-cache behavior under real load
Re-do performance audit with real Logfire span data: replaces static-code audit with measured p95/p99

6.5 Bundle audit (parallel work)¶

Frontend bundle audit: rollup-plugin-visualizer baseline + lazy splits. Chip away at the main chunk; ~3 days of work

6.6 Annual rotations¶

Calendar reminder: rotate MEDIA_TOKEN_SECRET annually. Procedure in media-token-rotation.md
Calendar reminder: rotate SECRET_KEY annually (forces all sessions to re-auth — schedule for low-traffic window)

Operational Gotchas — Lessons from Staging¶

Concentrated reference of every "this bit us last time." Skim once before Phase 1; come back to the relevant section if something goes sideways.

Hetzner & the host¶

Backups: enable at provisioning time, not after. ~20% surcharge for 7-day rolling snapshots covers Redis state, RabbitMQ queues, Caddy certs, .env. Toggle at order time; flipping later leaves a no-backup window
Memory limits should be conservative — prefer task failure over daemon crash. Render worker's 6 GB limit is intentional. A render that needs >6 GB fails with a clean FFmpegResourceError, refunds credit, notifies user. Generous limit + daemon swap = worse UX and operator nightmare
Zero swap is Hetzner default; 4 GB swap prevents swapless OOM under render spikes (Phase 2.4)
NTP load-bearing for: (1) Caddy DNS-01 ACME signature verification (fails opaquely if skew >5 min); (2) media-token JWT TTL verification (skew >5 s breaks playback)

Tailscale¶

Reuse the staging OAuth client — tagged tag:ci, works for both environments
The OAuth client is fragile — if deleted from Tailscale admin, all CI deploys break. Document its existence in your ops runbook
SSH_HOST in GitHub Secrets is the tailnet IP (100.x.y.z), not public IPv4

Caddy¶

F1 — bind-mount inode lock (NOT YET FIXED). git reset --hard in deploy.sh replaces caddy/Caddyfile via tempfile rename, changing inode. Running container's bind-mount points at old inode. docker compose restart caddy does NOT fix it; only docker compose -f docker-compose.prod.yml up -d --force-recreate caddy does. Burn into muscle memory: any Caddyfile change → --force-recreate caddy
DNS-01 over HTTP-01 deliberate — no port 80, no scrambling around HTTP-01 timing
Healthcheck endpoint is :2020, not :2019. :2019 is admin API (disabled). Match :2020 if you ever hand-write a probe
X-Forwarded-For must derive from cf-connecting-ip, not raw header — CF strips original. Caddyfile translates back so backend logs see real IPs (rate-limiting, fraud, audit logs depend on it)
First cert acquisition takes ~30-60 s. No cert after 2 min → check docker logs caddy for ACME errors (usually CLOUDFLARE_API_TOKEN permission or DNS not propagated)
Do NOT add encode gzip to the API block. SSE keepalive frames never fill the gzip buffer → connection hangs. CF edge handles compression with the browser end-to-end; Caddy-side compression is redundant AND breaks streaming
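
The gzip trap can be reproduced outside Caddy. A minimal sketch using Python's zlib as a stand-in for a proxy's gzip encoder; the keepalive frame is a hypothetical example, and Caddy's actual buffering thresholds are not modelled here.

```python
import zlib

KEEPALIVE = b": keepalive\n\n"  # hypothetical SSE comment frame

# wbits=31 selects the gzip container on both sides.
compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, 31)
decompressor = zlib.decompressobj(31)

# Without a flush, the encoder holds the small frame in its internal buffer.
held = compressor.compress(KEEPALIVE)
seen_by_client = decompressor.decompress(held)

# An explicit sync flush pushes the pending bytes onto the wire.
flushed = compressor.flush(zlib.Z_SYNC_FLUSH)
seen_after_flush = decompressor.decompress(flushed)

print(seen_by_client)    # b'' (nothing decodable has reached the client)
print(seen_after_flush)  # b': keepalive\n\n'
```

An encoder that never issues the sync flush leaves the client in the first state indefinitely, which is the hang described above.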

Cloudflare¶

Free Universal SSL covers depth-1 subdomains only. dozzle-staging.sapari.io works free; dozzle.staging.sapari.io (depth-2) requires Advanced Cert Manager (paid). Keep ops subdomains depth-1
Grey vs orange cloud is consequential. Pre-firewall, backend domains (api.*) MUST be grey (Caddy terminates TLS, double-proxy breaks the cert flow). Post-firewall + Phase 6.3, api.* flips to orange to lock origin to CF's edge
CF API token scope creep: scope to Zone:DNS:Edit on sapari.io zone only. Token scope can't be narrowed retroactively
Custom domain attachment for the Worker is in dashboard, NOT wrangler.toml. Declaring [[routes]] + Pages binding = error 10144
HTTP/3 + SSE behind orange-cloud → ERR_QUIC_PROTOCOL_ERROR. Disable HTTP/3 zone-wide before flipping api.* orange
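
The depth-1 rule for Universal SSL is easy to check before creating a DNS record. A small sketch; the helper name is hypothetical and the zone defaults to sapari.io.

```python
def is_depth_one(hostname: str, zone: str = "sapari.io") -> bool:
    """True when `hostname` is exactly one label below `zone`."""
    suffix = "." + zone
    if not hostname.endswith(suffix):
        return False
    prefix = hostname[: -len(suffix)]
    # One label means a non-empty prefix with no further dots.
    return prefix != "" and "." not in prefix


print(is_depth_one("dozzle-staging.sapari.io"))  # True
print(is_depth_one("dozzle.staging.sapari.io"))  # False
```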

R2 + Worker¶

Buckets must pre-exist before Worker deploys. Wrangler doesn't create buckets; binds to existing. Missing bucket = cryptic "no such binding" runtime error
MEDIA_TOKEN_SECRET byte-identity is non-negotiable. Backend HS256-signs, Worker HS256-verifies. One byte different → 100% playback 401s. Verification protocol in Phase 4.4
Worker route order is first-match-wins. /media/v1/* before /api/* (Phase 4.5)
wrangler deploy may report "No deploy targets" — normal. With workers_dev = false and no [[routes]], use npx wrangler versions deploy --env <env>
Versioning toggle is per-bucket, manual, in dashboard. Once a file is deleted in a non-versioned bucket, no recovery path
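
Why one byte matters for MEDIA_TOKEN_SECRET: HS256 is an HMAC over the token header and payload, so any difference in the key changes the whole signature. The sketch below uses only the standard library and placeholder values; it is not the backend's or the Worker's actual code, and it skips expiry checks.

```python
import base64
import hashlib
import hmac
import json


def _b64url(raw: bytes) -> str:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def verify_hs256(token: str, secret: bytes) -> bool:
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(_b64url(expected), signature)


backend_secret = b"example-secret"   # placeholder value
worker_secret = b"example-secret\n"  # same value plus a pasted trailing newline
token = sign_hs256({"sub": "clip-123"}, backend_secret)
print(verify_hs256(token, backend_secret))  # True
print(verify_hs256(token, worker_secret))   # False
```

The trailing-newline case is the one most likely to slip through copy and paste between a terminal and a dashboard secret field.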

Backend env vars¶

Production security validator hard-fails on: weak SECRET_KEY (<32 chars), POSTGRES_PASSWORD=postgres, CREATE_TABLES_ON_STARTUP=true. Soft-warns on: Redis without password, CORS *, DEBUG=true, docs in prod, weak admin creds, sessions >120 min
OAUTH_REDIRECT_BASE_URL: root only, no path, no trailing slash. Code appends /api/v1/auth/oauth/callback/<provider>. Adding a path doubles it; trailing slash breaks some providers' callback registration
MEDIA_PROXY_BASE_URL: must match user-facing domain (https://app.sapari.io), not backend domain. Wrong value → clips try to play from api.sapari.io (no Worker route there) and 404
CORS_ORIGINS must include the landing origin. Newsletter signup form on sapari.io POSTs to api.sapari.io/api/v1/newsletter/subscribe; if sapari.io isn't in CORS_ORIGINS, the browser blocks the request and the form renders "Misconfigured"
Four Redis DBs share one container by design (cache=0, rate-limiter=1, sessions=2, taskiq result=3). Don't consolidate
RabbitMQ user MUST NOT be guest/guest. Override via RABBITMQ_DEFAULT_USER/RABBITMQ_DEFAULT_PASS in compose env, plus TASKIQ_RABBITMQ_USER/_PASSWORD in .env. Generate with openssl rand -hex 32
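
The OAUTH_REDIRECT_BASE_URL rule can be checked mechanically before deploy. A minimal sketch, assuming the callback path quoted above; the function name and the https-only check are illustrative additions, not the backend's validator.

```python
from urllib.parse import urlsplit

CALLBACK_PATH = "/api/v1/auth/oauth/callback/{provider}"  # path quoted in the note above


def build_callback_url(base_url: str, provider: str) -> str:
    """Reject anything but a bare origin, then append the callback path."""
    parts = urlsplit(base_url)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError("base URL must be an absolute https:// origin")
    if parts.path or parts.query or parts.fragment:
        # A trailing slash shows up here as path == "/".
        raise ValueError("base URL must be the root origin: no path, no trailing slash")
    return base_url + CALLBACK_PATH.format(provider=provider)


print(build_callback_url("https://app.sapari.io", "google"))
# https://app.sapari.io/api/v1/auth/oauth/callback/google
```

Passing https://app.sapari.io/ or https://app.sapari.io/api raises ValueError instead of producing a doubled or malformed callback URL.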

Database & migrations¶

Neon Launch tier is required for prod, not Free
Direct endpoint over pooled — convention #16. Pooled endpoint uses PgBouncer in transaction mode, breaks asyncpg's prepared-statement cache (3-4× round-trips per query). Take direct, accept rare connection blip
Two-deploy rule for destructive migrations. Drop column? Rename? Ship the code change first (code that reads neither the old column nor the new one), let it soak, then ship the migration in a second deploy. Doing both in one deploy means a rollback to the prior image lands on a schema it is incompatible with, and you are stuck
CONFIRM_PRODUCTION_MIGRATION=yes is the env.py prod gate. Deploy scripts pass it automatically. Manual migrations on the server need it set explicitly
Alembic head baked into image labels. rollback.sh reads LABEL sapari.alembic_head from target image, compares against live DB. Mismatch → abort
Neon time-travel = 6-hour restore window (free tier). Branch → Restore → pick a timestamp. Applies in ~30 s
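
The Alembic-head guard amounts to one label lookup and one comparison. The sketch below assumes the label is read from docker inspect JSON output; how rollback.sh actually reads it is not shown in this runbook, and the revision ids are made up.

```python
import json

LABEL = "sapari.alembic_head"  # label name quoted in the note above


def head_from_inspect(inspect_json: str) -> str:
    """Extract the Alembic head label from `docker inspect <image>` JSON output."""
    labels = json.loads(inspect_json)[0]["Config"]["Labels"] or {}
    return labels[LABEL]


def rollback_allowed(image_head: str, db_head: str) -> bool:
    """A rollback is only safe when the target image expects the live schema."""
    return image_head == db_head


# Hypothetical inspect output and revision ids.
sample = json.dumps([{"Config": {"Labels": {LABEL: "a1b2c3d4e5f6"}}}])
print(rollback_allowed(head_from_inspect(sample), "a1b2c3d4e5f6"))  # True
print(rollback_allowed(head_from_inspect(sample), "ffff00001111"))  # False
```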

TaskIQ + RabbitMQ¶

rabbitmq_delayed_message_exchange plugin is REQUIRED. Without it, SmartRetryMiddleware's exponential backoff is silently ignored — failed downloads retry immediately, hit same error, get dropped. Plugin enabled via rabbitmq/Dockerfile
Broker (RabbitMQ) and result backend (Redis) deliberately separate. Redis crash → results lost, queue survives, tasks retry. Consolidated would lose both on Redis incident
Priority queue mapping: viral=3, creator=2, hobby=1, free=0. Set via .kicker().with_labels(priority=N).kiq(...). New tier means updating both code enum AND queue priority levels in compose
No eager tasks import in module __init__.py (Convention #17). Causes a circular import via infrastructure.taskiq → workers/shared/context → modules.email.service → modules.email.__init__ → tasks → infrastructure.taskiq mid-load. Workers crash-loop silently under taskiq's process manager (containers report Up, RestartCount=0, but no work progresses). CI grep enforces
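
The tier-to-priority mapping above can be kept in one place so an unmapped tier fails loudly instead of silently queueing at a default. A sketch with hypothetical names; the real enum lives in the codebase and the queue priority levels in compose still need a matching update.

```python
TIER_PRIORITY = {"viral": 3, "creator": 2, "hobby": 1, "free": 0}


def priority_for(tier: str) -> int:
    """Return the queue priority for a tier, refusing tiers that were never mapped."""
    try:
        return TIER_PRIORITY[tier]
    except KeyError:
        raise ValueError(
            f"tier {tier!r} has no queue priority; update the mapping and the compose queue config"
        ) from None


print(priority_for("creator"))  # 2
print(sorted(TIER_PRIORITY, key=priority_for, reverse=True))
# ['viral', 'creator', 'hobby', 'free']
```

The returned value is what would be passed as N in .kicker().with_labels(priority=N).kiq(...).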

Stripe¶

Test-mode keys on prod = silent failure. Live-mode keys on staging = real charges
Webhook signing secret per-endpoint, not per-account. Rotating endpoint generates new secret; old stops working
Webhook signing secret mismatch is silent. Backend returns 401, Stripe retries a few times then gives up. User paid; sees no entitlements. Symptom appears 5-30 min post-charge
Idempotency keys on Stripe API write calls — backend uses these. Add new write paths with idempotency keys
Tier 3 + Tier 4 products are auto-seeded on first deploy. Don't pre-create
_to_dict() boundary helper in webhooks.py converts Stripe's StripeObject to a plain dict at the webhook entry point. Required because StripeObject is dict-like but not a dict — downstream type-checked code (FastCRUD, Pydantic models) raises on direct pass-through
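
A boundary conversion like _to_dict() can be sketched as a recursive walk over mappings and lists. This is an illustration only: FakeStripeObject is a stand-in rather than the SDK's StripeObject, and whether the real helper recurses this way is not stated in this runbook.

```python
from collections.abc import Mapping
from typing import Any


def to_plain(value: Any) -> Any:
    """Recursively convert dict-like objects and sequences to plain dicts and lists."""
    if isinstance(value, Mapping):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


class FakeStripeObject(Mapping):
    """Stand-in for a dict-like SDK object; not the real StripeObject."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


event = FakeStripeObject(
    {"type": "invoice.paid", "data": FakeStripeObject({"lines": [FakeStripeObject({"amount": 500})]})}
)
plain = to_plain(event)
print(type(plain) is dict, plain["data"]["lines"][0]["amount"])  # True 500
```

Converting once at the webhook entry point keeps every downstream function free to assume plain dicts.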

Postmark¶

DKIM + SPF + Return-Path: all three or none. Verify all green before deploying
Sender reputation builds slowly. Don't blast 1000+ users on day 1 — drip
Separate Postmark server token per environment. Sharing muddles deliverability stream
Email broker has no SmartRetry. A Postmark / RabbitMQ outage during POST /newsletter/subscribe leaves the subscriber row in PENDING. The _recover_pending_newsletter_subscribers cron sweep is the recovery path (re-queues confirmation emails for rows >30 min old, max 3 attempts)
CAN-SPAM postal address + entity name in base.html + base.txt footers are legally required. The 5-case test_template_footer.py pins the copy
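
The selection rule of the pending-subscriber sweep (PENDING, older than 30 minutes, fewer than 3 attempts) fits in a few lines. A sketch with a hypothetical Subscriber shape; the real model, field names, and query live in the backend.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(minutes=30)
MAX_ATTEMPTS = 3


@dataclass
class Subscriber:  # hypothetical shape, for illustration only
    email: str
    status: str
    created_at: datetime
    attempts: int


def due_for_retry(rows, now):
    """Rows still PENDING, older than MIN_AGE, with attempts left."""
    return [
        row for row in rows
        if row.status == "PENDING"
        and now - row.created_at > MIN_AGE
        and row.attempts < MAX_ATTEMPTS
    ]


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    Subscriber("a@example.com", "PENDING", now - timedelta(minutes=45), 1),  # due
    Subscriber("b@example.com", "PENDING", now - timedelta(minutes=10), 0),  # too fresh
    Subscriber("c@example.com", "PENDING", now - timedelta(hours=2), 3),     # attempts exhausted
    Subscriber("d@example.com", "CONFIRMED", now - timedelta(hours=2), 0),   # not pending
]
print([row.email for row in due_for_retry(rows, now)])  # ['a@example.com']
```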

Observability¶

SQLAlchemy instrumentation: ON. Redis: OFF. SQLAlchemy spans solve "why is this endpoint slow" (high signal). Redis ops sub-millisecond, uniform (high volume, low signal). Per-worker kill-switches exist
Span taxonomy = 6 categories: pipeline.parent / step.* / taskiq.* / service.* / ext.* / cron. Hand-named spans break dashboard filters
Per-worker service.name distinguishes workers in queries. Web=sapari-api; workers=sapari-{email,analysis,render,download,proxy,asset-edit}; scheduler=sapari-scheduler
Beszel shoutrrr Discord URL format — token first, then ID, in discord:// URI form
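
The Discord URL reordering is a common copy-paste mistake, so a converter helps. A minimal sketch, assuming the standard Discord webhook URL shape (/api/webhooks/<id>/<token>); the function name and sample values are hypothetical.

```python
from urllib.parse import urlsplit


def discord_webhook_to_shoutrrr(webhook_url: str) -> str:
    """Turn a Discord webhook URL into the discord://<token>@<id> form."""
    segments = urlsplit(webhook_url).path.strip("/").split("/")
    # Expected path: api/webhooks/<id>/<token>
    if len(segments) != 4 or segments[:2] != ["api", "webhooks"]:
        raise ValueError("not a Discord webhook URL")
    webhook_id, token = segments[2], segments[3]
    return f"discord://{token}@{webhook_id}"


print(discord_webhook_to_shoutrrr("https://discord.com/api/webhooks/1234567890/abcDEF-token"))
# discord://abcDEF-token@1234567890
```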

CI/CD¶

Image tag strategy: floating production tag for normal deploys, SHA tag for rollbacks. Always pin SHA in rollback.yml inputs
Public GHCR images today; flip to private requires docker login
Migrations run BEFORE container restart in deploy.sh. Migration failure aborts deploy; old code keeps running on old schema. Don't reorder
git clone on the server, not rsync-only. Operators can SSH in, edit docker-compose.prod.yml, restart without CI. When CI is down, the server's git repo is your unblock

DNS + cutover¶

Lower TTL on api.sapari.io to 60s before cutover (Phase 2.2). Bump to Auto in Phase 6 once stable
CF Pages CNAME flattening is automatic at apex. sapari.io shows as A record, not CNAME. CF feature, not bug
Brief downtime per deploy is accepted. No blue/green, no Swarm. Plain docker compose up -d → ~10 s API restart → frontend shows maintenance screen → React Query retries on resume

Rollback Plan¶

Three layers.

Code rollback (most common):

ssh deploy@<hetzner-ip>
cd ~/sapari
./scripts/deployment/rollback.sh <previous-sha>

Aborts if the target image's Alembic head differs from the live DB. Pass --ignore-migration-warning if the schema is backwards-compatible.

Schema rollback (Neon time travel):

Free/Launch tier: 6-hour restore window
Neon dashboard → Branches → Restore to a point in time; applies in ~30 seconds
Use this if alembic downgrade -1 isn't safe

DNS-level rollback (last resort):

TTL was 60s during cutover; point the record back at the old IP via CF DNS edit
For "the new server is fundamentally broken". Almost never the right answer if Caddy/health checks are green but app behavior is wrong

Environment Variable Reference¶

Full .env template for production. Bold = enforced by security validator (app won't start without it set correctly).

# === App ===
ENVIRONMENT=production
DEBUG=false
SECRET_KEY=<openssl rand -base64 64>             # 32+ chars
FRONTEND_URL=https://app.sapari.io
API_PUBLIC_URL=https://api.sapari.io             # newsletter email link base
LANDING_URL=https://sapari.io                    # newsletter redirect target
CONTACT_EMAIL=hello@sapari.io
LOG_LEVEL=INFO

# === Image registry ===
GHCR_OWNER=benavlabs
IMAGE_TAG=production

# === Database (Neon Launch tier, direct endpoint) ===
DATABASE_URL=postgresql+asyncpg://<user>:<password>@<direct-endpoint>/<db>?ssl=require
# Note: use ?ssl=require — NOT ?sslmode=require (psycopg2 syntax). Drop any
# &channel_binding=require Neon's UI may copy in — asyncpg auto-negotiates SCRAM.
CREATE_TABLES_ON_STARTUP=false
POSTGRES_POOL_SIZE=20                            # per-process
POSTGRES_MAX_OVERFLOW=10                         # per-process

# === Redis ===
CACHE_BACKEND=redis
CACHE_REDIS_HOST=redis
CACHE_REDIS_PASSWORD=<random>
RATE_LIMITER_BACKEND=redis
RATE_LIMITER_REDIS_HOST=redis
RATE_LIMITER_REDIS_DB=1
RATE_LIMITER_REDIS_PASSWORD=<random>
SESSION_BACKEND=redis
SESSION_REDIS_HOST=redis
SESSION_REDIS_DB=2
SESSION_REDIS_PASSWORD=<random>
SESSION_SECURE_COOKIES=true
SESSION_TIMEOUT_MINUTES=480                      # above the validator's 120 min soft-warn threshold

# === RabbitMQ ===
TASKIQ_BROKER_TYPE=rabbitmq
TASKIQ_RABBITMQ_HOST=rabbitmq
TASKIQ_RABBITMQ_USER=sapari                      # NOT guest
TASKIQ_RABBITMQ_PASSWORD=<random>                # NOT guest
TASKIQ_REDIS_HOST=redis
TASKIQ_REDIS_DB=3

# === Storage (R2) ===
STORAGE_ENDPOINT=https://<account-id>.r2.cloudflarestorage.com
STORAGE_ACCESS_KEY_ID=<from R2 dashboard>
STORAGE_SECRET_ACCESS_KEY=<from R2 dashboard>
STORAGE_BUCKET_RAW=sapari-raw
STORAGE_BUCKET_EXPORTS=sapari-exports
STORAGE_BUCKET_ASSETS=sapari-assets
STORAGE_MAX_UPLOAD_SIZE_MB=10240                 # 10 GB

# === Video duration cap ===
MAX_VIDEO_DURATION_MINUTES=90                    # aligns with render timeout

# === Media proxy (CF Worker) ===
MEDIA_TOKEN_SECRET=<openssl rand -base64 32>     # MUST match Worker secret byte-for-byte
MEDIA_TOKEN_KID=v1
MEDIA_TOKEN_TTL_SECONDS=300
MEDIA_PROXY_BASE_URL=https://app.sapari.io

# === CORS ===
# Must include landing origin or newsletter signup POST is CORS-blocked.
CORS_ORIGINS=https://app.sapari.io,https://sapari.io
CORS_ALLOW_CREDENTIALS=true

# === Caddy / TLS ===
CLOUDFLARE_API_TOKEN=<scoped Zone:DNS:Edit on sapari.io>

# === OAuth ===
OAUTH_REDIRECT_BASE_URL=https://app.sapari.io
OAUTH_GOOGLE_CLIENT_ID=<from Google Cloud Console>
OAUTH_GOOGLE_CLIENT_SECRET=<from Google Cloud Console>
OAUTH_GITHUB_CLIENT_ID=<from GitHub OAuth App>
OAUTH_GITHUB_CLIENT_SECRET=<from GitHub OAuth App>

# === Stripe (live mode) ===
STRIPE_SECRET_KEY=sk_live_...
STRIPE_PUBLISHABLE_KEY=pk_live_...
STRIPE_WEBHOOK_SECRET=whsec_...
STRIPE_TEST_MODE=false

# === Email (Postmark) ===
POSTMARK_SERVER_TOKEN=<from Postmark>
EMAIL_SENDER_ADDRESS=hello@sapari.io
EMAIL_SENDER_NAME=Vitoria from Sapari
EMAIL_TEST_MODE=false

# === AI ===
OPENAI_API_KEY=sk-...
DEEPSEEK_API_KEY=...

# === Admin ===
ADMIN_USERNAME=<not 'admin'>
ADMIN_EMAIL=<address on sapari.io>
ADMIN_PASSWORD=<12+ chars, strong>
ADMIN_EMAIL_DOMAIN=sapari.io

# === Observability ===
LOGFIRE_TOKEN=<from logfire.pydantic.dev>
LOGFIRE_ENVIRONMENT=production
LOGFIRE_SERVICE_NAME=sapari-api

# === Feature flags / safety ===
PRODUCTION_SECURITY_VALIDATION_ENABLED=true
ENABLE_DOCS_IN_PRODUCTION=false

# === Resource limits (parameterized in docker-compose.prod.yml) ===
# Defaults in compose match staging (4 vCPU / 16 GB). These overrides are for prod (CPX62).

WEB_WORKERS=4
WEB_MEMORY=6g
WEB_CPUS=4.0
WEB_MEMORY_RESERVATION=1g

RENDER_MEMORY=6g
RENDER_CPUS=4.0
RENDER_FFMPEG_THREADS=4
RENDER_MEMORY_RESERVATION=2g

PROXY_MEMORY=3g
PROXY_CPUS=3.0
PROXY_FFMPEG_THREADS=3

DOWNLOAD_MEMORY=2g
DOWNLOAD_CPUS=2.0

ANALYSIS_MEMORY=2g
ANALYSIS_CPUS=1.5
ANALYSIS_TASKIQ_CONCURRENCY=4

REDIS_MAXMEMORY=500mb
REDIS_MEMORY=768m
REDIS_CPUS=0.5

RABBITMQ_MEMORY=1g

CADDY_MEMORY=256m
CADDY_CPUS=0.25
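
The DATABASE_URL note in the template (ssl=require instead of sslmode=require, no channel_binding) can be applied mechanically to a string copied from Neon's UI. A sketch using only the standard library; the function name is hypothetical and the sample URL is a placeholder.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_asyncpg_url(url: str) -> str:
    """Rewrite psycopg2-style query params into the form the template expects."""
    parts = urlsplit(url)
    params = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "channel_binding":
            continue  # dropped, per the note in the template
        if key == "sslmode":
            key = "ssl"  # asyncpg spelling
        params.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(params)))


copied = "postgresql+asyncpg://user:pw@host.example/db?sslmode=require&channel_binding=require"
print(normalize_asyncpg_url(copied))
# postgresql+asyncpg://user:pw@host.example/db?ssl=require
```

Only the query string is rebuilt; the user, password, host, and database parts pass through untouched.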