
Architecture Decisions

Load-bearing choices that shape the current system. Captured so future changes have context, not because the decisions are sacred — revisit them when the constraints change.


Two Neon projects, not two branches

Staging and production run in two separate Neon projects, not two branches of one project.

Neon's free tier gives 0.5 GB storage per project, shared across branches. Two branches in one project would split that 0.5 GB between staging and production — half the runway on each. Two projects get 0.5 GB each, and they're physically isolated: no chance of a staging experiment leaking into production.

Revisit when: we outgrow the free tier and pay for one Neon plan anyway.


Direct Postgres endpoint, not the pooler

DATABASE_URL points at Neon's direct endpoint (ep-*.c-*.<region>.aws.neon.tech), not the pooled endpoint (ep-*-pooler...).

Neon's pooler is PgBouncer in transaction mode, which breaks asyncpg's prepared-statement cache. The workaround (setting statement_cache_size=0) forces 3-4× the round trips per query — measured at ~250ms vs ~65ms on a representative query. We already have SQLAlchemy pooling (POSTGRES_POOL_SIZE=50 + POSTGRES_MAX_OVERFLOW=20 = 70-connection ceiling per API instance), so PgBouncer in front is redundant.
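A minimal sketch of how this decision could be enforced at startup, using only the standard library. The function names and the example hostnames in the usage note are hypothetical; the only facts taken from the text above are the "-pooler" suffix on the pooled endpoint and the two pool settings.

```python
from urllib.parse import urlsplit

# Pool settings quoted in the text above.
POSTGRES_POOL_SIZE = 50
POSTGRES_MAX_OVERFLOW = 20


def is_pooled_endpoint(database_url: str) -> bool:
    """Return True if the URL's host looks like a Neon pooled endpoint (ep-*-pooler...)."""
    host = urlsplit(database_url).hostname or ""
    # The endpoint ID is the first DNS label; pooled hosts append "-pooler" to it.
    endpoint_id = host.split(".", 1)[0]
    return endpoint_id.endswith("-pooler")


def connection_ceiling(pool_size: int, max_overflow: int) -> int:
    """Maximum simultaneous connections one SQLAlchemy pool can hold open."""
    return pool_size + max_overflow


def check_database_url(database_url: str) -> None:
    """Hypothetical startup guard: refuse to boot against the pooled endpoint."""
    if is_pooled_endpoint(database_url):
        raise ValueError("DATABASE_URL points at the Neon pooler; use the direct endpoint")
```

With a placeholder host such as ep-example-123456.c-2.us-west-2.aws.neon.tech the guard passes; with ep-example-123456-pooler.c-2.us-west-2.aws.neon.tech it raises. connection_ceiling(POSTGRES_POOL_SIZE, POSTGRES_MAX_OVERFLOW) gives the 70-connection ceiling per API instance.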

Revisit when: we scale horizontally beyond one persistent server and would benefit from a shared connection broker.


Match DB region to server region

Hetzner Hillsboro → Neon us-west-2. Hetzner Ashburn → Neon us-east-1.

Every query is one network round trip. Cross-region adds ~50-65ms per query. A single request running 4-6 sequential queries pays 200-400ms in pure DB latency before anything useful happens. Same region is sub-10ms.
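The arithmetic behind those figures, as a small sketch. It assumes the queries run strictly one after another and that each costs exactly one round trip; the function name is illustrative.

```python
def db_latency_ms(sequential_queries: int, round_trip_ms: float) -> float:
    """Total time spent waiting on the network for queries issued one after another."""
    return sequential_queries * round_trip_ms


# Cross-region bounds from the text: 4 queries at ~50ms up to 6 queries at ~65ms.
cross_region_low = db_latency_ms(4, 50)   # 200
cross_region_high = db_latency_ms(6, 65)  # 390

# Same-region upper bound, taking 10ms as the ceiling of "sub-10ms".
same_region_high = db_latency_ms(6, 10)   # 60
```

The same 4-6 queries that cost roughly 200-400ms across regions stay under about 60ms when the database sits in the server's region.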

This is a provision-time decision. Changing the DB region means creating a new Neon project and migrating (or recreating) data.


Single Docker image for all backend containers

The API server, 5 taskiq workers, and the scheduler all run from ghcr.io/benavlabs/sapari-backend:<sha>. They differ only in the command (fastapi run vs taskiq worker analysis_broker etc.).

Tradeoff: code changes affect every container on redeploy, so rolling out just a worker fix restarts the API too. Accepted because it keeps build time low, deploy scripts simple, and avoids a registry with 7 near-identical images.

Revisit when: we need to deploy worker changes without touching the API (e.g., high-frequency worker-only hotfixes become common).


Cloudflare Worker as a same-origin API proxy

Frontend at staging.sapari.io / app.sapari.io. Backend at api-staging.sapari.io / api.sapari.io. The Worker at the frontend origin proxies /api/* to the backend.

Keeps requests same-origin, which lets us skip CORS configuration entirely and keeps the auth cookie SameSite=strict in production. The cost is ~5ms per request (one extra edge hop), which buys a tighter cookie policy and less configuration to maintain.

Revisit when: we have a reason to serve the API directly from the frontend's origin (would require more invasive backend changes — probably never).


SSH to servers is tailnet-only

Port 22 is firewalled to the Tailscale subnet (100.64.0.0/10). No public SSH.
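To illustrate which source addresses that rule admits, here is a short standard-library sketch. The function name is hypothetical, and it only models the CIDR match, not the firewall itself.

```python
import ipaddress

# The Tailscale address range named in the firewall rule above.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def ssh_allowed(source_ip: str) -> bool:
    """Return True if source_ip falls inside the range allowed to reach port 22."""
    return ipaddress.ip_address(source_ip) in TAILNET
```

The /10 covers 100.64.0.0 through 100.127.255.255, so an address like 100.101.102.103 matches while 100.128.0.1 or any public address does not.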

Eliminates the "bot brute-forcing your SSH port" threat entirely. CD workflows use the Tailscale GitHub Action to join the tailnet as ephemeral tag:ci nodes before connecting.

Cost: ~20s added to each deploy for tailnet connect, plus the Tailscale admin console complexity (ACL, OAuth client, tags). Worth it. The "port 22 open to the internet with key auth" posture is fine in theory but burns reputation against scrapers for no benefit.


Production deploys are manual, not auto-reviewed

deploy.yml auto-deploys staging on workflow_run after a successful build. deploy-production.yml is workflow_dispatch only, with a typed-YES confirm input.

GitHub's required-reviewer environment gate requires the Team/Pro plan, which we don't have on the private repo. workflow_dispatch + typed confirm is the free-tier substitute — a human has to deliberately run the workflow and type YES for the deploy to proceed. Equivalent intent to a reviewer click.

Revisit when: we upgrade to GitHub Team, at which point we can swap to proper reviewer gates.


Manual first deploy before wiring CD

We cut production over to the new infrastructure by hand the first time, not via CD.

The first deploy of a new environment surfaces environment bugs (missing secrets, wrong firewall rules, DNS timing, certificate issues). Doing it by hand isolates those from CD plumbing bugs — if the SSH+FFmpeg+Caddy+Neon+Stripe chain works once manually, automating it is mechanical. Debugging both at once is miserable.

CD automation goes live after the first successful manual deploy proves the architecture.


Accept brief downtime per deploy

docker compose up -d is stop-then-start per service, not a rolling restart. Each changed container has a few seconds of downtime on deploy.

Tolerated because: the frontend shows a maintenance screen when the backend is down, the API restart takes ~2-3s, TaskIQ workers run with --ack-type when_executed so interrupted tasks are redelivered, and React Query retries automatically. Single-digit-user scale doesn't warrant Docker Swarm or blue-green.

Revisit when: traffic makes a few seconds of downtime visibly bad.


Git clone on the server, not rsync from CI

Each server has a full git clone at /home/deploy/sapari. CD does git pull + docker compose up -d. The compose file lives in the repo, not shipped separately.
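A rough sketch of that deploy step as a Python helper, assuming it runs on the server itself. The function names and the --ff-only flag are assumptions added for illustration; the repo path and the docker-compose.prod.yml filename come from this page, and the real CD workflow may do more than this.

```python
import subprocess
from pathlib import Path

REPO_DIR = Path("/home/deploy/sapari")
COMPOSE_FILE = "docker-compose.prod.yml"


def deploy_commands(repo_dir: Path = REPO_DIR, compose_file: str = COMPOSE_FILE) -> list[list[str]]:
    """Build the two commands: update the clone, then (re)start changed services."""
    return [
        # --ff-only is an assumption: it refuses to merge if the clone has diverged.
        ["git", "-C", str(repo_dir), "pull", "--ff-only"],
        ["docker", "compose", "-f", compose_file, "up", "-d"],
    ]


def deploy(repo_dir: Path = REPO_DIR, compose_file: str = COMPOSE_FILE) -> None:
    """Run each command from the repo directory, stopping at the first failure."""
    for argv in deploy_commands(repo_dir, compose_file):
        subprocess.run(argv, cwd=repo_dir, check=True)
```

Keeping command construction separate from execution makes the sequence easy to inspect or log before anything runs.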

Gives us emergency-edit capability — SSH in, change docker-compose.prod.yml, restart — when CI is down or something needs immediate triage. Also keeps the compose file version-controlled with the code it orchestrates.

Cost: ~50MB on disk, negligible.