Skip to content

Monitoring

This page covers observability for Sapari in production.

Logging

We use structured logging via Logfire (Pydantic's observability platform). Logs include:

  • Request/response for API calls
  • Task execution for workers
  • Error tracebacks with context
from src.infrastructure.logging import get_logger

logger = get_logger()
logger.info("Processing clip", clip_uuid=str(clip.uuid), status=clip.status)

Metrics

Key metrics to monitor:

Metric What to Watch
API latency (p99) Should be < 500ms for most endpoints
RabbitMQ queue depth Growing queue = workers can't keep up (check per priority level)
Task duration Transcription ~1-2min per minute of video
Error rate Spike = something broken
Storage usage R2 bucket sizes

Alerts

Set up alerts for:

  • API error rate > 1%
  • Task failure rate > 5%
  • RabbitMQ queue depth > 100 for any broker (check via management API at :15672)
  • RabbitMQ consumer count drops to 0 for any queue
  • Database connection pool exhaustion (most common cause: a streaming endpoint holding a DB-bearing dependency for the response lifetime — see backend.md Convention #16)
  • Redis memory > 80% (cache + sessions + result backend)

Tracing

Logfire provides distributed tracing across API requests and background tasks. Each request/task gets a trace ID that links:

  • API handler execution
  • Database queries
  • External API calls (Whisper, LLM)
  • Task queue operations

Per-worker service names

Each process reports a distinct service.name so Logfire UI can filter spans by worker role. The FastAPI web process is sapari-api; each TaskIQ worker gets its own name driven by startup_taskiq_worker(state, service_name):

Process service.name Configured in
FastAPI web sapari-api backend/.envLOGFIRE_SERVICE_NAME
Email worker sapari-email infrastructure/taskiq/app.py:80
Analysis worker sapari-analysis infrastructure/taskiq/app.py:81
Render worker sapari-render infrastructure/taskiq/app.py:82
Download worker sapari-download infrastructure/taskiq/app.py:83
Proxy worker sapari-proxy infrastructure/taskiq/app.py:84
Asset-edit worker sapari-asset-edit infrastructure/taskiq/app.py:85
Scheduler sapari-scheduler infrastructure/taskiq/scheduler.py

Add service.name != 'sapari-api' to any Logfire query to scope to worker traffic. Filter to a specific worker via service.name = 'sapari-render' for per-queue latency investigations.

Span taxonomy (six categories)

Manual spans follow a six-category naming convention so Logfire filters, saved queries, and ad-hoc grouping stay legible. Every new manual span must fit one of these:

Category Pattern Example When to use
Pipeline parent <worker>.pipeline analysis.pipeline, render.pipeline Wraps the full DAG invocation inside a task body. Has project_uuid / export_uuid correlation keys.
Pipeline step step.<step_id> (auto) step.download_clips, step.apply_cuts Emitted automatically by fastroai.LogfireTracer — don't hand-write.
Task taskiq.<task_name> (auto) taskiq.analyze_project, taskiq.cleanup_expired_exports Emitted automatically by LogfireSpanMiddleware — don't hand-write. Cron tasks carry task_type="cron" attribute (see Convention #15).
Service-layer <module>.<method> credits.reserve, entitlement.grant, payment.handle_webhook Wrap business operations that span multiple CRUD calls, so per-operation latency is queryable distinct from any single SQL query.
External call ext.<service>.<op> ext.whisper.transcribe, ext.ffmpeg.render, ext.ytdlp.download, ext.stripe.modify_subscription Wrap every outbound call to a non-owned service so its p99 can be isolated from surrounding app code.
Cron uses taskiq.<task_name> with task_type="cron" attr Saved query: filter task_type = 'cron' to isolate scheduled runs from on-demand queue traffic.
SSE sse.<event> sse.publish, sse.clip_ready Wrap publish paths so event-emission latency is distinct from the originating task.

Attribute conventions. Every ext.* span should carry a domain correlation key at span entry (clip_file_uuid, asset_file_uuid, user_asset_uuid, subscription_id, etc.) so a slow call can be traced back to the originating resource without reading surrounding code. <module>.<method> spans carry the operation's primary identifier (user_id, event_type, project_uuid) plus a post-attr via span.set_attribute() that captures the outcome (reserved=true/false, entitlements_drained=N, payments_materialized=N).

Why manual spans over pure auto-instrumentation. Logfire auto-instruments SQLAlchemy, Redis, Pydantic AI, and FastAPI. Those are always on and produce the "zoom in" view. Manual spans produce the "zoom out" view — the thing a dashboard wants ("p99 of credits.reserve") rather than the raw SQL or Redis calls that composed it. Auto-instrumentation is necessary but not sufficient for app-level SLO work.

Auto-instrumentor defaults

Not every Logfire auto-instrumentor earns its keep. Current defaults:

Auto-instrumentor Web (sapari-api) Workers (sapari-*) Env var Rationale
instrument_pydantic_ai ✅ ON ✅ ON LOGFIRE_INSTRUMENT_PYDANTIC_AI Highest-value auto-signal. LLM calls are the biggest unit cost in the app — every invocation needs token counts, cost, model, prompt hash for debugging + billing reconciliation. Keep on unconditionally.
instrument_sqlalchemy ✅ ON ✅ ON LOGFIRE_INSTRUMENT_SQLALCHEMY / LOGFIRE_INSTRUMENT_SQLALCHEMY_WORKERS Per-query spans let you debug "why is this endpoint p99 slow" without adding manual spans during the incident. Neon's query insights cover aggregate patterns, but don't correlate back to user-visible request latency. Kept on despite span-volume cost because the debugging value wins during incidents.
instrument_redis ❌ OFF ❌ OFF LOGFIRE_INSTRUMENT_REDIS / LOGFIRE_INSTRUMENT_REDIS_WORKERS Redis ops in Sapari are uniformly sub-millisecond (cache, session, rate-limit, pub/sub). Real-world problems surface at the request-span level before they're visible at the Redis-op level. Span volume dominates (Redis is called per request × multiple ops) without proportional signal. Turn on temporarily via env var for Redis-specific investigations.
instrument_fastapi ✅ ON N/A (always on when app is instrumented) One span per HTTP request is the backbone — this is what every other span hangs off.

Rationale for keeping SQLAlchemy despite the cost: when staging hit an OOM + 100% CPU incident (2026-04-23), the time-to-diagnose was dominated by not having per-query visibility into what the hot worker was doing. Neon showed nothing unusual (the bug was in a pubsub spin loop, not a query). SQL spans wouldn't have caught that one specifically, but the adjacent question "is a runaway query involved" is the first thing you ask when CPU pins, and having SQL spans present answers it in 30 seconds instead of 30 minutes. That asymmetry justifies the span volume.

Rationale for turning Redis off: the SSE subscriber spin-loop bug (fixed 2026-04-23) was emitting ~50-100k Redis get_message spans per second per open SSE connection. Redis instrumentation was effectively amplifying an already-pathological CPU pattern with per-call span overhead. Redis ops are sufficiently cheap and uniform that production issues surface at the request-level span first; there's no class of bug where "per-Redis-op span" is the load-bearing signal.

How to flip them temporarily. The env vars map to the on/off toggle. For a focused investigation:

# On the staging or prod server, edit .env:
LOGFIRE_INSTRUMENT_REDIS=true
LOGFIRE_INSTRUMENT_REDIS_WORKERS=true

# Restart backend + affected workers
docker compose up -d sapari-backend sapari-taskiq-<worker>

Remember to flip back after the investigation.

Media proxy observability (R2 Stage 5)

Two instrumented signals feed the R2 Stage 5 dashboards. See OBSERVABILITY.md for the full plan.

Backend mint latency (Logfire)

MediaTokenService.mint() is wrapped in a logfire.span("media_token.mint", ...) with attributes bucket, user_id, kid, expires_in. Span timing captures latency automatically. Exceptions raised inside the span are recorded as the span's error.

Saved queries to create in the Logfire UI (Workspaces → Saved queries):

Query name Filter Purpose
media_token.mint p50/p95/p99 (24h) span_name = 'media_token.mint', time = last 24h Stage 5 mint-latency gate
media_token.mint error rate (24h) span_name = 'media_token.mint' AND status = 'error' Spikes = mint path broken

Plan decision rule (OBSERVABILITY.md Phase 6): if p99 mint < 500ms AND p99 Worker verify < 100ms, drop PROXY_URL.REFRESH_BUFFER_SECONDS from 30 to 15.

Worker reject-reason logs (structured JSON)

/media/v1/* rejections emit single-line JSON to the Worker log stream. Shape:

{"type": "media_verify_rejected", "reason": "expired", "url": "https://staging.sapari.io/media/v1/<redacted>"}

Reasons are an enum (worker/src/constants.ts:RejectReason): - malformed — JWT header couldn't be parsed. - unknown_kidkid missing or not in the active registry. - expired — token expired. Expected on long sessions; frontend refreshes via 401. - invalid_signature — signature verification failed or issuer/claims mismatch. Real attacks land here — investigate if rate exceeds a trickle. - unknown_bucketbkt claim doesn't resolve to a known R2 binding. Config drift. - r2_miss — token verified but R2 has no object at key. Often a deleted-file + dangling-DB-reference bug.

To aggregate by reason on staging:

wrangler tail --env staging --format json \
  | jq -c 'select(.logs[]?.message[0] | fromjson? | .type == "media_verify_rejected")' \
  | jq -c '.logs[].message[0] | fromjson' \
  | jq -s 'group_by(.reason) | map({reason: .[0].reason, count: length})'

Tail for 24h after a deploy, then diff against the cf-analytics-*.md baseline — the sum of counts should ≈ the baseline's 4xx. If a large fraction lands in invalid_signature or r2_miss, root-cause before Stage 6.

Ops tools (Phase 1: Dozzle + Beszel)

Two self-hosted lightweight ops tools running on staging, gated behind Cloudflare Access. Both are operator-facing only — neither is a replacement for Logfire (application traces stay there).

URLs + access

Tool URL What it's for
Dozzle https://dozzle-staging.sapari.io Live container log tailing across every service. No retention.
Beszel https://beszel-staging.sapari.io System + container metrics with threshold alerts.

Auth: both gated by the Sapari Ops Cloudflare Access app (Zero Trust dashboard → Access → Applications). Identity provider is GitHub OAuth, attached to the devs only policy that gates by benavlabs/Sapari GitHub team membership. To grant a new team member access, add them to the GitHub team — no per-tool grant required.

Per-env hostnames + TLS are interpolated into the Caddyfile at deploy time from compose-level env vars: OPS_DOZZLE_DOMAIN, OPS_DOZZLE_TLS, OPS_BESZEL_DOMAIN, OPS_BESZEL_TLS. Set them in the env file (.env.staging / .env.production) so the same Caddyfile serves dozzle-staging.sapari.io on staging and a different hostname on production. When unset, the Caddyfile falls back to *.placeholder.invalid + tls internal — the vhost blocks parse but never serve real traffic.

Alert thresholds (configured in Beszel hub UI per system)

Alert Threshold Window Notes
Status (system up/down) on transition 2 min Filters out deploy-related blips (agent reconnects within ~30s after dc up -d).
CPU usage > 80% 5 min Average across cores.
Memory usage > 85% 5 min Backend container limit is 2 GB — 85% = 1.74 GB.
Disk usage > 87% 5 min ≈10 GB free on the 80 GB CCX23 staging disk; recompute if disk size changes.

Notification destination: Discord webhook configured under Beszel's Notification Settings in shoutrrr format discord://<token>@<webhook-id> (NOT the raw HTTPS Discord URL — shoutrrr expects its own URL scheme).

Runbook: Dozzle is unreachable

  1. CF Access page loads but 502 from Caddy. Container is down. SSH to staging, run docker compose -f docker-compose.prod.yml --env-file .env ps dozzle. Restart with docker compose -f docker-compose.prod.yml --env-file .env up -d dozzle.
  2. CF Access page doesn't load (browser shows CF error 525 or similar TLS error). Caddy is missing the cert for dozzle-staging.sapari.io. Check docker compose logs caddy --tail 30 for ACME issuance errors. If certs need re-issuance, force-recreate Caddy (see "Caddyfile changes" runbook below).
  3. ERR_CONNECTION_REFUSED in browser. DNS hasn't propagated, or CF proxy is gray cloud (must be orange). Check the DNS record's proxy status in Cloudflare dashboard.

Runbook: Beszel agent disconnected (red dot or system "Down" in hub UI)

  1. Container running? docker compose ps beszel-agent. If restart-looping, KEY or TOKEN is wrong (or the system was deleted from the hub UI).
  2. Logs. docker compose logs beszel-agent --tail 30. "WebSocket connected" = healthy. Auth-failure messages = re-bootstrap needed.
  3. Re-bootstrap if needed. Hub UI → systems list → if the system entry was deleted, "Add System" again to get a fresh KEY + TOKEN. Update BESZEL_AGENT_KEY and BESZEL_AGENT_TOKEN in ~/sapari/.env on the deploy server. Run docker compose -f docker-compose.prod.yml --env-file .env up -d beszel-agent. Watch for "WebSocket connected" in logs.

Runbook: Caddyfile changes don't take effect after deploy

The deploy script does git reset --hard, which atomically replaces caddy/Caddyfile via tempfile-rename. This changes the file's inode. Caddy's compose config bind-mounts the file (not the directory) at ./caddy/Caddyfile:/etc/caddy/Caddyfile:ro. Single-file Docker bind mounts are inode-locked, so the running Caddy container keeps serving the old file content even though the path inside the container resolves to the same host path. docker compose restart caddy does NOT fix this; only docker compose up -d --force-recreate caddy does (new container re-resolves the mount).

Recipe after any Caddyfile edit:

ssh deploy@<staging>
cd ~/sapari
docker compose -f docker-compose.prod.yml --env-file .env up -d --force-recreate caddy
docker compose -f docker-compose.prod.yml --env-file .env logs caddy --tail 30 | grep -E "obtain|certificate"  # watch cert issuance

Caddy will re-request Let's Encrypt certs for any new vhosts via DNS-01 (uses the CLOUDFLARE_API_TOKEN env). Issuance typically takes ~30s per cert.

⚠ Do not delete beszel-data/ while agents are connected

The ./beszel-data/ bind-mount on the deploy server holds the hub's PocketBase DB AND the auto-generated keypair. Deleting it regenerates the keypair and silently invalidates every connected agent's KEY env var — every agent goes red and re-bootstrap is required for each one. To clean per-system data while preserving auth, use the hub UI (Systems → select → Delete). To wipe everything intentionally, expect to re-run §5 Stage 5 of observability-phase-1-plan.md for every connected agent afterward.

Key Files

Component Location
Logging setup backend/src/infrastructure/logging/__init__.py
Logfire config backend/src/infrastructure/observability/__init__.py
Media token span backend/src/infrastructure/media_proxy/service.py:mint
Worker reject enum worker/src/constants.ts:RejectReason
Worker reject log worker/src/media.ts:logRejected
Ops tools compose blocks docker-compose.prod.yml (services: dozzle, beszel-hub, beszel-agent)
Ops tools Caddy vhosts caddy/Caddyfile (dozzle-staging.sapari.io, beszel-staging.sapari.io)
Beszel data directory (server-only, gitignored) ~/sapari/beszel-data/ on the deploy server

← Deployment Migrations →