Media Token Secret Rotation¶
Annual (plus on-incident) rotation of the MEDIA_TOKEN_SECRET used by
MediaTokenService (backend) and the Cloudflare Worker at /media/v1/*.
This doc is the runbook — follow it as written.
Why rotate¶
Two reasons:
- Exercise the mechanism. The procedure must work at 3am during an actual compromise. A rotation that's never been run will fail when it matters. Annual keeps the muscle memory alive without being busywork.
- Shrink the leak window. If a secret ever escapes (log dump, env var accidentally committed, etc.), regular rotation caps how long the leaked value stays usable, because tokens forged with a retired secret stop verifying.
HS256 with a 32-byte random secret has no meaningful cryptographic wear — the math is fine indefinitely. Rotation is about operational readiness, not mathematical hardening.
Prerequisites (one-time, already in place)¶
These were built into the service at Stage 2 so rotation is a pure ops procedure — no code change required:
- JWT header always includes kid (key id). Backend encodes with the current MEDIA_TOKEN_KID; Worker picks the secret from a kid → secret registry based on the incoming token's header.
- verify() pins algorithms=["HS256"] explicitly and never trusts header.alg. This prevents the classic alg: none attack.
- Startup registry log on both backend and Worker: media_token: active=<kid> registry=[kid:fp, ...] at INFO. <fp> is the first 8 hex chars of sha256(secret). Drift diagnostic is an eyeball-diff between the two streams; no tooling needed.
Rotation procedure (~30 minutes)¶
Step 1 — Mint the new secret¶
NEW_SECRET=$(openssl rand -base64 32)
echo "$NEW_SECRET" | head -c 10 # Sanity check: see first 10 chars
Store the full value somewhere retrievable for the next ~25 minutes (1Password, Bitwarden, whatever). It gets copy-pasted into two places below and then never typed again.
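It can help to know what fingerprint to expect for the new secret before pasting it anywhere. The sketch below follows the definition from the prerequisites (first 8 hex chars of sha256(secret)). It assumes the services hash the UTF-8 bytes of the secret string exactly as stored, which this runbook does not confirm, and the helper names are hypothetical.

```python
import hashlib


def fingerprint(secret: str) -> str:
    """First 8 hex chars of sha256 over the secret's UTF-8 bytes (assumed encoding)."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:8]


def registry_entry(kid: str, secret: str) -> str:
    """Format one kid:fp pair the way the startup log line shows it."""
    return f"{kid}:{fingerprint(secret)}"


# A stray trailing newline (easy to pick up when copy-pasting) changes the
# fingerprint, so a mismatch later may simply mean whitespace crept in.
assert fingerprint("abc") != fingerprint("abc\n")
```

If the value logged for v2 in Steps 2 and 3 differs from the locally computed one, the first thing to check under these assumptions is whether whitespace was pasted along with the secret.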
Step 2 — Stage on Worker first, then backend¶
Order matters. Stage the Worker first so that by the time the backend starts
minting with kid=v2, the Worker already has the secret ready to verify. The
reverse order creates a window where fresh tokens fail to verify because the
Worker does not know v2 yet.
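The ordering argument can be checked with a toy model. This is only a sketch: the step names `stage_worker` and `flip_backend` are made up for illustration, and the model tracks nothing beyond which kids the Worker knows and which kid the backend is signing with.

```python
from typing import List, Tuple


def simulate(order: List[str]) -> List[Tuple[str, bool]]:
    """After each step, report whether freshly minted tokens would verify."""
    worker_kids = {"v1"}   # kids the Worker can verify
    backend_kid = "v1"     # kid the backend currently signs with
    results = []
    for step in order:
        if step == "stage_worker":
            worker_kids.add("v2")
        elif step == "flip_backend":
            backend_kid = "v2"
        else:
            raise ValueError(f"unknown step: {step}")
        results.append((step, backend_kid in worker_kids))
    return results


# Worker first: fresh tokens verify after every step.
print(simulate(["stage_worker", "flip_backend"]))
# Backend first: fresh tokens fail until the Worker catches up.
print(simulate(["flip_backend", "stage_worker"]))
```

In the backend-first run, the first entry is `("flip_backend", False)`, which is the failure window described above.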
# Worker (production): add v2 secret; keep v1 in place
cd worker/
npx wrangler secret put MEDIA_TOKEN_SECRET_V2 --env production
# Paste $NEW_SECRET when prompted
# Verify the Worker deployed with both secrets — next cold-start log must show:
# media_token: active=v1 registry=[v1:<fp1>, v2:<fp2>]
npx wrangler tail --env production --format pretty | grep media_token
Wait for at least one media_token log line before moving on. If only v1
appears in the registry, the Worker hasn't picked up the new secret — retry
the wrangler secret put and force a redeploy if needed.
Step 3 — Flip the backend¶
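Update the production env file / secret manager so the backend signs with the
new key: set MEDIA_TOKEN_SECRET to the new secret and MEDIA_TOKEN_KID to v2,
using whatever secret-management flow the production backend uses. For Hetzner,
edit the .env and restart the backend service.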
Verify fingerprints match across backend and Worker:
# Backend (on the production server):
journalctl -u sapari-backend --since "5 minutes ago" | grep media_token
# Expect: media_token: active=v2 registry=[v2:<fp2>]
# Worker:
npx wrangler tail --env production --format pretty | grep media_token
# Expect: media_token: active=v1 registry=[v1:<fp1>, v2:<fp2>]
The v2:<fp2> hash must be identical on both sides. If it isn't, the two
environments have different MEDIA_TOKEN_SECRET values — stop and reconcile
before proceeding.
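The eyeball-diff is enough in practice, but if the comparison ever needs to be scripted, a sketch along these lines could work. It assumes the log line has exactly the shape given in the prerequisites (media_token: active=<kid> registry=[kid:fp, ...]). The function names and the fingerprint values in the usage example are made up.

```python
import re
from typing import Dict, Tuple

_LINE = re.compile(r"media_token: active=(\S+) registry=\[([^\]]*)\]")


def parse_registry_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Extract the active kid and the kid -> fingerprint map from one log line."""
    match = _LINE.search(line)
    if match is None:
        raise ValueError("not a media_token registry line")
    active, body = match.groups()
    registry = {}
    for entry in body.split(","):
        entry = entry.strip()
        if entry:
            kid, _, fp = entry.partition(":")
            registry[kid] = fp
    return active, registry


def fingerprint_drift(backend_line: str, worker_line: str) -> Dict[str, Tuple[str, str]]:
    """Return kids present on both sides whose fingerprints differ."""
    _, backend = parse_registry_line(backend_line)
    _, worker = parse_registry_line(worker_line)
    shared = backend.keys() & worker.keys()
    return {kid: (backend[kid], worker[kid]) for kid in shared if backend[kid] != worker[kid]}


def backend_active_known_to_worker(backend_line: str, worker_line: str) -> bool:
    """True if the Worker registry contains the kid the backend is signing with."""
    active, _ = parse_registry_line(backend_line)
    _, worker = parse_registry_line(worker_line)
    return active in worker
```

For example, `fingerprint_drift("media_token: active=v2 registry=[v2:9f2c1ab0]", "media_token: active=v1 registry=[v1:03de77aa, v2:9f2c1ab0]")` returns an empty dict, meaning the two sides agree on v2.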
Step 4 — Pin a calendar reminder at +24h¶
During the 24h transition window, every Worker redeploy MUST retain both
secrets in the env. A redeploy that drops MEDIA_TOKEN_SECRET_V1 in the
middle of this window invalidates all in-flight tokens signed with v1 (the
frontend retry layer refetches, so it's not user-catastrophic — but it's
avoidable noise).
Set a reminder in your calendar NOW titled "Remove MEDIA_TOKEN_SECRET_V1 from Worker" with the URL of the Worker dashboard. Don't rely on memory.
If you deploy for any reason during the window, verify after each deploy:
npx wrangler tail --env production --format pretty | grep media_token
# Both v1 and v2 must still appear in the registry
Step 5 — Remove v1 (after 24h)¶
The 24h wait lets any v1-signed tokens in flight (already issued to clients) expire naturally. Max TTL is 5 minutes, so 24h is a wide safety margin.
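The margin can be put in numbers. The sketch below assumes the last v1 token is minted no later than the moment the backend flip takes effect, so every v1 token has expired one max TTL after that. The constant and function names are illustrative, not taken from the codebase.

```python
from datetime import datetime, timedelta, timezone

MAX_TOKEN_TTL = timedelta(minutes=5)      # max TTL stated in this runbook
TRANSITION_WINDOW = timedelta(hours=24)   # scheduled wait before removing v1


def earliest_safe_v1_removal(backend_flipped_at: datetime) -> datetime:
    """Point after which no v1 token can still be unexpired (under the assumption above)."""
    return backend_flipped_at + MAX_TOKEN_TTL


def scheduled_v1_removal(backend_flipped_at: datetime) -> datetime:
    """Point at which the routine procedure removes v1."""
    return backend_flipped_at + TRANSITION_WINDOW


# The routine window is 288 times the max TTL.
margin = TRANSITION_WINDOW / MAX_TOKEN_TTL
assert margin == 288
```

The 5-minute wait in the on-incident variant corresponds to `earliest_safe_v1_removal`; the routine 24h schedule trades speed for slack against slow restarts or a missed backend instance.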
# Worker: remove v1 secret
cd worker/
npx wrangler secret delete MEDIA_TOKEN_SECRET_V1 --env production
# Verify:
npx wrangler tail --env production --format pretty | grep media_token
# Expect: media_token: active=v2 registry=[v2:<fp2>]
Tokens still in flight with kid=v1 now fail verification at the Worker
with 401 → frontend's refreshTokenAndResume handler refetches a new URL
with kid=v2. Users see at most one stall frame.
Backend can also drop MEDIA_TOKEN_KID_V1 from any registry extension code
(Stage 2 launch state has only v1, so there's nothing to drop yet; later
rotations will have more kids to clean up).
Step 6 — Document the completed rotation¶
Update INFRASTRUCTURE_PROVISIONING_PLAN.md §1.2 with the new active kid.
Add a line to the rotation log below.
Rotation log¶
Append to this list on each completed rotation. The point is audit trail — "when was the last rotation" is a question that comes up during incidents.
| Date | From | To | Operator | Notes |
|---|---|---|---|---|
| (first rotation: add entry here) | | | | |
On-incident rotation (compromise suspected)¶
Same procedure as above, with Step 4's 24h wait shortened or skipped depending on risk:
- Confidential compromise (log dump, screenshot of env, secret in a commit that was reverted but may have been scraped): wait 5 minutes after Step 3 to let in-flight tokens expire naturally, then Step 5. Users retry once via the frontend handler.
- Active compromise (you know an attacker is using the old secret right now): skip directly to Step 5 after Step 3. Delete v1 immediately. All in-flight tokens (legitimate AND attacker-held) become invalid at once. Users retry once; the attacker has to steal a fresh session to get a new token.
In either case, file an incident ticket, rotate any secrets the compromised material could have touched (SECRET_KEY for sessions, database creds, etc.), and review access logs for the blast radius.
Related¶
- cloudflare-workers.md: full deploy + routing runbook for the Worker (route patterns, dashboard-managed bindings, troubleshooting). This rotation runbook assumes the Worker is already provisioned; that doc is where to look if something's broken outside the rotation procedure.
- R2_MEDIA_PROXY_PLAN.md: overall architecture of the Worker-fronted media proxy. This runbook is the canonical rotation procedure; the plan points back here for the actual steps.
- backend/src/infrastructure/media_proxy/service.py: MediaTokenService with the kid → secret registry.
- worker/src/index.ts: Worker-side registry (added in Stage 3).