Media Token Secret Rotation

Annual (plus on-incident) rotation of the MEDIA_TOKEN_SECRET used by MediaTokenService (backend) and the Cloudflare Worker at /media/v1/*. This doc is the runbook — follow it as written.

Why rotate

Two reasons:

  1. Exercise the mechanism. The procedure must work at 3am during an actual compromise. A rotation that's never been run will fail when it matters. Annual keeps the muscle memory alive without being busywork.
  2. Shrink the leak window. If a secret ever escapes (log dump, env var accidentally committed, etc.), regular rotation caps how long the leaked value can be used to forge tokens.

HS256 with a 32-byte random secret has no meaningful cryptographic wear — the math is fine indefinitely. Rotation is about operational readiness, not mathematical hardening.

Prerequisites (one-time, already in place)

These were built into the service at Stage 2 so rotation is a pure ops procedure — no code change required:

  • JWT header always includes kid (key id). Backend encodes with the current MEDIA_TOKEN_KID; Worker picks the secret from a kid → secret registry based on the incoming token's header.
  • verify() pins algorithms=["HS256"] explicitly — never trusts header.alg. Prevents the classic alg: none attack.
  • Startup registry log on both backend and Worker: media_token: active=<kid> registry=[kid:fp, ...] at INFO. <fp> is the first 8 hex chars of sha256(secret). Drift diagnostic is an eyeball-diff between the two streams — no tooling needed.

Rotation procedure (~30 minutes)

Step 1 — Mint the new secret

NEW_SECRET=$(openssl rand -base64 32)
echo "$NEW_SECRET" | head -c 10  # Sanity check: see first 10 chars

Store the full value somewhere retrievable for the next ~25 minutes (1Password, Bitwarden, whatever). It gets copy-pasted into two places below and then never typed again.
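
If you want to confirm that a stored or pasted value is intact without printing any of it, a small check along these lines may help. This is a sketch, not part of the service: the function names are hypothetical, and it only assumes that the secret is the base64 text produced by openssl rand -base64 32.

```python
import base64
import secrets


def mint_secret() -> str:
    """Python equivalent of `openssl rand -base64 32`: 32 random bytes, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


def looks_like_32_byte_secret(value: str) -> bool:
    """True if value is valid base64 that decodes to exactly 32 bytes."""
    try:
        # strip() tolerates a trailing newline picked up during copy-paste.
        raw = base64.b64decode(value.strip(), validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return len(raw) == 32
```

A truncated paste (the most likely copy-paste failure) fails this check, which catches the problem before the value reaches the Worker or the backend.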

Step 2 — Stage on Worker first, then backend

Order matters. Stage the Worker first so that by the time the backend starts minting with kid=v2, the Worker already holds the v2 secret and can verify those tokens. The reverse order creates a window where fresh tokens fail to verify.

# Worker (production): add v2 secret; keep v1 in place
cd worker/
npx wrangler secret put MEDIA_TOKEN_SECRET_V2 --env production
# Paste $NEW_SECRET when prompted

# Verify the Worker deployed with both secrets — next cold-start log must show:
# media_token: active=v1 registry=[v1:<fp1>, v2:<fp2>]
npx wrangler tail --env production --format pretty | grep media_token

Wait for at least one media_token log line before moving on. If only v1 appears in the registry, the Worker hasn't picked up the new secret — retry the wrangler secret put and force a redeploy if needed.
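
The ordering argument in this step can be illustrated with a toy model. The sketch below is purely illustrative: the step names and functions are hypothetical, and the Worker is reduced to a set of known kids.

```python
def worker_can_verify(token_kid: str, worker_registry: set[str]) -> bool:
    """The Worker can only verify a token whose kid is in its registry."""
    return token_kid in worker_registry


def simulate(steps: list[str]) -> list[bool]:
    """Replay rollout steps; after each one, report whether a fresh token verifies."""
    backend_kid = "v1"          # kid the backend currently mints with
    worker_registry = {"v1"}    # kids the Worker holds a secret for
    results = []
    for step in steps:
        if step == "stage_worker":
            worker_registry.add("v2")
        elif step == "flip_backend":
            backend_kid = "v2"
        else:
            raise ValueError(f"unknown step: {step}")
        results.append(worker_can_verify(backend_kid, worker_registry))
    return results
```

simulate(["stage_worker", "flip_backend"]) returns [True, True], while simulate(["flip_backend", "stage_worker"]) returns [False, True]: the False is the window in which freshly minted v2 tokens cannot be verified.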

Step 3 — Flip the backend

Update the production env file / secret manager:

MEDIA_TOKEN_SECRET=<new secret from step 1>
MEDIA_TOKEN_KID=v2

(Or whatever secret-management flow the production backend uses. For Hetzner, edit the .env and restart the backend service.)

Verify fingerprints match across backend and Worker:

# Backend (on the production server):
journalctl -u sapari-backend --since "5 minutes ago" | grep media_token
# Expect: media_token: active=v2 registry=[v2:<fp2>]

# Worker:
npx wrangler tail --env production --format pretty | grep media_token
# Expect: media_token: active=v1 registry=[v1:<fp1>, v2:<fp2>]

The v2:<fp2> fingerprint must be identical on both sides. If it isn't, the backend's MEDIA_TOKEN_SECRET and the Worker's MEDIA_TOKEN_SECRET_V2 hold different values. Stop and reconcile before proceeding.
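
The eyeball-diff is enough in practice, but if you prefer a mechanical check, the comparison can be sketched as below. This assumes the log line format quoted above; the function names are hypothetical and the script is not part of the repo.

```python
import re

# Matches the registry line anywhere in a log record (journalctl adds a prefix).
_LINE = re.compile(r"media_token: active=(?P<active>\S+) registry=\[(?P<entries>[^\]]*)\]")


def parse_registry_line(line: str) -> tuple[str, dict[str, str]]:
    """Return (active_kid, {kid: fingerprint}) from one media_token log line."""
    match = _LINE.search(line)
    if match is None:
        raise ValueError("no media_token registry entry in line")
    registry = {}
    for entry in match.group("entries").split(","):
        entry = entry.strip()
        if entry:
            kid, _, fp = entry.partition(":")
            registry[kid] = fp
    return match.group("active"), registry


def drift_problems(backend_line: str, worker_line: str) -> list[str]:
    """List the reasons the two registries disagree; empty means they line up."""
    backend_active, backend = parse_registry_line(backend_line)
    _, worker = parse_registry_line(worker_line)
    problems = []
    if backend_active not in worker:
        problems.append(f"worker registry is missing active kid {backend_active}")
    for kid in sorted(backend.keys() & worker.keys()):
        if backend[kid] != worker[kid]:
            problems.append(
                f"fingerprint mismatch for {kid}: backend={backend[kid]} worker={worker[kid]}"
            )
    return problems
```

Paste one line from each log stream into drift_problems; any non-empty result means stop and reconcile before continuing.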

Step 4 — Pin a calendar reminder at +24h

During the 24h transition window, every Worker redeploy MUST retain both secrets in the env. A redeploy that drops MEDIA_TOKEN_SECRET_V1 in the middle of this window invalidates all in-flight tokens signed with v1 (the frontend retry layer refetches, so it's not user-catastrophic — but it's avoidable noise).

Set a reminder in your calendar NOW titled "Remove MEDIA_TOKEN_SECRET_V1 from Worker" with the URL of the Worker dashboard. Don't rely on memory.

If you deploy for any reason during the window, verify after each deploy:

npx wrangler tail --env production --format pretty | grep media_token
# Both v1 and v2 must still appear in the registry

Step 5 — Remove v1 (after 24h)

The 24h wait lets any v1-signed tokens in flight (already issued to clients) expire naturally. Max TTL is 5 minutes, so 24h is a wide safety margin.

# Worker: remove v1 secret
cd worker/
npx wrangler secret delete MEDIA_TOKEN_SECRET_V1 --env production

# Verify:
npx wrangler tail --env production --format pretty | grep media_token
# Expect: media_token: active=v2 registry=[v2:<fp2>]

Any token still in flight with kid=v1 now fails verification at the Worker with a 401. The frontend's refreshTokenAndResume handler then refetches a new URL with kid=v2, so users see at most one stall frame.

The backend can also drop MEDIA_TOKEN_KID_V1 from any registry extension code (the Stage 2 launch state has only v1, so there is nothing to drop yet; later rotations will have more kids to clean up).

Step 6 — Document the completed rotation

Update INFRASTRUCTURE_PROVISIONING_PLAN.md §1.2 with the new active kid. Add a line to the rotation log below.

Rotation log

Append to this list on each completed rotation. The point is audit trail — "when was the last rotation" is a question that comes up during incidents.

| Date | From | To | Operator | Notes |
|------|------|----|----------|-------|
| (first rotation: add entry here) | | | | |

On-incident rotation (compromise suspected)

Same procedure as above, with Step 4's 24h wait shortened or skipped depending on risk:

  • Confidential compromise (log dump, screenshot of env, secret in a commit that was reverted but may have been scraped): wait 5 minutes (the max token TTL) after Step 3 so in-flight v1 tokens expire naturally, then run Step 5. Any client still holding a v1 token retries once via the frontend handler.
  • Active compromise (you know an attacker is using the old secret right now): skip directly to Step 5 after Step 3. Delete v1 immediately. All in-flight tokens (legitimate AND attacker-held) become invalid at once. Users retry once; the attacker has to steal a fresh session to get a new token.

In either case, file an incident ticket, rotate any secrets the compromised material could have touched (SECRET_KEY for sessions, database creds, etc.), and review access logs for the blast radius.
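
The wait rules for routine and on-incident rotations reduce to simple date arithmetic. A minimal sketch, assuming the 5-minute max TTL and 24h routine window stated in this runbook; the mode names and the function are hypothetical labels for the three cases.

```python
from datetime import datetime, timedelta

MAX_TOKEN_TTL = timedelta(minutes=5)   # max token TTL stated in Step 5
ROUTINE_WAIT = timedelta(hours=24)     # routine transition window

# Hypothetical mode names for the three cases described in this runbook.
WAITS = {
    "routine": ROUTINE_WAIT,        # scheduled rotation
    "confidential": MAX_TOKEN_TTL,  # secret exposed, no known active use
    "active": timedelta(0),         # attacker is using the old secret now
}


def earliest_v1_removal(flip_time: datetime, mode: str) -> datetime:
    """Earliest time to run Step 5, given when Step 3 (the backend flip) completed."""
    if mode not in WAITS:
        raise ValueError(f"unknown mode: {mode}")
    return flip_time + WAITS[mode]
```

For a flip at 12:00, the routine case gives 12:00 the next day, the confidential case 12:05, and the active case 12:00 (delete v1 immediately).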

  • cloudflare-workers.md: full deploy + routing runbook for the Worker (route patterns, dashboard-managed bindings, troubleshooting). This rotation runbook assumes the Worker is already provisioned; that doc is where to look if something's broken outside the rotation procedure.
  • R2_MEDIA_PROXY_PLAN.md: overall architecture of the Worker-fronted media proxy. This runbook is the canonical rotation procedure; the plan points back here for the actual steps.
  • backend/src/infrastructure/media_proxy/service.py: MediaTokenService with the kid → secret registry.
  • worker/src/index.ts: Worker-side registry (added in Stage 3).