Download Pipeline

When users upload a video or import from YouTube, we process it to extract audio, generate waveforms, and optionally create a web-compatible proxy.

Two Entry Points

YouTube Import

POST /api/v1/projects/{project_uuid}/clips/youtube-import
{ "url": "https://youtube.com/watch?v=..." }

The endpoint fetches video metadata inline, before responding (yt-dlp, run through asyncio.to_thread and bounded by asyncio.wait_for with YOUTUBE_METADATA_TIMEOUT_SECONDS). It rejects the request with 400 ValidationError if the video duration exceeds MAX_YOUTUBE_DURATION_SECONDS (3600, matching the render pipeline timeout), or if the metadata cannot be fetched or the fetch times out. Otherwise it creates a Clip and a ClipFile record and queues download_youtube_video. Each import creates a new ClipFile (per-user storage, no deduplication). Convention #13 applies: the duration cap is enforced against resolved video metadata, not against URL shape alone.
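The validation step can be sketched as follows. This is a minimal illustration, not the actual endpoint code: validate_youtube_duration, the injected fetch_metadata callable, the local ValidationError class, and the 15-second timeout value are all assumptions made for the example. In production the callable would presumably wrap yt-dlp's extract_info(url, download=False), whose result includes a duration field.

```python
import asyncio
from typing import Any, Callable, Dict

MAX_YOUTUBE_DURATION_SECONDS = 3600  # cap named in this document
YOUTUBE_METADATA_TIMEOUT_SECONDS = 15.0  # assumed value; only the setting name is documented


class ValidationError(Exception):
    """Stand-in for the API error that maps to HTTP 400."""


async def validate_youtube_duration(
    url: str,
    fetch_metadata: Callable[[str], Dict[str, Any]],
    timeout: float = YOUTUBE_METADATA_TIMEOUT_SECONDS,
) -> Dict[str, Any]:
    """Run a blocking metadata fetch off the event loop and enforce the cap."""
    try:
        # to_thread keeps the blocking call off the event loop;
        # wait_for bounds how long the request waits for it.
        info = await asyncio.wait_for(asyncio.to_thread(fetch_metadata, url), timeout)
    except asyncio.TimeoutError as exc:
        raise ValidationError("Timed out fetching video metadata") from exc
    except Exception as exc:
        raise ValidationError("Could not fetch video metadata") from exc

    duration = info.get("duration")
    if duration is None:
        raise ValidationError("Video duration is unknown")
    if duration > MAX_YOUTUBE_DURATION_SECONDS:
        raise ValidationError(
            f"Video is {duration}s long; the limit is {MAX_YOUTUBE_DURATION_SECONDS}s"
        )
    return info
```

Note that asyncio.wait_for only stops waiting; the worker thread itself keeps running until the blocking call returns, so the fetcher should also carry its own socket timeout.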

Presigned Upload

POST /api/v1/projects/{project_uuid}/clips/presign
{ "filename": "video.mp4", "content_type": "video/mp4" }

Returns a presigned URL. Client uploads directly to R2, then calls:

POST /api/v1/projects/{project_uuid}/clips/{clip_uuid}/confirm

This queues process_clip_artifacts.

download_youtube_video Task

Downloads from YouTube and processes the video:

  1. Download with yt-dlp - Up to 1080p, best audio
  2. Upload original - Store in R2 at clips/{prefix}/{uuid}/{filename}
  3. Extract audio - 16kHz mono MP3 for Whisper
  4. Generate waveform - Array of peaks for timeline visualization
  5. Check codecs - Determine if proxy is needed
  6. Update ClipFile - Store paths and metadata
  7. Publish ClipReadyEvent - Notify frontend

If the video codec isn't web-compatible (HEVC, ProRes), we queue generate_clip_proxy onto the separate proxy_broker (queue: proxy). Keeping proxy generation off the download queue means audio extraction for the next import doesn't have to wait behind a slow FFmpeg re-encode.

process_clip_artifacts Task

For user-uploaded videos (already in R2):

  1. Download from R2 - Fetch the uploaded file
  2. Extract audio - Same 16kHz mono MP3
  3. Generate waveform - Same peaks array
  4. Check codecs - Same compatibility check
  5. Update ClipFile - Store paths
  6. Publish ClipReadyEvent

generate_clip_proxy Task

Runs on the dedicated proxy_broker (queue: proxy), executed by the taskiq-proxy-worker container. CPU-heavy re-encodes can take 1-3× source duration, so decoupling this broker from download_broker prevents audio extraction for newly imported clips from queuing behind a long transcode.

Creates a web-compatible preview for videos that browsers can't play natively, plus a timeline scrub sprite for instant thumbnails:

  1. Download original - From R2
  2. Transcode + sprite - Chained FFmpeg pass that decodes the source once: 480p H.264 + AAC audio (with +faststart and dense keyframes via -g 60 -keyint_min 30 -sc_threshold 0, which gives roughly 2s seek granularity at 30 fps) plus a 10×20 grid sprite (160×90 tiles, JPEG) from the same input
  3. Upload both - Proxy to clips/{prefix}/{uuid}/proxy.mp4, sprite to clips/{prefix}/{uuid}/sprite.jpg
  4. Update ClipFile - Set proxy_key, sprite_key, and sprite_seconds_per_tile
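One way the single-decode pass in step 2 could be assembled is shown below. This is a sketch under assumptions, not the command the worker actually runs: the function name, the filter graph (split, scale, fps, tile), the libx264 and aac encoder choices, the optional audio mapping, and reading "10×20" as 10 columns by 20 rows are all illustrative guesses. Only the keyframe flags, +faststart, the 480p target, and the 160×90 tile size come from the description above.

```python
from typing import List


def build_proxy_and_sprite_cmd(
    src: str, proxy_out: str, sprite_out: str, seconds_per_tile: int
) -> List[str]:
    """Build one ffmpeg invocation that writes both the proxy and the sprite."""
    if seconds_per_tile < 1:
        raise ValueError("seconds_per_tile must be >= 1")

    # Decode once, then split the video stream into two branches.
    filter_graph = (
        "[0:v]split=2[vproxy][vsprite];"
        "[vproxy]scale=-2:480[proxy];"
        # One frame every N seconds, shrunk to a tile, packed into a grid.
        f"[vsprite]fps=1/{seconds_per_tile},scale=160:90,tile=10x20[sprite]"
    )
    return [
        "ffmpeg", "-y", "-i", src,
        "-filter_complex", filter_graph,
        # Output 1: the web-compatible proxy.
        "-map", "[proxy]", "-map", "0:a?",
        "-c:v", "libx264",
        "-g", "60", "-keyint_min", "30", "-sc_threshold", "0",
        "-c:a", "aac",
        "-movflags", "+faststart",
        proxy_out,
        # Output 2: a single JPEG holding the whole tile grid.
        "-map", "[sprite]", "-frames:v", "1",
        sprite_out,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True). Because both outputs hang off the same -i, the source is decoded only once even though two files are written.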

The proxy is used for preview in the timeline. The sprite powers instant scrub thumbnails while dragging the playhead. The original is used for final render. Sprite density (seconds_per_tile) is chosen at generation time as max(1, ceil(duration_s / 200)). Short clips get 1s/tile; longer clips use proportionally more seconds per tile so that the sprite never needs more than the fixed 10×20 grid (200 tiles). See TIER_3_SPRITE_PLAN.md for the full design.
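The density formula is small enough to check by hand. The sketch below restates it in Python; the function and constant names are illustrative, and only the formula itself comes from the text above.

```python
import math

SPRITE_TILES = 10 * 20  # fixed grid size: 200 tiles


def sprite_seconds_per_tile(duration_s: float) -> int:
    """Seconds of video covered by each sprite tile."""
    return max(1, math.ceil(duration_s / SPRITE_TILES))


def tiles_used(duration_s: float) -> int:
    """How many tiles a clip of this length actually fills."""
    return math.ceil(duration_s / sprite_seconds_per_tile(duration_s))
```

For example, a 90-second clip gets 1 second per tile and fills 90 tiles, a 201-second clip gets 2 seconds per tile and fills 101 tiles, and a clip at the 3600-second import cap gets 18 seconds per tile and fills exactly 200.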

Audio Extraction

We use FFmpeg to extract audio in Whisper-compatible format:

ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 -f mp3 output.mp3
  • -ar 16000 - 16kHz sample rate (Whisper requirement)
  • -ac 1 - Mono channel
  • -f mp3 - MP3 format (good compression, fast)
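From a worker, the same command might be invoked roughly like this. The function names are hypothetical, and the -y flag (overwrite an existing output file) is an addition for the example that does not appear in the command above.

```python
import subprocess
from typing import List


def build_audio_extract_cmd(src: str, dst: str) -> List[str]:
    """Argument list for the 16 kHz mono MP3 extraction."""
    return [
        "ffmpeg", "-y",   # -y is an assumption: overwrite dst if it exists
        "-i", src,
        "-vn",            # drop the video stream
        "-ar", "16000",   # 16 kHz sample rate
        "-ac", "1",       # mono
        "-f", "mp3",
        dst,
    ]


def extract_audio(src: str, dst: str) -> None:
    """Run ffmpeg and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_audio_extract_cmd(src, dst), check=True, capture_output=True)
```

Passing the arguments as a list avoids shell quoting problems with user-supplied filenames.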

Waveform Generation

The waveform is an array of peak amplitudes used for the timeline visualization:

[0.12, 0.45, 0.78, 0.32, 0.91, ...]

We generate ~100 peaks per second of video. The frontend renders these as vertical bars in the timeline.
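A minimal sketch of how such peaks can be computed is shown below. It assumes the audio has already been decoded to floating-point samples normalized to the range -1.0 to 1.0, and it takes the maximum absolute sample in each window; the function name, the windowing scheme, and the two-decimal rounding are assumptions, not a description of waveform.py.

```python
from typing import List, Sequence


def compute_peaks(
    samples: Sequence[float], sample_rate: int, peaks_per_second: int = 100
) -> List[float]:
    """Reduce raw samples to one peak amplitude per fixed-size window."""
    if sample_rate <= 0 or peaks_per_second <= 0:
        raise ValueError("sample_rate and peaks_per_second must be positive")

    # At 16 kHz and 100 peaks/s, each window spans 160 samples.
    window = max(1, sample_rate // peaks_per_second)
    peaks = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        peaks.append(round(max(abs(s) for s in chunk), 2))
    return peaks
```

At 100 peaks per second, a one-hour clip yields 360,000 values, which is why rounding to two decimals is worth considering before serializing to waveform.json.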

Codec Compatibility

Web browsers can play:
  • Video: H.264, VP8, VP9
  • Audio: AAC, MP3, Opus

If we detect incompatible codecs (HEVC, ProRes, or AV1, which only some browsers support), we generate a proxy for preview. The original stays intact for rendering.
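The check can be expressed as a simple allowlist. The sketch below is illustrative: the function name is hypothetical, the codec strings are the lowercase names ffprobe reports in codec_name, and treating a missing audio track as compatible is an assumption.

```python
from typing import Optional

# Allowlists taken from the codec lists above, as ffprobe codec_name values.
WEB_VIDEO_CODECS = frozenset({"h264", "vp8", "vp9"})
WEB_AUDIO_CODECS = frozenset({"aac", "mp3", "opus"})


def needs_proxy(video_codec: Optional[str], audio_codec: Optional[str]) -> bool:
    """Return True when the original cannot be previewed directly in a browser."""
    if video_codec is None or video_codec.lower() not in WEB_VIDEO_CODECS:
        return True
    # Assumption: a clip without an audio track does not need a proxy.
    if audio_codec is not None and audio_codec.lower() not in WEB_AUDIO_CODECS:
        return True
    return False
```

An allowlist errs on the side of generating a proxy: AV1 and any codec not listed fall through to the transcode path, which costs CPU time but never leaves the timeline with an unplayable preview.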

Storage Keys

clips/{prefix}/{uuid}/{original_filename}   # Original video
clips/{prefix}/{uuid}/audio.mp3             # Extracted audio
clips/{prefix}/{uuid}/proxy.mp4             # Web-compatible proxy
clips/{prefix}/{uuid}/sprite.jpg            # Timeline scrub sprite (10x20 grid)
clips/{prefix}/{uuid}/waveform.json         # Peak data

The {prefix} is the first 2 characters of the UUID, which helps S3-compatible storage (R2 in our case) distribute files across partitions.
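A helper that derives every key for a clip from its UUID might look like the sketch below. The function name and the returned dictionary's field names are hypothetical, and whether the UUID in the path is the Clip's or the ClipFile's is not stated here, so the example just takes a UUID string.

```python
from typing import Dict


def clip_storage_keys(clip_uuid: str, original_filename: str) -> Dict[str, str]:
    """Build the object keys listed above for one clip."""
    prefix = clip_uuid[:2]  # first 2 characters of the UUID
    base = f"clips/{prefix}/{clip_uuid}"
    return {
        "original": f"{base}/{original_filename}",
        "audio": f"{base}/audio.mp3",
        "proxy": f"{base}/proxy.mp4",
        "sprite": f"{base}/sprite.jpg",
        "waveform": f"{base}/waveform.json",
    }
```

For the UUID 3fa85f64-5717-4562-b3fc-2c963f66afa6 the audio key becomes clips/3f/3fa85f64-5717-4562-b3fc-2c963f66afa6/audio.mp3.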

Key Files

Component            Location
Download task        backend/src/workers/download/tasks.py:download_youtube_video
Process task         backend/src/workers/download/tasks.py:process_clip_artifacts
Proxy task           backend/src/workers/download/tasks.py:generate_clip_proxy
yt-dlp wrapper       backend/src/workers/download/youtube.py
Audio extraction     backend/src/workers/download/audio.py
Waveform generation  backend/src/infrastructure/waveform.py