
Detection

This page covers how Sapari detects regions to cut from videos. There are two types of detection: silence detection (gaps between words) and false start detection (repeated/abandoned phrases).

Silence Detection

Silence detection finds pauses in speech that can be removed to tighten the video. We use two complementary methods.

Word-Gap Analysis (Primary)

The primary method analyzes gaps between words in the Whisper transcript. This is reliable because Whisper already validated there's no speech in those gaps.

# Whisper gives us word-level timing:
[
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 2.3, "end": 2.8},  # 1.8s gap!
]

If the gap between "Hello" and "world" exceeds the threshold, we mark it as a silence region.

The threshold is controlled by pacing_level (0-100):

| Pacing Level | Threshold | Effect |
|---|---|---|
| 0 | 3000 ms | Keep natural pauses up to 3 seconds |
| 50 | 1650 ms | Balanced |
| 100 | 300 ms | Remove almost all pauses |
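The thresholds in the table fall on a straight line, so the mapping can be sketched as a linear interpolation plus a scan over consecutive word gaps. This is an illustrative sketch, not the actual Sapari implementation; the function names (`gap_threshold_ms`, `find_word_gap_silences`) are invented for the example.

```python
def gap_threshold_ms(pacing_level: int) -> int:
    """Linear interpolation consistent with the table above:
    0 -> 3000 ms, 50 -> 1650 ms, 100 -> 300 ms."""
    return 3000 - 27 * pacing_level

def find_word_gap_silences(words, pacing_level=50):
    """Return (start_s, end_s) gaps between consecutive words
    that meet or exceed the pacing threshold."""
    threshold_s = gap_threshold_ms(pacing_level) / 1000
    silences = []
    for prev, curr in zip(words, words[1:]):
        gap = curr["start"] - prev["end"]
        if gap >= threshold_s:
            silences.append((prev["end"], curr["start"]))
    return silences

words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 2.3, "end": 2.8},
]
print(find_word_gap_silences(words, pacing_level=50))  # [(0.5, 2.3)]
```

At pacing level 50 the 1.8 s gap exceeds the 1650 ms threshold and is flagged; at level 0 it falls under the 3000 ms threshold and is kept.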

We also detect silence at the start and end of the video, with configurable padding so cuts don't feel too abrupt.

Audio Waveform Analysis (Secondary)

For silences that Whisper might miss (long pauses with background noise), we use FFmpeg's silencedetect filter:

ffmpeg -i audio.mp3 -af "silencedetect=n=-30dB:d=1.5" -f null -

The threshold is dynamic: we measure the audio's mean volume and set the silence threshold relative to it. This handles videos with different recording levels.

Audio-detected silences are validated against the waveform to filter out false positives (quiet speech marked as silence). We extract RMS amplitudes and reject regions with speech peaks.
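The silencedetect filter reports regions on stderr as `silence_start:` / `silence_end:` lines. A minimal sketch of turning that output into `(start, end)` regions (not the actual Sapari parser) could look like:

```python
import re

# Matches the "silence_start: 3.456" / "silence_end: 5.1" lines that
# FFmpeg's silencedetect filter prints to stderr.
START_RE = re.compile(r"silence_start:\s*([\d.]+)")
END_RE = re.compile(r"silence_end:\s*([\d.]+)")

def parse_silencedetect(stderr: str):
    """Pair up silence_start/silence_end timestamps into regions."""
    starts = [float(m) for m in START_RE.findall(stderr)]
    ends = [float(m) for m in END_RE.findall(stderr)]
    return list(zip(starts, ends))

sample = """\
[silencedetect @ 0x55] silence_start: 3.456
[silencedetect @ 0x55] silence_end: 5.1 | silence_duration: 1.644
"""
print(parse_silencedetect(sample))  # [(3.456, 5.1)]
```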

Merging Results

Word-gap silences take priority since they're more reliable. Audio silences are only added if they:

  1. Don't overlap with existing word-gap silences
  2. Don't overlap with any transcript words

The final silences are sorted by start time and returned as SilenceRegion objects with confidence scores.
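The merge rule above can be sketched as a simple interval check. This is an illustrative sketch assuming half-open `(start, end)` intervals; `merge_silences` and `overlaps` are invented names, and the real code returns `SilenceRegion` objects rather than tuples.

```python
def overlaps(a, b):
    """True if intervals (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def merge_silences(word_gap, audio, words):
    """Word-gap silences always survive; audio silences are added only
    if they overlap neither word-gap silences nor transcript words."""
    word_spans = [(w["start"], w["end"]) for w in words]
    merged = list(word_gap)
    for region in audio:
        if any(overlaps(region, g) for g in word_gap):
            continue
        if any(overlaps(region, s) for s in word_spans):
            continue
        merged.append(region)
    return sorted(merged)
```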

False Start Detection

False start detection uses an LLM to find patterns where the speaker started saying something, stopped, and tried again.

What We're Looking For

Common patterns:

"I think... I think we should" → Cut first "I think..."
"So the, the important thing" → Cut first "the,"
"This is X. This is X. This is X. And here's why" → Cut first two "This is X."

The key insight: keep the last instance that flows into new content, cut everything before it.

Chunked Detection

For long transcripts, we split words into overlapping chunks (~800 words each) and process them independently. Short transcripts are processed in a single pass. The prompt provides:

  1. Plain text transcript (for readability)
  2. Same transcript with [N] word indices (for precise cuts)
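Overlapping chunking can be sketched as below. The ~800-word chunk size comes from the text above; the 50-word overlap and the function name are assumptions for the example.

```python
CHUNK_SIZE = 800   # from the docs: ~800 words per chunk
OVERLAP = 50       # assumed overlap between adjacent chunks

def chunk_words(words, chunk_size=CHUNK_SIZE, overlap=OVERLAP):
    """Return (offset, chunk) pairs; the offset lets per-chunk word
    indices be mapped back to global transcript indices."""
    if len(words) <= chunk_size:
        return [(0, words)]  # short transcripts: single pass
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append((start, words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Returning the offset alongside each chunk matters because the LLM's word indices are chunk-local and must be shifted back into transcript coordinates.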

The LLM returns a list of cuts with exact word indices:

{
    "cuts": [{
        "start_word_idx": 0,
        "end_word_idx": 5,
        "removed_text": "I think... I think",
        "keeper_preview": "we should do this",
        "confidence": 0.9,
        "pattern_type": "incomplete_to_complete",
        "reason": "Incomplete phrase before complete version"
    }]
}
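Applying such cuts amounts to dropping the indexed word ranges. This sketch assumes `start_word_idx`/`end_word_idx` are inclusive, which is an assumption about the index convention; `apply_cuts` is an invented name.

```python
def apply_cuts(words, cuts):
    """Drop every word whose index falls in a cut range (inclusive)."""
    removed = set()
    for cut in cuts:
        removed.update(range(cut["start_word_idx"], cut["end_word_idx"] + 1))
    return [w for i, w in enumerate(words) if i not in removed]

words = ["So", "the,", "the", "important", "thing"]
cuts = [{"start_word_idx": 1, "end_word_idx": 1}]
print(apply_cuts(words, cuts))  # ['So', 'the', 'important', 'thing']
```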

Pattern Types

| Pattern | Example | What to Cut |
|---|---|---|
| stuttering_restart | "E aí E aí E aí O que..." | All repeated "E aí" except last |
| incomplete_to_complete | "Seguinte, suponha... Seguinte, suponha que tem" | Incomplete version |
| serial_repetition | Same phrase 5 times | All except the last |
| word_correction | "taxa de desocupação... taxa de desemprego" | Wrong word |
| phrase_refinement | Speaker refines phrasing mid-sentence | Earlier version |
| filler_hesitation | "um, so, like, anyway..." | Filler clusters |

Validation Pass

After detection, we run a validation pass with a "judge" LLM that:

  1. Reviews the proposed cuts
  2. Checks if the final text reads naturally
  3. Flags remaining issues (missed false starts, awkward transitions)

If the judge finds problems, a refinement step adjusts the cuts. This catches cases where the first pass was too aggressive or too conservative.

Sensitivity Control

The false_start_sensitivity parameter (0-100) controls how aggressive detection is:

  • 0: Disabled - no false start detection
  • 50: Conservative - only high-confidence patterns
  • 100: Aggressive - flag anything that looks like a restart

Higher sensitivity catches more false starts but may have more false positives.

Profanity Detection

Profanity detection uses a dictionary-based matcher to find words that should be censored in the audio. Unlike silence/false start detection which creates cut edits, profanity creates mute edits - the video keeps playing but audio is silenced or bleeped.

How It Works

  1. The transcript words are matched against a language-specific profanity dictionary
  2. Matching words are flagged with their indices
  3. Word timing from Whisper converts indices to timestamps
  4. Edit records are created with type=PROFANITY and action=MUTE

Supported languages: en, es, pt, fr. Unsupported languages skip profanity censoring with a warning.

{
    "profanity_words": [
        {"word_idx": 42, "word": "damn"},
        {"word_idx": 87, "word": "shit"}
    ]
}
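A minimal sketch of dictionary-based matching, assuming case and punctuation are normalized before lookup (the real matcher lives in backend/src/workers/analysis/profanity/matcher.py; the dictionary contents and function name here are illustrative):

```python
import string

# Illustrative subset, not the real language-specific dictionary.
PROFANITY_EN = {"damn", "shit"}

def match_profanity(words, dictionary=PROFANITY_EN):
    """Flag word indices whose normalized token is in the dictionary."""
    hits = []
    for idx, w in enumerate(words):
        token = w["word"].strip(string.punctuation).lower()
        if token in dictionary:
            hits.append({"word_idx": idx, "word": w["word"]})
    return hits

words = [
    {"word": "Well,", "start": 0.0, "end": 0.3},
    {"word": "damn!", "start": 0.3, "end": 0.6},
]
print(match_profanity(words))  # [{'word_idx': 1, 'word': 'damn!'}]
```

The flagged indices then pick up `start`/`end` timestamps from the same Whisper word records to produce the mute regions.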

Audio Censorship Modes

Users can choose how profanity is handled in exports:

| Mode | Effect | Use Case |
|---|---|---|
| none | No censorship | Adult content platforms |
| mute | Silence during profanity | Professional/subtle |
| bleep | 1kHz tone during profanity | Traditional TV-style |

The audio_censorship setting is stored in AnalysisPreset and passed through to the render pipeline.

Preview vs Export

  • Preview: Audio is muted in the browser player during profanity regions
  • Export: FFmpeg applies silence (volume=0) or bleep tone (1kHz sine wave)
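For the mute case, the export-side filter can be sketched by combining FFmpeg's volume filter with timeline editing (`enable='between(t,...)'`). This is an assumption about how the filter string might be built, not the actual render pipeline code; bleep mode (mixing in a 1 kHz sine) needs a fuller filter graph and is omitted here.

```python
def mute_filter(regions):
    """Build an FFmpeg -af expression that zeroes the volume
    inside each (start_s, end_s) region."""
    clauses = "+".join(
        f"between(t,{start:.3f},{end:.3f})" for start, end in regions
    )
    return f"volume=0:enable='{clauses}'"

print(mute_filter([(1.2, 1.6), (4.0, 4.5)]))
# volume=0:enable='between(t,1.200,1.600)+between(t,4.000,4.500)'
```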

Converting to Edits

After detection, regions are converted to Edit records:

Edit(
    type=EditType.SILENCE,  # or FALSE_START or PROFANITY
    action=EditAction.CUT,  # or MUTE for profanity
    start_ms=region.start_ms,
    end_ms=region.end_ms,
    active=True,
    confidence=region.confidence,
    reason=region.reason,       # Full explanation (for logs/debugging)
    reason_tag=region.reason_tag,  # Short tag for UI (e.g., "word_gap", "serial_repetition")
)

The action field determines behavior:

  • CUT: Remove both video and audio (silence, false starts)
  • MUTE: Keep video playing, silence/bleep audio only (profanity)

The reason field stores the full LLM explanation for debugging, while reason_tag is a short identifier formatted for UI display (snake_case → "Sentence case").

Overlapping edits from both detection methods are merged, preferring FALSE_START type when they overlap (more significant edits).
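The overlap-merge rule can be sketched as below. The text only says FALSE_START is preferred on overlap; unioning the overlapping spans into one edit is an assumption of this sketch, and the edits are simplified to dicts rather than full Edit records.

```python
def merge_edits(edits):
    """Merge overlapping edits by span; if any edit in an overlapping
    group is a FALSE_START, the merged edit keeps that type."""
    edits = sorted(edits, key=lambda e: e["start_ms"])
    merged = []
    for edit in edits:
        if merged and edit["start_ms"] < merged[-1]["end_ms"]:
            prev = merged[-1]
            prev["end_ms"] = max(prev["end_ms"], edit["end_ms"])
            if edit["type"] == "FALSE_START":
                prev["type"] = "FALSE_START"
        else:
            merged.append(dict(edit))
    return merged
```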

Key Files

| Component | Location |
|---|---|
| Silence detection | backend/src/workers/analysis/silence/detection.py |
| False start detection | backend/src/workers/analysis/false_starts/detection/logic.py |
| False start step | backend/src/workers/analysis/false_starts/detection/step.py |
| Validation judge | backend/src/workers/analysis/false_starts/validation/judge.py |
| Profanity detection | backend/src/workers/analysis/profanity/dictionary.py |
| Profanity matcher | backend/src/workers/analysis/profanity/matcher.py |
| Profanity step | backend/src/workers/analysis/profanity/step.py |
| Constants | backend/src/workers/analysis/constants.py |
