
Detection

This page covers how Sapari detects regions to cut from videos. There are two types of detection: silence detection (gaps between words) and false start detection (repeated/abandoned phrases).

Silence Detection

Silence detection finds pauses in speech that can be removed to tighten the video. We use two complementary methods.

Word-Gap Analysis (Primary)

The primary method analyzes gaps between words in the Whisper transcript. This is reliable because Whisper already validated there's no speech in those gaps.

# Whisper gives us word-level timing:
[
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 2.3, "end": 2.8},  # 1.8s gap!
]

If the gap between "Hello" and "world" exceeds the threshold, we mark it as a silence region.

The threshold is controlled by pacing_level (0-100):

| Pacing Level | Threshold | Effect |
|---|---|---|
| 0 | 3000 ms | Keep natural pauses up to 3 seconds |
| 50 | 1650 ms | Balanced |
| 100 | 300 ms | Remove almost all pauses |
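The thresholds in the table fall on a straight line, so the mapping can be sketched as a linear interpolation plus a scan over consecutive word gaps. This is an illustrative sketch, not the actual Sapari implementation; the function names (`gap_threshold_ms`, `find_word_gap_silences`) are invented for the example.

```python
def gap_threshold_ms(pacing_level: int) -> int:
    """Linear interpolation consistent with the table above:
    0 -> 3000 ms, 50 -> 1650 ms, 100 -> 300 ms."""
    return 3000 - 27 * pacing_level

def find_word_gap_silences(words, pacing_level=50):
    """Return (start_s, end_s) gaps between consecutive words
    that meet or exceed the pacing threshold."""
    threshold_s = gap_threshold_ms(pacing_level) / 1000
    silences = []
    for prev, curr in zip(words, words[1:]):
        gap = curr["start"] - prev["end"]
        if gap >= threshold_s:
            silences.append((prev["end"], curr["start"]))
    return silences

words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 2.3, "end": 2.8},
]
print(find_word_gap_silences(words, pacing_level=50))  # [(0.5, 2.3)]
```

At pacing level 50 the 1.8 s gap exceeds the 1650 ms threshold and is flagged; at level 0 it falls under the 3000 ms threshold and is kept.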

We also detect silence at the start and end of the video, with configurable padding so cuts don't feel too abrupt.

Audio Waveform Analysis (Secondary)

For silences that Whisper might miss (long pauses with background noise), we use FFmpeg's silencedetect filter:

ffmpeg -i audio.mp3 -af "silencedetect=n=-30dB:d=1.5" -f null -

The threshold is dynamic: we measure the audio's mean volume and set the silence threshold relative to it. This handles videos with different recording levels.

Audio-detected silences are validated against the waveform to filter out false positives (quiet speech marked as silence). We extract RMS amplitudes and reject regions with speech peaks.
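The silencedetect filter reports regions on stderr as `silence_start:` / `silence_end:` lines. A minimal sketch of turning that output into `(start, end)` regions (not the actual Sapari parser) could look like:

```python
import re

# Matches the "silence_start: 3.456" / "silence_end: 5.1" lines that
# FFmpeg's silencedetect filter prints to stderr.
START_RE = re.compile(r"silence_start:\s*([\d.]+)")
END_RE = re.compile(r"silence_end:\s*([\d.]+)")

def parse_silencedetect(stderr: str):
    """Pair up silence_start/silence_end timestamps into regions."""
    starts = [float(m) for m in START_RE.findall(stderr)]
    ends = [float(m) for m in END_RE.findall(stderr)]
    return list(zip(starts, ends))

sample = """\
[silencedetect @ 0x55] silence_start: 3.456
[silencedetect @ 0x55] silence_end: 5.1 | silence_duration: 1.644
"""
print(parse_silencedetect(sample))  # [(3.456, 5.1)]
```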

Merging Results

Word-gap silences take priority since they're more reliable. Audio silences are only added if they:

  1. Don't overlap with existing word-gap silences
  2. Don't overlap with any transcript words

The final silences are sorted by start time and returned as SilenceRegion objects with confidence scores.
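The merge rule above can be sketched as a simple interval check. This is an illustrative sketch assuming half-open `(start, end)` intervals; `merge_silences` and `overlaps` are invented names, and the real code returns `SilenceRegion` objects rather than tuples.

```python
def overlaps(a, b):
    """True if intervals (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def merge_silences(word_gap, audio, words):
    """Word-gap silences always survive; audio silences are added only
    if they overlap neither word-gap silences nor transcript words."""
    word_spans = [(w["start"], w["end"]) for w in words]
    merged = list(word_gap)
    for region in audio:
        if any(overlaps(region, g) for g in word_gap):
            continue
        if any(overlaps(region, s) for s in word_spans):
            continue
        merged.append(region)
    return sorted(merged)
```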

False Start Detection

False start detection uses an LLM to find patterns where the speaker started saying something, stopped, and tried again.

What We're Looking For

Common patterns:

"I think... I think we should" → Cut first "I think..."
"So the, the important thing" → Cut first "the,"
"This is X. This is X. This is X. And here's why" → Cut first two "This is X."

The key insight: keep the last instance that flows into new content, cut everything before it.

Chunked Detection

For long transcripts, we split words into overlapping chunks (~800 words each) and process them independently. Short transcripts are processed in a single pass. The prompt provides:

  1. Plain text transcript (for readability)
  2. Same transcript with [N] word indices (for precise cuts)
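Overlapping chunking can be sketched as below. The ~800-word chunk size comes from the text above; the 50-word overlap and the function name are assumptions for the example.

```python
CHUNK_SIZE = 800   # from the docs: ~800 words per chunk
OVERLAP = 50       # assumed overlap between adjacent chunks

def chunk_words(words, chunk_size=CHUNK_SIZE, overlap=OVERLAP):
    """Return (offset, chunk) pairs; the offset lets per-chunk word
    indices be mapped back to global transcript indices."""
    if len(words) <= chunk_size:
        return [(0, words)]  # short transcripts: single pass
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append((start, words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Returning the offset alongside each chunk matters because the LLM's word indices are chunk-local and must be shifted back into transcript coordinates.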

The LLM returns a list of cuts with exact word indices:

{
    "cuts": [{
        "start_word_idx": 0,
        "end_word_idx": 5,
        "removed_text": "I think... I think",
        "keeper_preview": "we should do this",
        "confidence": 0.9,
        "pattern_type": "incomplete_to_complete",
        "reason": "Incomplete phrase before complete version"
    }]
}
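Applying such cuts amounts to dropping the indexed word ranges. This sketch assumes `start_word_idx`/`end_word_idx` are inclusive, which is an assumption about the index convention; `apply_cuts` is an invented name.

```python
def apply_cuts(words, cuts):
    """Drop every word whose index falls in a cut range (inclusive)."""
    removed = set()
    for cut in cuts:
        removed.update(range(cut["start_word_idx"], cut["end_word_idx"] + 1))
    return [w for i, w in enumerate(words) if i not in removed]

words = ["So", "the,", "the", "important", "thing"]
cuts = [{"start_word_idx": 1, "end_word_idx": 1}]
print(apply_cuts(words, cuts))  # ['So', 'the', 'important', 'thing']
```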

Pattern Types

| Pattern | Example | What to Cut |
|---|---|---|
| stuttering_restart | "E aí E aí E aí O que..." | All repeated "E aí" except last |
| incomplete_to_complete | "Seguinte, suponha... Seguinte, suponha que tem" | Incomplete version |
| serial_repetition | Same phrase 5 times | All except the last |
| word_correction | "taxa de desocupação... taxa de desemprego" | Wrong word |
| phrase_refinement | Speaker refines phrasing mid-sentence | Earlier version |
| filler_hesitation | "um, so, like, anyway..." | Filler clusters |

Validation Pass

After detection, we run a validation pass with a "judge" LLM that:

  1. Reviews the proposed cuts
  2. Checks if the final text reads naturally
  3. Flags remaining issues (missed false starts, awkward transitions)

If the judge finds problems, a refinement step adjusts the cuts. This catches cases where the first pass was too aggressive or too conservative.

Sensitivity Control

The false_start_sensitivity parameter (0-100) controls how aggressive detection is:

  • 0: Disabled - no false start detection
  • 50: Conservative - only high-confidence patterns
  • 100: Aggressive - flag anything that looks like a restart

Higher sensitivity catches more false starts but may have more false positives.

Profanity Detection

Profanity detection uses a dictionary-based matcher to find words that should be censored in the audio. Unlike silence/false start detection which creates cut edits, profanity creates mute edits - the video keeps playing but audio is silenced or bleeped.

How It Works

  1. The transcript words are matched against a language-specific profanity dictionary
  2. Matching words are flagged with their indices
  3. Word timing from Whisper converts indices to timestamps
  4. Edit records are created with type=PROFANITY and action=MUTE

Supported languages: en, es, pt, fr. Unsupported languages skip profanity censoring with a warning.

{
    "profanity_words": [
        {"word_idx": 42, "word": "damn"},
        {"word_idx": 87, "word": "shit"}
    ]
}
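A minimal sketch of dictionary-based matching, assuming case and punctuation are normalized before lookup (the real matcher lives in backend/src/workers/analysis/profanity/matcher.py; the dictionary contents and function name here are illustrative):

```python
import string

# Illustrative subset, not the real language-specific dictionary.
PROFANITY_EN = {"damn", "shit"}

def match_profanity(words, dictionary=PROFANITY_EN):
    """Flag word indices whose normalized token is in the dictionary."""
    hits = []
    for idx, w in enumerate(words):
        token = w["word"].strip(string.punctuation).lower()
        if token in dictionary:
            hits.append({"word_idx": idx, "word": w["word"]})
    return hits

words = [
    {"word": "Well,", "start": 0.0, "end": 0.3},
    {"word": "damn!", "start": 0.3, "end": 0.6},
]
print(match_profanity(words))  # [{'word_idx': 1, 'word': 'damn!'}]
```

The flagged indices then pick up `start`/`end` timestamps from the same Whisper word records to produce the mute regions.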

Audio Censorship Modes

Users can choose how profanity is handled in exports:

| Mode | Effect | Use Case |
|---|---|---|
| none | No censorship | Adult content platforms |
| mute | Silence during profanity | Professional/subtle |
| bleep | 1kHz tone during profanity | Traditional TV-style |

The audio_censorship setting is stored in AnalysisPreset and passed through to the render pipeline.

Preview vs Export

  • Preview: Audio is muted in the browser player during profanity regions
  • Export: FFmpeg applies silence (volume=0) or bleep tone (1kHz sine wave)
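For the mute case, the export-side filter can be sketched by combining FFmpeg's volume filter with timeline editing (`enable='between(t,...)'`). This is an assumption about how the filter string might be built, not the actual render pipeline code; bleep mode (mixing in a 1 kHz sine) needs a fuller filter graph and is omitted here.

```python
def mute_filter(regions):
    """Build an FFmpeg -af expression that zeroes the volume
    inside each (start_s, end_s) region."""
    clauses = "+".join(
        f"between(t,{start:.3f},{end:.3f})" for start, end in regions
    )
    return f"volume=0:enable='{clauses}'"

print(mute_filter([(1.2, 1.6), (4.0, 4.5)]))
# volume=0:enable='between(t,1.200,1.600)+between(t,4.000,4.500)'
```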

Converting to Edits

After detection, regions are converted to Edit records:

Edit(
    type=EditType.SILENCE,  # or FALSE_START or PROFANITY
    action=EditAction.CUT,  # or MUTE for profanity
    start_ms=region.start_ms,
    end_ms=region.end_ms,
    active=True,
    confidence=region.confidence,
    reason=region.reason,       # Full explanation (for logs/debugging)
    reason_tag=region.reason_tag,  # Short tag for UI (e.g., "word_gap", "serial_repetition")
)

The action field determines behavior:

  • CUT: Remove both video and audio (silence, false starts)
  • MUTE: Keep video playing, silence/bleep audio only (profanity)

The reason field stores the full LLM explanation for debugging, while reason_tag is a short identifier formatted for UI display (snake_case → "Sentence case").

Overlapping edits from both detection methods are merged, preferring FALSE_START type when they overlap (more significant edits).
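The overlap-merge rule can be sketched as below. The text only says FALSE_START is preferred on overlap; unioning the overlapping spans into one edit is an assumption of this sketch, and the edits are simplified to dicts rather than full Edit records.

```python
def merge_edits(edits):
    """Merge overlapping edits by span; if any edit in an overlapping
    group is a FALSE_START, the merged edit keeps that type."""
    edits = sorted(edits, key=lambda e: e["start_ms"])
    merged = []
    for edit in edits:
        if merged and edit["start_ms"] < merged[-1]["end_ms"]:
            prev = merged[-1]
            prev["end_ms"] = max(prev["end_ms"], edit["end_ms"])
            if edit["type"] == "FALSE_START":
                prev["type"] = "FALSE_START"
        else:
            merged.append(dict(edit))
    return merged
```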

Key Files

| Component | Location |
|---|---|
| Silence detection | backend/src/workers/analysis/silence/detection.py |
| False start detection | backend/src/workers/analysis/false_starts/detection/logic.py |
| False start step | backend/src/workers/analysis/false_starts/detection/step.py |
| Validation judge | backend/src/workers/analysis/false_starts/validation/judge.py |
| Profanity detection | backend/src/workers/analysis/profanity/dictionary.py |
| Profanity matcher | backend/src/workers/analysis/profanity/matcher.py |
| Profanity step | backend/src/workers/analysis/profanity/step.py |
| Constants | backend/src/workers/analysis/constants.py |
