Vocal Burst Locator

A Whisper-based model that detects and localizes vocal bursts (laughs, coughs, sneezes, sighs, gasps, cries, screams, etc.) in audio, returning precise start/end timestamps for each event.

Model Description

This model performs binary frame-level segmentation on audio: for each 20ms frame in a 30-second audio clip, it predicts whether a vocal burst is occurring. Post-processing then groups these frame-level predictions into discrete events with timestamps and confidence scores.
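
The grouping step can be sketched roughly as follows. This is an illustration only, not the code in inference.py: the function name `frames_to_segments` and the use of the mean frame probability as the confidence score are assumptions.

```python
FRAME_SEC = 0.02  # 20 ms per frame, so 1500 frames span 30 s


def frames_to_segments(probs, threshold=0.65):
    """Group consecutive above-threshold frames into timestamped segments."""
    probs = list(probs)
    segments, run_start = [], None
    # The trailing 0.0 closes a run that reaches the last frame (assumes threshold > 0)
    for i, p in enumerate(probs + [0.0]):
        if p >= threshold and run_start is None:
            run_start = i
        elif p < threshold and run_start is not None:
            run = probs[run_start:i]
            segments.append({
                "start": round(run_start * FRAME_SEC, 2),
                "end": round(i * FRAME_SEC, 2),
                "confidence": sum(run) / len(run),  # assumed: mean frame probability
            })
            run_start = None
    return segments


# Toy input: 10 quiet frames, 50 confident frames, 10 quiet frames
toy_probs = [0.1] * 10 + [0.9] * 50 + [0.1] * 10
print(frames_to_segments(toy_probs))
```

With the toy input, the single run of confident frames becomes one segment from 0.2s to 1.2s.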

Architecture

Audio (16kHz, 30s) → Whisper-small Encoder (LoRA rank-8 merged) → 1500 frame embeddings
    → Linear(768→384) + GELU + Dropout
    → Conv1d(384, kernel=7) + GELU + Dropout   (temporal smoothing)
    → Linear(384→1) → sigmoid → 1500 probabilities
    → Post-processing → [(start, end, confidence), ...]

The model uses OpenAI's Whisper-small encoder as the audio feature backbone. During training, the encoder was adapted using LoRA (rank 8, alpha 16) on the q_proj and v_proj attention matrices. The LoRA weights have been merged into the base weights, so no adapter library is needed at inference time.
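
Why merging removes the adapter dependency can be shown with the standard LoRA arithmetic, W_merged = W + (alpha / rank) · B · A. The sketch below uses random NumPy matrices as stand-ins for a real projection weight; the variable names are illustrative and this is not the script that produced model.pt.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, alpha = 768, 8, 16   # Whisper-small hidden size, LoRA settings from above
scale = alpha / rank          # standard LoRA scaling factor

W = rng.standard_normal((d, d)) * 0.02     # stand-in for a frozen q_proj / v_proj weight
A = rng.standard_normal((rank, d)) * 0.02  # LoRA down-projection
B = rng.standard_normal((d, rank)) * 0.02  # LoRA up-projection
x = rng.standard_normal((4, d))            # a few input vectors

# Adapter form: base output plus the scaled low-rank update
y_adapter = x @ W.T + scale * (x @ A.T) @ B.T

# Merged form: fold the update into the weight once, then use a plain linear layer
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))
```

Both forms give the same outputs, so after merging only the single matrix `W_merged` has to be stored and loaded.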

Performance

Evaluated on 300 held-out synthetic soundscapes with the recommended inference settings (threshold=0.65, merge_gap=0.3s, min_dur=0.5s):

Metric                      Value
Event F1                    0.752
Event Precision             0.897
Event Recall                0.781
Binary Detection Accuracy   0.810
Frame Accuracy (all)        0.928

The model catches ~78% of vocal burst events with ~90% precision (9 out of 10 predicted events are real).

Quick Start

Installation

pip install torch transformers soundfile librosa huggingface_hub

Python API

from inference import load_model, detect_vocal_bursts

# Load model (auto-downloads weights from HuggingFace)
model, fe, device = load_model("cuda")  # or "cpu"

# Detect vocal bursts
events = detect_vocal_bursts("audio.mp3", model=model, fe=fe, device=device)

for ev in events:
    print(f"{ev['start']:.2f}s - {ev['end']:.2f}s  "
          f"(confidence: {ev['confidence']:.2f})")

Command Line

# Basic usage
python inference.py audio.mp3

# With custom settings
python inference.py audio.wav --threshold 0.7 --min-dur 0.3 --device cuda

# JSON output (for piping to other tools)
python inference.py audio.mp3 --json

# Use a local checkpoint instead of auto-downloading
python inference.py audio.mp3 --checkpoint ./model.pt

Expected Output

Detected 3 vocal burst(s) in audio.mp3:

  1. 2.14s - 3.82s  (duration: 1.68s, confidence: 0.89)
  2. 8.50s - 9.12s  (duration: 0.62s, confidence: 0.74)
  3. 15.30s - 16.94s  (duration: 1.64s, confidence: 0.92)

JSON output (--json):

{
  "file": "audio.mp3",
  "events": [
    {"start": 2.14, "end": 3.82, "confidence": 0.89, "duration": 1.68},
    {"start": 8.5, "end": 9.12, "confidence": 0.74, "duration": 0.62},
    {"start": 15.3, "end": 16.94, "confidence": 0.92, "duration": 1.64}
  ]
}
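
A downstream tool might consume this JSON along the following lines. The sample string is copied from the output above; the `summarize` helper is a hypothetical name introduced for this sketch and is not part of inference.py.

```python
import json

# Sample copied from the JSON output above; in practice this would be the
# stdout of `python inference.py audio.mp3 --json`.
raw = """
{
  "file": "audio.mp3",
  "events": [
    {"start": 2.14, "end": 3.82, "confidence": 0.89, "duration": 1.68},
    {"start": 8.5, "end": 9.12, "confidence": 0.74, "duration": 0.62},
    {"start": 15.3, "end": 16.94, "confidence": 0.92, "duration": 1.64}
  ]
}
"""


def summarize(result):
    """Return event count, total burst time, and the most confident event."""
    events = result["events"]
    total = sum(ev["end"] - ev["start"] for ev in events)
    best = max(events, key=lambda ev: ev["confidence"]) if events else None
    return {"count": len(events), "total_sec": round(total, 2), "best": best}


summary = summarize(json.loads(raw))
print(summary["count"], summary["total_sec"], summary["best"]["start"])
```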

Inference Parameters

Parameter   Default   Description
threshold   0.65      Detection confidence threshold (0-1). Higher = fewer false positives, lower = fewer missed events.
merge_gap   0.3       Merge predicted segments closer than this (seconds). Prevents a single event from being split into fragments.
min_dur     0.5       Discard predicted events shorter than this (seconds). Filters out spurious short false positives.
device      auto      "cpu", "cuda", or "cuda:0" etc. Auto-detects GPU if available.

Understanding Precision, Recall, and the Threshold Trade-off

Imagine the model is a security guard watching for vocal bursts. It has to make a decision for every moment of audio: "Is this a vocal burst, or not?"

There are four possible outcomes:

                        REALITY
                   Vocal Burst    Not a VB
                 ┌─────────────┬─────────────┐
  MODEL    Yes   │ True Pos ✓  │ False Pos ✗ │  ← "False alarm"
  SAYS:          │ (correct!)  │ (oops)      │
                 ├─────────────┼─────────────┤
           No    │ False Neg ✗ │ True Neg ✓  │  ← "Missed it"
                 │ (missed!)   │ (correct!)  │
                 └─────────────┴─────────────┘

  • Precision = Of everything the model flagged, how many were real? TP / (TP + FP)

    • High precision → when the model says "vocal burst!", it's almost always right
    • Low precision → lots of false alarms (the model is trigger-happy)
  • Recall = Of all real vocal bursts, how many did the model catch? TP / (TP + FN)

    • High recall → the model rarely misses a real event
    • Low recall → the model is too conservative, missing real events
  • F1 Score = The harmonic mean of precision and recall, 2PR / (P + R), which balances both into one number. An F1 of 0.75 indicates a reasonable balance between catching events and not crying wolf.
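
As a worked example of these three formulas, the sketch below computes them from raw counts. The counts are made up for illustration and are not taken from the evaluation above; the function name is likewise an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, guarding against 0/0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Made-up counts: 80 correct detections, 20 false alarms, 40 missed events
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```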

How Each Parameter Affects Results

threshold (default: 0.65): The confidence cutoff

The model outputs a confidence score (0 to 1) for every 20ms frame. The threshold decides: "How confident must the model be before we call it a vocal burst?"

threshold = 0.3 (low)      →  Model flags almost everything
                              ✓ High recall (catches most VBs)
                              ✗ Low precision (many false alarms)
                              Think: paranoid security guard

threshold = 0.65 (default) →  Balanced
                              ✓ Good precision AND recall
                              = Our recommended sweet spot

threshold = 0.9 (high)     →  Model only flags when very sure
                              ✓ High precision (almost no false alarms)
                              ✗ Low recall (misses quieter/ambiguous VBs)
                              Think: lazy security guard

Real example from our validation set:

Threshold   Precision   Recall   F1     False Positives   Missed Events
0.40        0.62        0.89     0.73   Many              Few
0.65        0.90        0.78     0.75   Few               Some
0.85        0.95        0.55     0.70   Very few          Many

min_dur (default: 0.5s): Minimum event duration

After grouping confident frames into events, discard any event shorter than min_dur.

min_dur = 0.1s  →  Keeps very short detections
                   ✗ More false positives (brief noise spikes get flagged)
                   ✓ Can detect short coughs/gasps

min_dur = 0.5s  →  Filters out most noise spikes
                   ✓ Fewer false positives
                   ✗ Might miss very brief vocal bursts (<0.5s)

min_dur = 1.0s  →  Only keeps long events
                   ✓ Very few false positives
                   ✗ Misses short coughs, gasps, sneezes

Why this works: Real vocal bursts (laughs, coughs, sneezes) typically last 0.5–3 seconds. Random noise or brief audio artifacts are usually <0.3s. By requiring events to be at least 0.5s, we eliminate most false positives with minimal impact on real events.
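
The duration filter itself is a one-line comparison. The sketch below applies it to the three events from the example output earlier in this README; the function name is an assumption, not the one used in inference.py.

```python
def drop_short_events(events, min_dur=0.5):
    """Keep only events lasting at least min_dur seconds."""
    return [ev for ev in events if ev["end"] - ev["start"] >= min_dur]


# Events from the example output (durations 1.68s, 0.62s, 1.64s)
events = [
    {"start": 2.14, "end": 3.82},
    {"start": 8.50, "end": 9.12},
    {"start": 15.30, "end": 16.94},
]
print(len(drop_short_events(events, min_dur=0.5)))  # 3: all events survive
print(len(drop_short_events(events, min_dur=1.0)))  # 2: the 0.62s event is dropped
```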

merge_gap (default: 0.3s): Gap tolerance for merging

If two detected segments are separated by less than merge_gap, merge them into one event.

merge_gap = 0.0s  →  No merging. A laugh with a brief pause becomes 2 events.
                     Result: Over-counting (more events than expected)

merge_gap = 0.3s  →  Small pauses within a single laugh/cough stay merged.
                     Result: Natural event boundaries

merge_gap = 1.0s  →  Even 1-second gaps get bridged.
                     Result: Separate nearby events might merge into one big event

Example: Someone laughs for 2s, pauses 0.2s to breathe, laughs again for 1s.

  • merge_gap=0.0 → reports 2 events (fragmented)
  • merge_gap=0.3 → reports 1 event of 3.2s (correct: it's one laugh)
  • merge_gap=1.5 → might merge two separate laughs into one (over-merging)
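
The laugh example can be reproduced with a small merging routine. This is a sketch under the rule stated above (merge when the gap is strictly less than merge_gap); the function name and the (start, end) tuple representation are assumptions.

```python
def merge_segments(segments, merge_gap=0.3):
    """Merge (start, end) segments whose gap is smaller than merge_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < merge_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend the previous event
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]


# A 2s laugh, a 0.2s breath, then 1s more laughter
laugh = [(5.0, 7.0), (7.2, 8.2)]
print(merge_segments(laugh, merge_gap=0.0))  # two fragments
print(merge_segments(laugh, merge_gap=0.3))  # one event spanning 5.0s to 8.2s
```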

Parameter Recipes

Use Case                             threshold   min_dur   merge_gap   What changes
Balanced (default)                   0.65        0.5       0.3         Good all-around
High precision (no false alarms)     0.80        0.7       0.3         ↑ precision, ↓ recall
High recall (catch everything)       0.45        0.2       0.5         ↑ recall, ↓ precision
Noisy audio (music, crowds)          0.75        0.6       0.3         Reduces noise-triggered FPs
Short events (coughs, gasps)         0.60        0.2       0.2         Catches brief events
Long events only (extended laughs)   0.65        1.0       0.5         Ignores anything <1s
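
One way to keep these recipes in code is a small preset table. The values are copied from the table above; the preset names and the `get_settings` helper are made up for this sketch, and passing the result to `detect_vocal_bursts` as keyword arguments is an assumption about its signature that should be checked against inference.py.

```python
# Values copied from the recipe table; preset names are illustrative.
PRESETS = {
    "balanced":       dict(threshold=0.65, min_dur=0.5, merge_gap=0.3),
    "high_precision": dict(threshold=0.80, min_dur=0.7, merge_gap=0.3),
    "high_recall":    dict(threshold=0.45, min_dur=0.2, merge_gap=0.5),
    "noisy_audio":    dict(threshold=0.75, min_dur=0.6, merge_gap=0.3),
    "short_events":   dict(threshold=0.60, min_dur=0.2, merge_gap=0.2),
    "long_events":    dict(threshold=0.65, min_dur=1.0, merge_gap=0.5),
}


def get_settings(preset="balanced", **overrides):
    """Return a copy of a preset with optional per-call overrides."""
    settings = dict(PRESETS[preset])
    settings.update(overrides)
    return settings


settings = get_settings("high_recall", min_dur=0.3)
print(settings)
# Assumed usage, if detect_vocal_bursts accepts these keyword arguments:
# events = detect_vocal_bursts("audio.mp3", model=model, fe=fe, device=device, **settings)
```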

The Precision-Recall Trade-off (Why You Can't Have Both at 100%)

This is a fundamental trade-off in classification: for a fixed model, raising the threshold makes it more cautious (↑ precision) but generally causes it to miss more real events (↓ recall), and lowering the threshold does the reverse. In practice you can't eliminate false positives without also losing some true positives.

             ← More conservative        More aggressive →

  Precision: ████████████████░░░░  (goes DOWN as you lower threshold)
  Recall:    ░░░░████████████████  (goes UP as you lower threshold)
                       ↑
               Sweet spot (F1 max)

Choose your trade-off based on your application:

  • Automatic subtitling: Prefer high precision (don't annotate noise as laughter)
  • Safety monitoring: Prefer high recall (don't miss a scream or cry for help)
  • Research/counting: Use balanced F1 (minimize both types of errors)
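
Choosing an operating point for an application can be framed as a constrained lookup. The sketch below uses the three validation rows from the threshold table earlier in this README; the function name and the idea of expressing each application as a precision or recall floor are assumptions for illustration.

```python
# (threshold, precision, recall, F1) rows from the validation table
OPERATING_POINTS = [
    (0.40, 0.62, 0.89, 0.73),
    (0.65, 0.90, 0.78, 0.75),
    (0.85, 0.95, 0.55, 0.70),
]


def pick_threshold(points, min_precision=0.0, min_recall=0.0):
    """Return the highest-F1 threshold that meets both floors, or None."""
    ok = [p for p in points if p[1] >= min_precision and p[2] >= min_recall]
    return max(ok, key=lambda p: p[3])[0] if ok else None


print(pick_threshold(OPERATING_POINTS))                      # 0.65 (balanced)
print(pick_threshold(OPERATING_POINTS, min_recall=0.85))     # 0.4  (safety monitoring)
print(pick_threshold(OPERATING_POINTS, min_precision=0.93))  # 0.85 (subtitling)
```

With only three measured points this is coarse; a finer threshold sweep on your own validation data would give a better-resolved choice.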

Files

File                  Size     Description
model.pt              972 MB   Full model weights (Whisper encoder with merged LoRA + segmentation head)
head_only.pt          5.3 MB   Segmentation head weights only (use with your own Whisper-small encoder)
inference.py          -        Standalone inference script with CLI and Python API
train.py              -        Full training script (supports frozen/LoRA/fine-tuning modes)
generate_dataset.py   -        Synthetic training data generator
download_sources.py   -        Downloads source audio from HuggingFace datasets
config.json           -        Model configuration and training hyperparameters

Using head_only.pt

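If you already have a Whisper-small encoder loaded, you can attach the head to it directly. The head's first layer expects 768-dimensional encoder outputs, so Whisper variants with a different hidden size will not fit without retraining: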

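import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Load your own Whisper encoder and its matching feature extractor
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper = WhisperModel.from_pretrained("openai/whisper-small").eval()

# Placeholder input: replace with your own mono 16 kHz waveform (float32)
audio = np.zeros(16000 * 30, dtype=np.float32)
mel_features = fe(audio, sampling_rate=16000, return_tensors="pt").input_features  # [B, 80, 3000]

with torch.no_grad():
    encoder_out = whisper.encoder(input_features=mel_features).last_hidden_state  # [B, 1500, 768]

# Load just the segmentation head
head_sd = torch.load("head_only.pt", map_location="cpu")
# head_sd contains: proj.0.weight, proj.0.bias, temporal.0.weight, temporal.0.bias, out.weight, out.bias
# Apply: proj → permute → temporal → permute → out → squeeze → sigmoid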
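
The "Apply" comment above could be realized with a module like the one below. It is a sketch reconstructed from the architecture diagram and the state-dict key names, not the class shipped in inference.py: the class name, the dropout rate, the 384 output channels of the convolution, and `padding=kernel // 2` (chosen so all 1500 frames are preserved) are assumptions. Random inputs stand in for `encoder_out` so the block runs without downloading weights.

```python
import torch
import torch.nn as nn


class SegmentationHead(nn.Module):
    """Sketch of the head; attribute names mirror the keys in head_only.pt."""

    def __init__(self, d_in=768, d_hidden=384, kernel=7, dropout=0.1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.GELU(), nn.Dropout(dropout))
        # Assumed: same-length padding so the frame count is unchanged
        self.temporal = nn.Sequential(
            nn.Conv1d(d_hidden, d_hidden, kernel, padding=kernel // 2),
            nn.GELU(), nn.Dropout(dropout))
        self.out = nn.Linear(d_hidden, 1)

    def forward(self, x):                      # x: [B, T, 768]
        x = self.proj(x)                       # [B, T, 384]
        x = self.temporal(x.permute(0, 2, 1))  # Conv1d expects [B, C, T]
        x = self.out(x.permute(0, 2, 1))       # [B, T, 1]
        return torch.sigmoid(x.squeeze(-1))    # [B, T] probabilities


head = SegmentationHead().eval()
# With real weights (assumes the key names listed above match this layout):
# head.load_state_dict(torch.load("head_only.pt", map_location="cpu"))
with torch.no_grad():
    probs = head(torch.randn(2, 1500, 768))  # random stand-in for encoder_out
print(probs.shape)  # torch.Size([2, 1500])
```

If `load_state_dict` reports missing or unexpected keys or a shape mismatch, the real head differs from this sketch and config.json or inference.py should be treated as the reference.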

Training

Full Pipeline

# 1. Download source audio (~15K vocal bursts, ~13K backgrounds)
python download_sources.py

# 2. Generate synthetic soundscapes (~33K samples)
python generate_dataset.py

# 3. Train with LoRA (best configuration)
CUDA_VISIBLE_DEVICES=0 \
  FREEZE_ENCODER=1 LORA_RANK=8 LORA_ALPHA=16 \
  POS_WEIGHT=2 DET_THRESHOLD=0.65 POST_MERGE_GAP=0.3 POST_MIN_DUR=0.5 \
  EPOCHS=15 LR=5e-4 ENCODER_LR=2e-4 \
  python train.py

Training Configuration

The training script is controlled entirely via environment variables:

Variable         Default      Description
FREEZE_ENCODER   0            Set to 1 to freeze Whisper encoder (required for LoRA)
LORA_RANK        0            LoRA rank (0=disabled, 8=recommended)
LORA_ALPHA       0            LoRA alpha (0=auto: rank×2)
POS_WEIGHT       4.0          BCE positive class weight (2.0 recommended for precision)
DET_THRESHOLD    0.5          Detection threshold for eval metrics
POST_MERGE_GAP   0.5          Post-processing merge gap (seconds)
POST_MIN_DUR     0.3          Post-processing min duration (seconds)
LR               2e-4         Head learning rate
ENCODER_LR       0            Encoder/LoRA learning rate (0=same as LR)
EPOCHS           6            Training epochs
MAX_BSZ          0            Max batch size cap (0=unlimited, auto-probed)
INIT_WEIGHTS     -            Path to checkpoint for weight initialization
RESUME_MODE      none         Resume training: none, latest, or best
DATA_DIR         vb_dataset   Path to training data
OUT_DIR          vb_output    Output directory for checkpoints and logs
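
The "0 = auto" conventions in the table can be illustrated with a small reader for a few of these variables. This is a sketch of the documented defaults only, not the parsing code in train.py; the function name and the returned dictionary keys are assumptions.

```python
import os


def read_config(env=os.environ):
    """Read a few training settings, applying the documented 0 = auto rules."""
    rank = int(env.get("LORA_RANK", "0"))
    alpha = float(env.get("LORA_ALPHA", "0"))
    lr = float(env.get("LR", "2e-4"))
    encoder_lr = float(env.get("ENCODER_LR", "0"))
    return {
        "freeze_encoder": env.get("FREEZE_ENCODER", "0") == "1",
        "lora_rank": rank,
        "lora_alpha": alpha if alpha > 0 else rank * 2,      # 0 = auto: rank×2
        "lr": lr,
        "encoder_lr": encoder_lr if encoder_lr > 0 else lr,  # 0 = same as LR
        "epochs": int(env.get("EPOCHS", "6")),
    }


# A plain dict stands in for the process environment here
cfg = read_config({"FREEZE_ENCODER": "1", "LORA_RANK": "8", "LR": "5e-4"})
print(cfg)
```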

Data Generation

The synthetic dataset generator creates audio soundscapes by mixing:

  • Vocal burst sources: ~15,680 clips from HuggingFace (laughs, coughs, sneezes, etc.)
  • Background sources: Music (5,000), AudioSet SFX (5,000), AudioSnippets (3,000)
  • Parameters: Random background type, 0-5 VBs per clip, varied SNR, up to 30s duration
  • Split: 50% positive (with VBs) / 50% negative (background only)
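
Mixing a burst into a background at a chosen SNR comes down to one gain computation. The sketch below shows that arithmetic on toy sine waves in pure Python; it is not the code in generate_dataset.py, and the function name, the toy signals, and the whole-clip power definition of SNR are assumptions.

```python
import math


def mix_at_snr(burst, background, snr_db):
    """Scale `burst` so its power is snr_db above `background`, then add them."""
    def power(x):
        return sum(s * s for s in x) / len(x)

    # Solve 10*log10(gain^2 * P_burst / P_background) = snr_db for gain
    gain = math.sqrt(power(background) * 10 ** (snr_db / 10) / power(burst))
    scaled = [gain * s for s in burst]
    mixed = [b + s for b, s in zip(background, scaled)]
    return mixed, scaled


# Toy signals: a 440 Hz "burst" over a quieter 100 Hz "background", 0.1s at 16 kHz
sr = 16000
burst = [math.sin(2 * math.pi * 440 * n / sr) for n in range(1600)]
background = [0.1 * math.sin(2 * math.pi * 100 * n / sr) for n in range(1600)]
mixed, scaled = mix_at_snr(burst, background, snr_db=5.0)
print(len(mixed))
```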

Each sample produces an .mp3 audio file and a .json metadata file:

{
    "events": [
        {"start_time": 3.21, "end_time": 4.85},
        {"start_time": 12.50, "end_time": 13.10}
    ],
    "duration_sec": 24.5,
    "bg_type": "music",
    "n_vocal_bursts": 2
}
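
For frame-level training, metadata like this has to be turned into one label per 20ms frame. The sketch below shows one plausible conversion, marking every frame that overlaps an event; whether train.py rounds boundaries in exactly this way is an assumption, as is the function name.

```python
FRAME_MS = 20    # 20 ms frames
N_FRAMES = 1500  # 30 s clip


def events_to_frame_labels(meta, n_frames=N_FRAMES, frame_ms=FRAME_MS):
    """Return a 0/1 list marking frames that overlap any annotated event."""
    labels = [0] * n_frames
    for ev in meta["events"]:
        # Work in integer milliseconds to avoid floating-point boundary errors
        start_ms = round(ev["start_time"] * 1000)
        end_ms = round(ev["end_time"] * 1000)
        first = start_ms // frame_ms
        last = min(n_frames, -(-end_ms // frame_ms))  # ceiling division
        for i in range(first, last):
            labels[i] = 1
    return labels


meta = {
    "events": [
        {"start_time": 3.21, "end_time": 4.85},
        {"start_time": 12.50, "end_time": 13.10},
    ],
    "duration_sec": 24.5,
}
labels = events_to_frame_labels(meta)
print(sum(labels))  # 113 positive frames: 83 for the first event, 30 for the second
```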

Experiment Results

We compared a frozen encoder against LoRA ranks 2, 4, and 8, all evaluated with the optimized post-processing (threshold=0.65, merge_gap=0.3s, min_dur=0.5s) and pos_weight=2:

Model            Trainable Params   Event F1   Precision   Recall   Binary Det
Frozen encoder   295K (0.12%)       0.589      0.786       0.645    0.733
LoRA rank-2      1.55M (0.64%)      0.734      0.886       0.768    0.803
LoRA rank-4      1.77M (0.73%)      0.744      0.878       0.794    0.807
LoRA rank-8      2.21M (0.91%)      0.752      0.897       0.781    0.810

Key findings:

  • Raising detection threshold from 0.5→0.65 and tightening post-processing doubled F1 with zero retraining
  • LoRA rank-8 provided 3.15× improvement over the original baseline (F1: 0.239 → 0.752)
  • Precision improved from 24% to 90%; false positives dropped by ~90%
  • Returns diminish as rank grows (Event F1 0.734 → 0.744 → 0.752 for ranks 2, 4, 8); rank 4 may be the sweet spot for cost/performance

Limitations

  • 30-second maximum: The model processes 30s clips. For longer audio, segment into overlapping 30s windows.
  • Synthetic training data: Trained on synthetic mixtures, not real-world recordings. Performance may vary on production audio.
  • Vocal burst types: Trained primarily on laughs, coughs, sneezes, sighs, gasps, cries. May not generalize to all vocal burst types.
  • Background sensitivity: Works best with music, environmental sounds, or silence backgrounds. Dense speech backgrounds may cause more false positives.
  • Frame resolution: 20ms per frame (50 fps), so event boundaries cannot be resolved more finely than about ±20ms.
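
For the 30-second limit, the overlapping-window approach could look like the sketch below. The 25s hop (5s overlap) is an arbitrary choice for illustration, and both helper names are assumptions; inference.py may or may not provide something equivalent.

```python
def window_starts(total_sec, win_sec=30.0, hop_sec=25.0):
    """Start times of overlapping windows that together cover total_sec."""
    starts, t = [], 0.0
    while True:
        starts.append(t)
        if t + win_sec >= total_sec:
            return starts
        t += hop_sec


def to_global(events, offset_sec):
    """Shift window-relative event times onto the full-recording timeline."""
    return [{**ev, "start": ev["start"] + offset_sec, "end": ev["end"] + offset_sec}
            for ev in events]


print(window_starts(70.0))  # [0.0, 25.0, 50.0]
print(to_global([{"start": 1.0, "end": 2.0, "confidence": 0.9}], 25.0))
```

An event that falls inside an overlap region can be reported by two windows, so the combined list still needs de-duplication or merging before use.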

Citation

@misc{vocalburst-locator-2025,
    title={Vocal Burst Locator: Whisper-based Vocal Burst Segmentation},
    author={LAION},
    year={2025},
    publisher={HuggingFace},
    url={https://huggingface.co/laion/vocalburst-locator}
}

License

Apache 2.0
