Vocal Burst Locator
A Whisper-based model that detects and localizes vocal bursts (laughs, coughs, sneezes, sighs, gasps, cries, screams, etc.) in audio, returning precise start/end timestamps for each event.
Model Description
This model performs binary frame-level segmentation on audio: for each 20ms frame in a 30-second audio clip, it predicts whether a vocal burst is occurring. Post-processing then groups these frame-level predictions into discrete events with timestamps and confidence scores.
Architecture
Audio (16kHz, 30s) → Whisper-small Encoder (LoRA rank-8 merged) → 1500 frame embeddings
    → Linear(768→384) + GELU + Dropout
    → Conv1d(384, kernel=7) + GELU + Dropout (temporal smoothing)
    → Linear(384→1) → sigmoid → 1500 probabilities
    → Post-processing → [(start, end, confidence), ...]
The model uses OpenAI's Whisper-small encoder as the audio feature backbone. During training, the encoder was adapted using LoRA (rank 8, alpha 16) on the q_proj and v_proj attention matrices. The LoRA weights have been merged into the base weights, so no adapter library is needed at inference time.
Performance
Evaluated on 300 held-out synthetic soundscapes with the recommended inference settings (threshold=0.65, merge_gap=0.3s, min_dur=0.5s):
| Metric | Value |
|---|---|
| Event F1 | 0.752 |
| Event Precision | 0.897 |
| Event Recall | 0.781 |
| Binary Detection Accuracy | 0.810 |
| Frame Accuracy (all) | 0.928 |
The model catches ~78% of vocal burst events with ~90% precision (9 out of 10 predicted events are real).
Quick Start
Installation
pip install torch transformers soundfile librosa huggingface_hub
Python API
from inference import load_model, detect_vocal_bursts
# Load model (auto-downloads weights from HuggingFace)
model, fe, device = load_model("cuda") # or "cpu"
# Detect vocal bursts
events = detect_vocal_bursts("audio.mp3", model=model, fe=fe, device=device)
for ev in events:
    print(f"{ev['start']:.2f}s - {ev['end']:.2f}s "
          f"(confidence: {ev['confidence']:.2f})")
Command Line
# Basic usage
python inference.py audio.mp3
# With custom settings
python inference.py audio.wav --threshold 0.7 --min-dur 0.3 --device cuda
# JSON output (for piping to other tools)
python inference.py audio.mp3 --json
# Use a local checkpoint instead of auto-downloading
python inference.py audio.mp3 --checkpoint ./model.pt
Expected Output
Detected 3 vocal burst(s) in audio.mp3:
1. 2.14s - 3.82s (duration: 1.68s, confidence: 0.89)
2. 8.50s - 9.12s (duration: 0.62s, confidence: 0.74)
3. 15.30s - 16.94s (duration: 1.64s, confidence: 0.92)
JSON output (--json):
{
  "file": "audio.mp3",
  "events": [
    {"start": 2.14, "end": 3.82, "confidence": 0.89, "duration": 1.68},
    {"start": 8.5, "end": 9.12, "confidence": 0.74, "duration": 0.62},
    {"start": 15.3, "end": 16.94, "confidence": 0.92, "duration": 1.64}
  ]
}
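For downstream tooling, the JSON form is straightforward to consume. The following is a minimal sketch, assuming the `--json` output has exactly the shape shown above; the `summarize` helper is hypothetical and not part of `inference.py`.

```python
import json

# Example payload copied from the JSON output shown above.
payload = """
{
  "file": "audio.mp3",
  "events": [
    {"start": 2.14, "end": 3.82, "confidence": 0.89, "duration": 1.68},
    {"start": 8.5, "end": 9.12, "confidence": 0.74, "duration": 0.62},
    {"start": 15.3, "end": 16.94, "confidence": 0.92, "duration": 1.64}
  ]
}
"""


def summarize(result: dict) -> dict:
    """Hypothetical helper: count events, total their duration, find the most confident one."""
    events = result["events"]
    total = sum(ev["end"] - ev["start"] for ev in events)
    best = max(events, key=lambda ev: ev["confidence"]) if events else None
    return {
        "n_events": len(events),
        "total_sec": round(total, 2),
        "most_confident": best,
    }


summary = summarize(json.loads(payload))
print(summary["n_events"], summary["total_sec"])
```

In a shell pipeline the same helper could read from standard input instead of the inline string.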
Inference Parameters
| Parameter | Default | Description |
|---|---|---|
| threshold | 0.65 | Detection confidence threshold (0-1). Higher = fewer false positives, lower = fewer missed events. |
| merge_gap | 0.3 | Merge predicted segments closer than this (seconds). Prevents a single event from being split into fragments. |
| min_dur | 0.5 | Discard predicted events shorter than this (seconds). Filters out spurious short false positives. |
| device | auto | "cpu", "cuda", or "cuda:0" etc. Auto-detects GPU if available. |
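To make the roles of the three post-processing parameters concrete, here is a minimal sketch of how frame probabilities could be turned into events. It is not the code in `inference.py`: the order of the steps (threshold, then merge, then duration filter) and the use of the mean frame probability as the event confidence are assumptions made for illustration.

```python
FRAMES_PER_SEC = 50  # one frame every 20 ms


def probs_to_events(probs, threshold=0.65, merge_gap=0.3, min_dur=0.5):
    """Turn per-frame probabilities into (start_sec, end_sec, confidence) tuples."""
    gap_frames = round(merge_gap * FRAMES_PER_SEC)
    min_frames = round(min_dur * FRAMES_PER_SEC)

    # 1. Threshold: collect runs of consecutive frames at or above the cutoff.
    runs, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append([start, i])  # end index is exclusive
            start = None
    if start is not None:
        runs.append([start, len(probs)])

    # 2. Merge: bridge gaps shorter than merge_gap.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < gap_frames:
            merged[-1][1] = e
        else:
            merged.append([s, e])

    # 3. Filter: drop events shorter than min_dur and score the rest.
    events = []
    for s, e in merged:
        if e - s >= min_frames:
            confidence = sum(probs[s:e]) / (e - s)  # assumed: mean frame probability
            events.append((s / FRAMES_PER_SEC, e / FRAMES_PER_SEC, round(confidence, 2)))
    return events


# Toy input: two nearby bursts (merged) and one 0.2 s spike (discarded).
probs = [0.0] * 1500
probs[100:200] = [0.9] * 100
probs[205:250] = [0.8] * 45
probs[500:510] = [0.95] * 10
events = probs_to_events(probs)
print(events)
```

Working in whole frames rather than seconds avoids floating-point surprises when a duration falls exactly on a boundary such as 0.5 s.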
Understanding Precision, Recall, and the Threshold Trade-off
Imagine the model is a security guard watching for vocal bursts. It has to make a decision for every moment of audio: "Is this a vocal burst, or not?"
There are four possible outcomes:
| Model says | Reality: vocal burst | Reality: not a VB |
|---|---|---|
| Yes | True positive (correct) | False positive ("false alarm") |
| No | False negative ("missed it") | True negative (correct) |
Precision = Of everything the model flagged, how many were real? TP / (TP + FP)
- High precision → when the model says "vocal burst!", it's almost always right
- Low precision → lots of false alarms (the model is trigger-happy)
Recall = Of all real vocal bursts, how many did the model catch? TP / (TP + FN)
- High recall → the model rarely misses a real event
- Low recall → the model is too conservative, missing real events
F1 Score = The harmonic mean of precision and recall, which balances both into one number. An F1 of 0.75 means a good balance between catching events and not crying wolf.
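The three definitions above can be worked through in a few lines. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the model's evaluation.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 correct detections, 20 false alarms, 40 missed events.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```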
How Each Parameter Affects Results
threshold (default: 0.65): The confidence cutoff
The model outputs a confidence score (0 to 1) for every 20ms frame. The threshold decides: "How confident must the model be before we call it a vocal burst?"
threshold = 0.3 (low) → Model flags almost everything
- High recall (catches most VBs)
- Low precision (many false alarms)
- Think: paranoid security guard
threshold = 0.65 (default) → Balanced
- Good precision AND recall
- Our recommended sweet spot
threshold = 0.9 (high) → Model only flags when very sure
- High precision (almost no false alarms)
- Low recall (misses quieter/ambiguous VBs)
- Think: lazy security guard
Real example from our validation set:
| Threshold | Precision | Recall | F1 | False Positives | Missed Events |
|---|---|---|---|---|---|
| 0.40 | 0.62 | 0.89 | 0.73 | Many | Few |
| 0.65 | 0.90 | 0.78 | 0.75 | Few | Some |
| 0.85 | 0.95 | 0.55 | 0.70 | Very few | Many |
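One way to use a table like this is to pick the operating point with the best F1 among those that satisfy an application's minimum precision or recall. The sketch below uses only the three rows above; the `pick_threshold` helper is hypothetical.

```python
# (threshold, precision, recall, F1) rows copied from the table above.
ROWS = [
    (0.40, 0.62, 0.89, 0.73),
    (0.65, 0.90, 0.78, 0.75),
    (0.85, 0.95, 0.55, 0.70),
]


def pick_threshold(rows, min_precision=0.0, min_recall=0.0):
    """Return the row with the highest F1 among rows meeting both floors, or None."""
    eligible = [row for row in rows if row[1] >= min_precision and row[2] >= min_recall]
    return max(eligible, key=lambda row: row[3]) if eligible else None


default_choice = pick_threshold(ROWS)
recall_first = pick_threshold(ROWS, min_recall=0.85)
precision_first = pick_threshold(ROWS, min_precision=0.93)
print(default_choice[0], recall_first[0], precision_first[0])
```

With no constraint the default 0.65 row wins; a recall floor pushes the choice to 0.40 and a precision floor to 0.85.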
min_dur (default: 0.5s): Minimum event duration
After grouping confident frames into events, discard any event shorter than min_dur.
min_dur = 0.1s → Keeps very short detections
- More false positives (brief noise spikes get flagged)
- Can detect short coughs/gasps
min_dur = 0.5s → Filters out most noise spikes
- Fewer false positives
- Might miss very brief vocal bursts (<0.5s)
min_dur = 1.0s → Only keeps long events
- Very few false positives
- Misses short coughs, gasps, sneezes
Why this works: Real vocal bursts (laughs, coughs, sneezes) typically last 0.5 to 3 seconds. Random noise or brief audio artifacts are usually <0.3s. By requiring events to be at least 0.5s, we eliminate most false positives with minimal impact on real events.
merge_gap (default: 0.3s): Gap tolerance for merging
If two detected segments are separated by less than merge_gap, merge them into one event.
merge_gap = 0.0s → No merging. A laugh with a brief pause becomes 2 events.
- Result: Over-counting (more events than expected)
merge_gap = 0.3s → Small pauses within a single laugh/cough stay merged.
- Result: Natural event boundaries
merge_gap = 1.0s → Even 1-second gaps get bridged.
- Result: Separate nearby events might merge into one big event
Example: Someone laughs for 2s, pauses 0.2s to breathe, laughs again for 1s.
- merge_gap=0.0 → reports 2 events (fragmented)
- merge_gap=0.3 → reports 1 event of 3.2s (correct: it's one laugh)
- merge_gap=1.5 → might merge two separate laughs into one (over-merging)
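The laugh example can be reproduced with a small merge routine that works directly in seconds. This is an illustrative sketch; `merge_segments` is a hypothetical helper, not a function exported by `inference.py`.

```python
def merge_segments(segments, merge_gap):
    """Merge (start, end) segments, in seconds, whose gap is smaller than merge_gap."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < merge_gap:
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# 2 s of laughter, a 0.2 s breath, then 1 s more laughter.
laugh = [(0.0, 2.0), (2.2, 3.2)]
fragmented = merge_segments(laugh, merge_gap=0.0)
single = merge_segments(laugh, merge_gap=0.3)
print(len(fragmented), single)
```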
Parameter Recipes
| Use Case | threshold | min_dur | merge_gap | What changes |
|---|---|---|---|---|
| Balanced (default) | 0.65 | 0.5 | 0.3 | Good all-around |
| High precision (no false alarms) | 0.80 | 0.7 | 0.3 | ↑ precision, ↓ recall |
| High recall (catch everything) | 0.45 | 0.2 | 0.5 | ↑ recall, ↓ precision |
| Noisy audio (music, crowds) | 0.75 | 0.6 | 0.3 | Reduces noise-triggered FPs |
| Short events (coughs, gasps) | 0.60 | 0.2 | 0.2 | Catches brief events |
| Long events only (extended laughs) | 0.65 | 1.0 | 0.5 | Ignores anything <1s |
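The recipes can be kept as a small lookup table and turned into command lines. The sketch below only uses the `--threshold` and `--min-dur` flags that appear in the Command Line examples; no CLI flag for merge_gap is shown there, so that value is stored but deliberately not passed. The recipe names and `build_command` are hypothetical.

```python
import shlex

# Values copied from the recipe table above; the keys are illustrative names.
RECIPES = {
    "balanced":       {"threshold": 0.65, "min_dur": 0.5, "merge_gap": 0.3},
    "high_precision": {"threshold": 0.80, "min_dur": 0.7, "merge_gap": 0.3},
    "high_recall":    {"threshold": 0.45, "min_dur": 0.2, "merge_gap": 0.5},
    "noisy_audio":    {"threshold": 0.75, "min_dur": 0.6, "merge_gap": 0.3},
    "short_events":   {"threshold": 0.60, "min_dur": 0.2, "merge_gap": 0.2},
    "long_events":    {"threshold": 0.65, "min_dur": 1.0, "merge_gap": 0.5},
}


def build_command(audio_path: str, recipe: str) -> str:
    """Build an inference.py command line for a named recipe."""
    params = RECIPES[recipe]
    argv = [
        "python", "inference.py", audio_path,
        "--threshold", str(params["threshold"]),
        "--min-dur", str(params["min_dur"]),
    ]
    # merge_gap is not passed: the CLI examples do not show a flag for it.
    return shlex.join(argv)


command = build_command("audio.wav", "high_recall")
print(command)
```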
The Precision-Recall Trade-off (Why You Can't Have Both at 100%)
This is a fundamental concept in machine learning: making the model more cautious (↑ precision) generally means it will miss more real events (↓ recall), and vice versa. In practice, you can't eliminate false positives without also losing some true positives.
Threshold axis: ← more conservative (higher threshold) | more aggressive (lower threshold) →
Precision: goes DOWN as you lower the threshold
Recall: goes UP as you lower the threshold
Sweet spot: the threshold where F1 is highest
Choose your trade-off based on your application:
- Automatic subtitling: Prefer high precision (don't annotate noise as laughter)
- Safety monitoring: Prefer high recall (don't miss a scream or cry for help)
- Research/counting: Use balanced F1 (minimize both types of errors)
Files
| File | Size | Description |
|---|---|---|
| model.pt | 972 MB | Full model weights (Whisper encoder with merged LoRA + segmentation head) |
| head_only.pt | 5.3 MB | Segmentation head weights only (use with your own Whisper-small encoder) |
| inference.py | - | Standalone inference script with CLI and Python API |
| train.py | - | Full training script (supports frozen/LoRA/fine-tuning modes) |
| generate_dataset.py | - | Synthetic training data generator |
| download_sources.py | - | Downloads source audio from HuggingFace datasets |
| config.json | - | Model configuration and training hyperparameters |
Using head_only.pt
If you already have Whisper-small loaded, or want to pair the head with another Whisper-small-sized encoder (the head's first layer expects 768-dimensional frames):
import torch
from transformers import WhisperModel
# Load your own Whisper encoder
whisper = WhisperModel.from_pretrained("openai/whisper-small")
# mel_features: log-mel input features of shape [B, 80, 3000], e.g. from WhisperFeatureExtractor
encoder_out = whisper.encoder(input_features=mel_features).last_hidden_state # [B, 1500, 768]
# Load just the segmentation head
head_sd = torch.load("head_only.pt", map_location="cpu")
# head_sd contains: proj.0.weight, proj.0.bias, temporal.0.weight, temporal.0.bias, out.weight, out.bias
# Apply: proj → permute → temporal → permute → out → squeeze → sigmoid
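As a sanity check on the tensors you load, the head's parameter count can be worked out from the Architecture section. This is back-of-the-envelope arithmetic, assuming the Conv1d keeps 384 output channels and that weights are stored as float32; neither detail is stated explicitly above.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights plus biases of a standard (non-grouped) 1-D convolution."""
    return c_in * c_out * kernel + c_out


head = {
    "proj.0 (Linear 768->384)": linear_params(768, 384),
    # Assumption: 384 input and 384 output channels, kernel size 7.
    "temporal.0 (Conv1d 384->384, k=7)": conv1d_params(384, 384, 7),
    "out (Linear 384->1)": linear_params(384, 1),
}
total_params = sum(head.values())
size_mb = total_params * 4 / 1e6  # 4 bytes per float32 value

for name, count in head.items():
    print(f"{name}: {count:,}")
print(f"total: {total_params:,} params, about {size_mb:.1f} MB")
```

Under these assumptions the estimate comes out at about 5.3 MB, in line with the size listed for head_only.pt in the Files table.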
Training
Full Pipeline
# 1. Download source audio (~15K vocal bursts, ~13K backgrounds)
python download_sources.py
# 2. Generate synthetic soundscapes (~33K samples)
python generate_dataset.py
# 3. Train with LoRA (best configuration)
CUDA_VISIBLE_DEVICES=0 \
FREEZE_ENCODER=1 LORA_RANK=8 LORA_ALPHA=16 \
POS_WEIGHT=2 DET_THRESHOLD=0.65 POST_MERGE_GAP=0.3 POST_MIN_DUR=0.5 \
EPOCHS=15 LR=5e-4 ENCODER_LR=2e-4 \
python train.py
Training Configuration
The training script is controlled entirely via environment variables:
| Variable | Default | Description |
|---|---|---|
| FREEZE_ENCODER | 0 | Set to 1 to freeze Whisper encoder (required for LoRA) |
| LORA_RANK | 0 | LoRA rank (0=disabled, 8=recommended) |
| LORA_ALPHA | 0 | LoRA alpha (0=auto: rank×2) |
| POS_WEIGHT | 4.0 | BCE positive class weight (2.0 recommended for precision) |
| DET_THRESHOLD | 0.5 | Detection threshold for eval metrics |
| POST_MERGE_GAP | 0.5 | Post-processing merge gap (seconds) |
| POST_MIN_DUR | 0.3 | Post-processing min duration (seconds) |
| LR | 2e-4 | Head learning rate |
| ENCODER_LR | 0 | Encoder/LoRA learning rate (0=same as LR) |
| EPOCHS | 6 | Training epochs |
| MAX_BSZ | 0 | Max batch size cap (0=unlimited, auto-probed) |
| INIT_WEIGHTS | - | Path to checkpoint for weight initialization |
| RESUME_MODE | none | Resume training: none, latest, or best |
| DATA_DIR | vb_dataset | Path to training data |
| OUT_DIR | vb_output | Output directory for checkpoints and logs |
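The "0 = auto" conventions in the table are easy to misread, so here is a minimal sketch of how such environment variables could be resolved. It is not the parsing code in `train.py`; `read_config` is a hypothetical function that covers only a subset of the variables and uses the defaults listed above.

```python
import os


def read_config(env=os.environ) -> dict:
    """Hypothetical reader for a few of the training variables in the table above."""
    rank = int(env.get("LORA_RANK", "0"))
    alpha = float(env.get("LORA_ALPHA", "0"))
    lr = float(env.get("LR", "2e-4"))
    encoder_lr = float(env.get("ENCODER_LR", "0"))
    return {
        "freeze_encoder": env.get("FREEZE_ENCODER", "0") == "1",
        "lora_rank": rank,
        "lora_alpha": alpha if alpha > 0 else rank * 2,      # 0 = auto: rank x 2
        "lr": lr,
        "encoder_lr": encoder_lr if encoder_lr > 0 else lr,  # 0 = same as LR
        "epochs": int(env.get("EPOCHS", "6")),
        "pos_weight": float(env.get("POS_WEIGHT", "4.0")),
    }


# Pass a plain dict instead of os.environ to see how the fallbacks resolve.
cfg = read_config({"FREEZE_ENCODER": "1", "LORA_RANK": "8", "LR": "5e-4"})
print(cfg)
```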
Data Generation
The synthetic dataset generator creates audio soundscapes by mixing:
- Vocal burst sources: ~15,680 clips from HuggingFace (laughs, coughs, sneezes, etc.)
- Background sources: Music (5,000), AudioSet SFX (5,000), AudioSnippets (3,000)
- Parameters: Random background type, 0-5 VBs per clip, varied SNR, up to 30s duration
- Split: 50% positive (with VBs) / 50% negative (background only)
Each sample produces an .mp3 audio file and a .json metadata file:
{
  "events": [
    {"start_time": 3.21, "end_time": 4.85},
    {"start_time": 12.50, "end_time": 13.10}
  ],
  "duration_sec": 24.5,
  "bg_type": "music",
  "n_vocal_bursts": 2
}
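For frame-level training, metadata like this has to become a binary target with one value per 20ms frame. The sketch below shows one plausible conversion; the rounding rule (any frame that overlaps an event counts as positive) is an assumption, and `events_to_frame_labels` is a hypothetical name rather than a function from `train.py`.

```python
import json
import math

FRAMES_PER_SEC = 50  # 20 ms frames
N_FRAMES = 1500      # 30 s clip


def events_to_frame_labels(meta: dict) -> list:
    """Return a list of 1500 zeros and ones marking frames covered by an event."""
    labels = [0] * N_FRAMES
    for ev in meta["events"]:
        # Assumption: floor the start and ceil the end, so partially covered frames are positive.
        first = max(0, math.floor(ev["start_time"] * FRAMES_PER_SEC))
        last = min(N_FRAMES, math.ceil(ev["end_time"] * FRAMES_PER_SEC))
        for i in range(first, last):
            labels[i] = 1
    return labels


# Metadata copied from the example above.
meta = json.loads(
    '{"events": [{"start_time": 3.21, "end_time": 4.85},'
    ' {"start_time": 12.50, "end_time": 13.10}],'
    ' "duration_sec": 24.5, "bg_type": "music", "n_vocal_bursts": 2}'
)
labels = events_to_frame_labels(meta)
print(len(labels), sum(labels))
```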
Experiment Results
We compared a frozen encoder against LoRA ranks 2, 4, and 8, all with the optimized settings (threshold=0.65, merge_gap=0.3s, min_dur=0.5s, pos_weight=2):
| Model | Trainable Params | Event F1 | Precision | Recall | Binary Det |
|---|---|---|---|---|---|
| Frozen encoder | 295K (0.12%) | 0.589 | 0.786 | 0.645 | 0.733 |
| LoRA rank-2 | 1.55M (0.64%) | 0.734 | 0.886 | 0.768 | 0.803 |
| LoRA rank-4 | 1.77M (0.73%) | 0.744 | 0.878 | 0.794 | 0.807 |
| LoRA rank-8 | 2.21M (0.91%) | 0.752 | 0.897 | 0.781 | 0.810 |
Key findings:
- Raising the detection threshold from 0.5 to 0.65 and tightening post-processing doubled F1 with zero retraining
- LoRA rank-8 provided a 3.15× improvement over the original baseline (F1: 0.239 → 0.752)
- Precision improved from 24% to 90%; false positives dropped by ~90%
- Diminishing returns as rank grows (rank 4 → rank 8 adds only 0.008 F1); rank 4 may be the sweet spot for cost/performance
Limitations
- 30-second maximum: The model processes 30s clips. For longer audio, segment into overlapping 30s windows.
- Synthetic training data: Trained on synthetic mixtures, not real-world recordings. Performance may vary on production audio.
- Vocal burst types: Trained primarily on laughs, coughs, sneezes, sighs, gasps, cries. May not generalize to all vocal burst types.
- Background sensitivity: Works best with music, environmental sounds, or silence backgrounds. Dense speech backgrounds may cause more false positives.
- Frame resolution: 20ms per frame (50 fps). Event boundaries are accurate to ±20ms.
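For the 30-second limit, one possible windowing scheme is sketched below. The 5-second overlap, the rule of keeping the higher confidence when two windows report the same event, and both helper names are assumptions for illustration; each window would still be passed to the model separately.

```python
def window_starts(total_sec, win_sec=30.0, overlap_sec=5.0):
    """Start times of overlapping windows that together cover the whole file."""
    hop = win_sec - overlap_sec
    starts, t = [], 0.0
    while True:
        starts.append(t)
        if t + win_sec >= total_sec:
            break
        t += hop
    return starts


def stitch(per_window_events, starts, merge_gap=0.3):
    """Shift window-relative (start, end, confidence) events to file time and merge duplicates."""
    shifted = sorted(
        (s + offset, e + offset, c)
        for offset, events in zip(starts, per_window_events)
        for s, e, c in events
    )
    merged = []
    for s, e, c in shifted:
        if merged and s - merged[-1][1] < merge_gap:
            ps, pe, pc = merged[-1]
            merged[-1] = (ps, max(pe, e), max(pc, c))  # assumed: keep the higher confidence
        else:
            merged.append((s, e, c))
    return merged


starts = window_starts(70.0)
# The same burst seen at the end of window 0 and at the start of window 1.
per_window = [[(26.0, 28.0, 0.9)], [(1.0, 3.5, 0.8)], []]
stitched = stitch(per_window, starts)
print(starts, stitched)
```

The overlap should be at least as long as the longest event you expect, so that every event falls entirely inside at least one window.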
Citation
@misc{vocalburst-locator-2025,
title={Vocal Burst Locator: Whisper-based Vocal Burst Segmentation},
author={LAION},
year={2025},
publisher={HuggingFace},
url={https://huggingface.co/laion/vocalburst-locator}
}
License
Apache 2.0