Vocal Burst Locator

A Whisper-based model that detects and localizes vocal bursts (laughs, coughs, sneezes, sighs, gasps, cries, screams, etc.) in audio, returning precise start/end timestamps for each event.

Model Description

This model performs binary frame-level segmentation on audio: for each 20ms frame in a 30-second audio clip, it predicts whether a vocal burst is occurring. Post-processing then groups these frame-level predictions into discrete events with timestamps and confidence scores.

Architecture

Audio (16kHz, 30s) → Whisper-small Encoder (LoRA rank-8 merged) → 1500 frame embeddings
    → Linear(768→384) + GELU + Dropout
    → Conv1d(384, kernel=7) + GELU + Dropout   (temporal smoothing)
    → Linear(384→1) → sigmoid → 1500 probabilities
    → Post-processing → [(start, end, confidence), ...]

The model uses OpenAI's Whisper-small encoder as the audio feature backbone. During training, the encoder was adapted using LoRA (rank 8, alpha 16) on the q_proj and v_proj attention matrices. The LoRA weights have been merged into the base weights, so no adapter library is needed at inference time.
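
The merge described above can be written out explicitly. This is a sketch that assumes the standard LoRA parameterization with α/r scaling; the training script's exact scaling is not shown in this card. Each adapted projection weight W (768×768 in Whisper-small) absorbs its low-rank update:

```latex
W' = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{768 \times r},\quad A \in \mathbb{R}^{r \times 768},
\qquad r = 8,\ \alpha = 16 \;\Rightarrow\; \frac{\alpha}{r} = 2
```

Because W' has the same shape as W, the merged checkpoint loads like a plain Whisper-small encoder with no extra adapter parameters.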

Performance

Evaluated on 300 held-out synthetic soundscapes with the recommended inference settings (threshold=0.65, merge_gap=0.3s, min_dur=0.5s):

Metric	Value
Event F1	0.752
Event Precision	0.897
Event Recall	0.781
Binary Detection Accuracy	0.810
Frame Accuracy (all)	0.928

The model catches ~78% of vocal burst events with ~90% precision (9 out of 10 predicted events are real).

Quick Start

Installation

pip install torch transformers soundfile librosa huggingface_hub

Python API

from inference import load_model, detect_vocal_bursts

# Load model (auto-downloads weights from HuggingFace)
model, fe, device = load_model("cuda")  # or "cpu"

# Detect vocal bursts
events = detect_vocal_bursts("audio.mp3", model=model, fe=fe, device=device)

for ev in events:
    print(f"{ev['start']:.2f}s - {ev['end']:.2f}s  "
          f"(confidence: {ev['confidence']:.2f})")

Command Line

# Basic usage
python inference.py audio.mp3

# With custom settings
python inference.py audio.wav --threshold 0.7 --min-dur 0.3 --device cuda

# JSON output (for piping to other tools)
python inference.py audio.mp3 --json

# Use a local checkpoint instead of auto-downloading
python inference.py audio.mp3 --checkpoint ./model.pt

Expected Output

Detected 3 vocal burst(s) in audio.mp3:

  1. 2.14s - 3.82s  (duration: 1.68s, confidence: 0.89)
  2. 8.50s - 9.12s  (duration: 0.62s, confidence: 0.74)
  3. 15.30s - 16.94s  (duration: 1.64s, confidence: 0.92)

JSON output (--json):

{
  "file": "audio.mp3",
  "events": [
    {"start": 2.14, "end": 3.82, "confidence": 0.89, "duration": 1.68},
    {"start": 8.5, "end": 9.12, "confidence": 0.74, "duration": 0.62},
    {"start": 15.3, "end": 16.94, "confidence": 0.92, "duration": 1.64}
  ]
}
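
The JSON form is convenient for downstream scripts. As a minimal sketch, the snippet below parses the sample payload shown above and summarizes it; `summarize` and `min_confidence` are illustrative names, not part of `inference.py`.

```python
import json

# Sample payload copied from the JSON example above.
raw = """
{
  "file": "audio.mp3",
  "events": [
    {"start": 2.14, "end": 3.82, "confidence": 0.89, "duration": 1.68},
    {"start": 8.5, "end": 9.12, "confidence": 0.74, "duration": 0.62},
    {"start": 15.3, "end": 16.94, "confidence": 0.92, "duration": 1.64}
  ]
}
"""

def summarize(payload: str, min_confidence: float = 0.0) -> dict:
    """Count events and total burst time, optionally filtering by confidence."""
    data = json.loads(payload)
    kept = [ev for ev in data["events"] if ev["confidence"] >= min_confidence]
    return {
        "file": data["file"],
        "n_events": len(kept),
        "total_duration": round(sum(ev["end"] - ev["start"] for ev in kept), 2),
    }

summary = summarize(raw, min_confidence=0.8)
print(summary)
```

In a pipeline, the same function could read `sys.stdin.read()` instead of the inline string.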

Inference Parameters

Parameter	Default	Description
`threshold`	0.65	Detection confidence threshold (0-1). Higher = fewer false positives, lower = fewer missed events.
`merge_gap`	0.3	Merge predicted segments closer than this (seconds). Prevents a single event from being split into fragments.
`min_dur`	0.5	Discard predicted events shorter than this (seconds). Filters out spurious short false positives.
`device`	auto	`"cpu"`, `"cuda"`, or `"cuda:0"` etc. Auto-detects GPU if available.

Understanding Precision, Recall, and the Threshold Trade-off

Imagine the model is a security guard watching for vocal bursts. It has to make a decision for every moment of audio: "Is this a vocal burst, or not?"

There are four possible outcomes:

                        REALITY
                   Vocal Burst    Not a VB
                 ┌─────────────┬─────────────┐
  MODEL    Yes   │ True Pos ✓  │ False Pos ✗ │  ← "False alarm"
  SAYS:          │ (correct!)  │ (oops)      │
                 ├─────────────┼─────────────┤
           No    │ False Neg ✗ │ True Neg ✓  │  ← "Missed it"
                 │ (missed!)   │ (correct!)  │
                 └─────────────┴─────────────┘

Precision = Of everything the model flagged, how many were real? TP / (TP + FP)
- High precision → when the model says "vocal burst!", it's almost always right
- Low precision → lots of false alarms (the model is trigger-happy)

Recall = Of all real vocal bursts, how many did the model catch? TP / (TP + FN)
- High recall → the model rarely misses a real event
- Low recall → the model is too conservative, missing real events

F1 Score = The harmonic mean of precision and recall, which balances both into one number. An F1 of 0.75 means a good balance between catching events and not crying wolf.
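
The three definitions above reduce to a few lines of arithmetic. The counts in this sketch are hypothetical and chosen only to make the numbers easy to follow; they are not taken from the evaluation set.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from event counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed events.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```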

How Each Parameter Affects Results

`threshold` (default: 0.65) — The confidence cutoff

The model outputs a confidence score (0 to 1) for every 20ms frame. The threshold decides: "How confident must the model be before we call it a vocal burst?"

threshold = 0.3 (low)     →  Model flags almost everything
                              ✓ High recall (catches most VBs)
                              ✗ Low precision (many false alarms)
                              Think: paranoid security guard

threshold = 0.65 (default) →  Balanced
                              ✓ Good precision AND recall
                              = Our recommended sweet spot

threshold = 0.9 (high)     →  Model only flags when very sure
                              ✓ High precision (almost no false alarms)
                              ✗ Low recall (misses quieter/ambiguous VBs)
                              Think: lazy security guard

Real example from our validation set:

Threshold	Precision	Recall	F1	False Positives	Missed Events
0.40	0.62	0.89	0.73	Many	Few
0.65	0.90	0.78	0.75	Few	Some
0.85	0.95	0.55	0.70	Very few	Many
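
To make the thresholding step concrete, here is a minimal sketch of turning per-frame probabilities into raw segments. It assumes that a segment's confidence is the mean probability of its frames, which this card does not confirm; `frames_to_segments` is an illustrative name, not a function from `inference.py`.

```python
FRAME_SEC = 0.02  # 20 ms per frame

def frames_to_segments(probs, threshold=0.65):
    """Group consecutive frames with prob >= threshold into (start, end, confidence)."""
    probs = list(probs)
    segments, run_start = [], None
    # The -1.0 sentinel is below any threshold, so it closes a trailing run.
    for i, p in enumerate(probs + [-1.0]):
        if p >= threshold and run_start is None:
            run_start = i
        elif p < threshold and run_start is not None:
            run = probs[run_start:i]
            # Assumption: confidence = mean frame probability over the run.
            segments.append((run_start * FRAME_SEC, i * FRAME_SEC, sum(run) / len(run)))
            run_start = None
    return segments

# Toy probabilities for eight frames (0.16 s of audio).
probs = [0.1, 0.7, 0.9, 0.8, 0.5, 0.7, 0.2, 0.4]
for thr in (0.4, 0.65, 0.9):
    print(thr, frames_to_segments(probs, thr))
```

Raising the threshold shrinks and splits the runs; lowering it lets weaker frames join them.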

`min_dur` (default: 0.5s) — Minimum event duration

After grouping confident frames into events, discard any event shorter than min_dur.

min_dur = 0.1s  →  Keeps very short detections
                   ✗ More false positives (brief noise spikes get flagged)
                   ✓ Can detect short coughs/gasps

min_dur = 0.5s  →  Filters out most noise spikes
                   ✓ Fewer false positives
                   ✗ Might miss very brief vocal bursts (<0.5s)

min_dur = 1.0s  →  Only keeps long events
                   ✓ Very few false positives
                   ✗ Misses short coughs, gasps, sneezes

Why this works: Real vocal bursts (laughs, coughs, sneezes) typically last 0.5–3 seconds. Random noise or brief audio artifacts are usually <0.3s. By requiring events to be at least 0.5s, we eliminate most false positives with minimal impact on real events.
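
The duration filter itself is a one-liner. The sketch below uses hypothetical detections to show how the three `min_dur` settings above change what survives; `drop_short_events` is an illustrative name.

```python
def drop_short_events(events, min_dur=0.5):
    """Keep only (start, end, confidence) events lasting at least min_dur seconds."""
    return [ev for ev in events if ev[1] - ev[0] >= min_dur]

# Hypothetical detections: a 0.12 s noise spike, a 1.68 s laugh, a 0.62 s cough.
events = [(1.00, 1.12, 0.71), (2.14, 3.82, 0.89), (8.50, 9.12, 0.74)]
for min_dur in (0.1, 0.5, 1.0):
    kept = drop_short_events(events, min_dur)
    print(f"min_dur={min_dur}: {len(kept)} event(s) kept")
```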

`merge_gap` (default: 0.3s) — Gap tolerance for merging

If two detected segments are separated by less than merge_gap, merge them into one event.

merge_gap = 0.0s  →  No merging. A laugh with a brief pause becomes 2 events.
                     Result: Over-counting (more events than expected)

merge_gap = 0.3s  →  Small pauses within a single laugh/cough stay merged.
                     Result: Natural event boundaries

merge_gap = 1.0s  →  Even 1-second gaps get bridged.
                     Result: Separate nearby events might merge into one big event

Example: Someone laughs for 2s, pauses 0.2s to breathe, laughs again for 1s.

merge_gap=0.0 → reports 2 events (fragmented)
merge_gap=0.3 → reports 1 event of 3.2s (correct — it's one laugh)
merge_gap=1.5 → might merge two separate laughs into one (over-merging)
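
The laugh example can be reproduced with a short merging routine. This is a sketch of the idea, not the code in `inference.py`; it treats "closer than" as a strict comparison, which is an assumption.

```python
def merge_close_segments(segments, merge_gap=0.3):
    """Merge (start, end) segments whose gap is smaller than merge_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < merge_gap:
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# The laugh example: 2 s laugh, 0.2 s pause, 1 s laugh.
laugh = [(0.0, 2.0), (2.2, 3.2)]
print(merge_close_segments(laugh, merge_gap=0.0))  # stays as two segments
print(merge_close_segments(laugh, merge_gap=0.3))  # one segment spanning 0.0 to 3.2
```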

Parameter Recipes

Use Case	threshold	min_dur	merge_gap	What changes
Balanced (default)	0.65	0.5	0.3	Good all-around
High precision (no false alarms)	0.80	0.7	0.3	↑ precision, ↓ recall
High recall (catch everything)	0.45	0.2	0.5	↑ recall, ↓ precision
Noisy audio (music, crowds)	0.75	0.6	0.3	Reduces noise-triggered FPs
Short events (coughs, gasps)	0.60	0.2	0.2	Catches brief events
Long events only (extended laughs)	0.65	1.0	0.5	Ignores anything <1s
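
If you switch between recipes often, it may help to keep them in one place. The sketch below copies the table into a dictionary and builds CLI flags from it. Only `--threshold` and `--min-dur` appear in the command-line examples in this card, so the helper emits just those two; the preset names are made up for illustration.

```python
# Values copied from the recipe table above; the keys are illustrative names.
PRESETS = {
    "balanced":       {"threshold": 0.65, "min_dur": 0.5, "merge_gap": 0.3},
    "high_precision": {"threshold": 0.80, "min_dur": 0.7, "merge_gap": 0.3},
    "high_recall":    {"threshold": 0.45, "min_dur": 0.2, "merge_gap": 0.5},
    "noisy_audio":    {"threshold": 0.75, "min_dur": 0.6, "merge_gap": 0.3},
    "short_events":   {"threshold": 0.60, "min_dur": 0.2, "merge_gap": 0.2},
    "long_events":    {"threshold": 0.65, "min_dur": 1.0, "merge_gap": 0.5},
}

def cli_args(preset: str) -> list[str]:
    """Build the inference.py flags shown in this card from a preset name."""
    p = PRESETS[preset]
    return ["--threshold", str(p["threshold"]), "--min-dur", str(p["min_dur"])]

print(" ".join(["python", "inference.py", "audio.mp3", *cli_args("high_recall")]))

# Assumption: the Python API accepts the same names as keyword arguments.
# events = detect_vocal_bursts("audio.mp3", model=model, fe=fe, device=device,
#                              **PRESETS["high_recall"])
```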

The Precision-Recall Trade-off (Why You Can't Have Both at 100%)

This is a fundamental concept in machine learning: making the model more cautious (↑ precision) generally means it will miss more real events (↓ recall), and vice versa. In practice, you can rarely eliminate false positives without also losing some true positives.

             ← More conservative        More aggressive →

  Precision: ████████████████░░░░  (goes DOWN as you lower threshold)
  Recall:    ░░░░████████████████  (goes UP as you lower threshold)
                       ↑
               Sweet spot (F1 max)

Choose your trade-off based on your application:

Automatic subtitling: Prefer high precision (don't annotate noise as laughter)
Safety monitoring: Prefer high recall (don't miss a scream or cry for help)
Research/counting: Use balanced F1 (minimize both types of errors)
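
One way to turn these preferences into a setting is to pick an operating point from measured rows. The sketch below uses the three validation rows from the threshold table earlier in this card; the precision and recall floors are hypothetical values chosen for illustration.

```python
# Rows copied from the validation table: (threshold, precision, recall, f1)
OPERATING_POINTS = [
    (0.40, 0.62, 0.89, 0.73),
    (0.65, 0.90, 0.78, 0.75),
    (0.85, 0.95, 0.55, 0.70),
]

def pick_threshold(min_precision=0.0, min_recall=0.0):
    """Return the threshold with the best F1 among rows meeting both floors, or None."""
    ok = [row for row in OPERATING_POINTS
          if row[1] >= min_precision and row[2] >= min_recall]
    return max(ok, key=lambda row: row[3])[0] if ok else None

print("research/counting:", pick_threshold())
print("subtitling:", pick_threshold(min_precision=0.93))
print("safety monitoring:", pick_threshold(min_recall=0.85))
```

With more rows from your own validation data, the same selection rule gives a finer-grained choice.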

Files

File	Size	Description
`model.pt`	972 MB	Full model weights (Whisper encoder with merged LoRA + segmentation head)
`head_only.pt`	5.3 MB	Segmentation head weights only (use with your own Whisper-small encoder)
`inference.py`	-	Standalone inference script with CLI and Python API
`train.py`	-	Full training script (supports frozen/LoRA/fine-tuning modes)
`generate_dataset.py`	-	Synthetic training data generator
`download_sources.py`	-	Downloads source audio from HuggingFace datasets
`config.json`	-	Model configuration and training hyperparameters

Using `head_only.pt`

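If you already have a Whisper-small encoder loaded, or want to try a different Whisper-small checkpoint, you can attach the segmentation head to it. The head expects 768-dimensional frame embeddings, so other Whisper sizes will not fit. The head was trained on top of the LoRA-adapted encoder, so results with a stock encoder may differ from those of `model.pt`: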

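import librosa
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Compute log-mel features for one clip (padded or truncated to 30 s)
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
audio, _ = librosa.load("audio.mp3", sr=16000, mono=True)
mel_features = fe(audio, sampling_rate=16000, return_tensors="pt").input_features  # [1, 80, 3000]

# Load your own Whisper encoder
whisper = WhisperModel.from_pretrained("openai/whisper-small").eval()
with torch.no_grad():
    encoder_out = whisper.encoder(input_features=mel_features).last_hidden_state  # [B, 1500, 768]

# Load just the segmentation head
head_sd = torch.load("head_only.pt", map_location="cpu")
# head_sd contains: proj.0.weight, proj.0.bias, temporal.0.weight, temporal.0.bias, out.weight, out.bias
# Apply: proj → permute → temporal → permute → out → squeeze → sigmoid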
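
The "Apply" comment above can be expanded into a module. This is a sketch inferred from the architecture diagram and the listed key names, not the class defined in `inference.py`: the `padding` value (chosen so the output keeps all 1500 frames) and the dropout rate are assumptions, and random input stands in for `encoder_out`.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Head layout inferred from the architecture diagram and the listed keys."""

    def __init__(self, d_in=768, d_hid=384, kernel=7, dropout=0.1):
        super().__init__()
        # Index 0 of each Sequential holds the weights (proj.0.*, temporal.0.*).
        self.proj = nn.Sequential(nn.Linear(d_in, d_hid), nn.GELU(), nn.Dropout(dropout))
        self.temporal = nn.Sequential(
            nn.Conv1d(d_hid, d_hid, kernel, padding=kernel // 2),  # assumed "same" padding
            nn.GELU(),
            nn.Dropout(dropout),
        )
        self.out = nn.Linear(d_hid, 1)

    def forward(self, x):                        # x: [B, T, 768]
        x = self.proj(x)                         # [B, T, 384]
        x = self.temporal(x.permute(0, 2, 1))    # Conv1d expects [B, C, T]
        x = self.out(x.permute(0, 2, 1))         # back to [B, T, 384], then [B, T, 1]
        return torch.sigmoid(x.squeeze(-1))      # [B, T] frame probabilities

head = SegmentationHead().eval()
# With the real weights on disk:
# head.load_state_dict(torch.load("head_only.pt", map_location="cpu"))
with torch.no_grad():
    probs = head(torch.randn(1, 1500, 768))      # stand-in for encoder_out
print(probs.shape)
```

If `load_state_dict` reports missing or unexpected keys, the real head differs from this sketch and the class in `inference.py` should be used instead.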

Training

Full Pipeline

# 1. Download source audio (~15K vocal bursts, ~13K backgrounds)
python download_sources.py

# 2. Generate synthetic soundscapes (~33K samples)
python generate_dataset.py

# 3. Train with LoRA (best configuration)
CUDA_VISIBLE_DEVICES=0 \
  FREEZE_ENCODER=1 LORA_RANK=8 LORA_ALPHA=16 \
  POS_WEIGHT=2 DET_THRESHOLD=0.65 POST_MERGE_GAP=0.3 POST_MIN_DUR=0.5 \
  EPOCHS=15 LR=5e-4 ENCODER_LR=2e-4 \
  python train.py

Training Configuration

The training script is controlled entirely via environment variables:

Variable	Default	Description
`FREEZE_ENCODER`	`0`	Set to `1` to freeze Whisper encoder (required for LoRA)
`LORA_RANK`	`0`	LoRA rank (0=disabled, 8=recommended)
`LORA_ALPHA`	`0`	LoRA alpha (0=auto: rank×2)
`POS_WEIGHT`	`4.0`	BCE positive class weight (2.0 recommended for precision)
`DET_THRESHOLD`	`0.5`	Detection threshold for eval metrics
`POST_MERGE_GAP`	`0.5`	Post-processing merge gap (seconds)
`POST_MIN_DUR`	`0.3`	Post-processing min duration (seconds)
`LR`	`2e-4`	Head learning rate
`ENCODER_LR`	`0`	Encoder/LoRA learning rate (0=same as LR)
`EPOCHS`	`6`	Training epochs
`MAX_BSZ`	`0`	Max batch size cap (0=unlimited, auto-probed)
`INIT_WEIGHTS`	-	Path to checkpoint for weight initialization
`RESUME_MODE`	`none`	Resume training: `none`, `latest`, or `best`
`DATA_DIR`	`vb_dataset`	Path to training data
`OUT_DIR`	`vb_output`	Output directory for checkpoints and logs
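
The "0 = auto" conventions in the table are easy to misread, so here is a small sketch of how such settings could be resolved. It is illustrative only and is not the parsing code in `train.py`; `read_config` and the lowercase keys are made-up names, and only a subset of the variables is covered.

```python
import os

# Defaults copied from the table above (subset).
DEFAULTS = {
    "FREEZE_ENCODER": "0", "LORA_RANK": "0", "LORA_ALPHA": "0",
    "POS_WEIGHT": "4.0", "LR": "2e-4", "ENCODER_LR": "0", "EPOCHS": "6",
}

def read_config(env=None):
    """Parse settings from a mapping, applying the '0 = auto' rules from the table."""
    env = os.environ if env is None else env

    def get(key):
        return env.get(key, DEFAULTS[key])

    rank = int(get("LORA_RANK"))
    alpha = float(get("LORA_ALPHA")) or float(rank * 2)   # 0 = auto: rank × 2
    lr = float(get("LR"))
    encoder_lr = float(get("ENCODER_LR")) or lr           # 0 = same as LR
    return {
        "freeze_encoder": get("FREEZE_ENCODER") == "1",
        "lora_rank": rank,
        "lora_alpha": alpha,
        "lr": lr,
        "encoder_lr": encoder_lr,
        "epochs": int(get("EPOCHS")),
        "pos_weight": float(get("POS_WEIGHT")),
    }

cfg = read_config({"FREEZE_ENCODER": "1", "LORA_RANK": "8", "LR": "5e-4"})
print(cfg)
```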

Data Generation

The synthetic dataset generator creates audio soundscapes by mixing:

Vocal burst sources: ~15,680 clips from HuggingFace (laughs, coughs, sneezes, etc.)
Background sources: Music (5,000), AudioSet SFX (5,000), AudioSnippets (3,000)
Parameters: Random background type, 0-5 VBs per clip, varied SNR, up to 30s duration
Split: 50% positive (with VBs) / 50% negative (background only)

Each sample produces an .mp3 audio file and a .json metadata file:

{
    "events": [
        {"start_time": 3.21, "end_time": 4.85},
        {"start_time": 12.50, "end_time": 13.10}
    ],
    "duration_sec": 24.5,
    "bg_type": "music",
    "n_vocal_bursts": 2
}
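
For frame-level training, event times like these have to become one label per 20 ms frame. The sketch below shows one way to do that for a 1500-frame clip. It assumes any frame touched by an event counts as positive; the rule actually used by `train.py` is not stated in this card.

```python
import math

FPS = 50          # 20 ms frames
N_FRAMES = 1500   # 30 s clip
EPS = 1e-9        # guards against floating-point noise at frame boundaries

def events_to_frame_labels(meta: dict) -> list[int]:
    """Return 1500 binary labels, 1 where a vocal burst overlaps the frame."""
    labels = [0] * N_FRAMES
    for ev in meta["events"]:
        first = max(0, int(ev["start_time"] * FPS + EPS))              # floor
        last = min(N_FRAMES, math.ceil(ev["end_time"] * FPS - EPS))    # exclusive
        for i in range(first, last):
            labels[i] = 1
    return labels

# Metadata copied from the example above.
meta = {
    "events": [
        {"start_time": 3.21, "end_time": 4.85},
        {"start_time": 12.50, "end_time": 13.10},
    ],
    "duration_sec": 24.5,
    "bg_type": "music",
    "n_vocal_bursts": 2,
}
labels = events_to_frame_labels(meta)
print(sum(labels), "positive frames out of", len(labels))
```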

Experiment Results

We compared a frozen encoder against LoRA ranks 2, 4, and 8, all with the optimized post-processing (threshold=0.65, merge_gap=0.3s, min_dur=0.5s) and pos_weight=2:

Model	Trainable Params	Event F1	Precision	Recall	Binary Det
Frozen encoder	295K (0.12%)	0.589	0.786	0.645	0.733
LoRA rank-2	1.55M (0.64%)	0.734	0.886	0.768	0.803
LoRA rank-4	1.77M (0.73%)	0.744	0.878	0.794	0.807
LoRA rank-8	2.21M (0.91%)	0.752	0.897	0.781	0.810

Key findings:

Raising the detection threshold from 0.5 to 0.65 and tightening post-processing doubled F1 with zero retraining
LoRA rank-8 improved F1 by 3.15× over the original baseline (F1: 0.239 → 0.752)
Precision improved from 24% to 90%, and false positives dropped by ~90%
Returns diminish as the rank grows (F1 0.734 → 0.744 → 0.752 for ranks 2, 4, and 8); rank 4 may be the sweet spot for cost/performance

Limitations

30-second maximum: The model processes 30s clips. For longer audio, segment into overlapping 30s windows.
Synthetic training data: Trained on synthetic mixtures, not real-world recordings. Performance may vary on production audio.
Vocal burst types: Trained primarily on laughs, coughs, sneezes, sighs, gasps, cries. May not generalize to all vocal burst types.
Background sensitivity: Works best with music, environmental sounds, or silence backgrounds. Dense speech backgrounds may cause more false positives.
Frame resolution: 20ms per frame (50 fps). Event boundaries are quantized to 20ms steps, so timestamps cannot be finer than that.
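
For the 30-second limit noted above, one possible windowing scheme is sketched below. The 25-second hop (5 seconds of overlap) and the rule for merging duplicate detections are assumptions, and the helper names are illustrative; run the detector on each window and pass its events through `to_global` before stitching.

```python
WINDOW_SEC = 30.0

def window_starts(total_sec, hop_sec=25.0):
    """Start offsets of 30 s windows that overlap by (30 - hop_sec) seconds."""
    starts, t = [], 0.0
    while True:
        starts.append(t)
        if t + WINDOW_SEC >= total_sec:
            return starts
        t += hop_sec

def to_global(events, offset):
    """Shift window-local (start, end, confidence) events to file time."""
    return [(s + offset, e + offset, c) for s, e, c in events]

def stitch(events, merge_gap=0.3):
    """Merge events that overlap or nearly touch, e.g. duplicates from an overlap zone."""
    merged = []
    for s, e, c in sorted(events):
        if merged and s - merged[-1][1] < merge_gap:
            ps, pe, pc = merged[-1]
            merged[-1] = (ps, max(pe, e), max(pc, c))
        else:
            merged.append((s, e, c))
    return merged

print(window_starts(70.0))
# Hypothetical case: the same laugh seen by the windows starting at 0 s and 25 s.
found = to_global([(27.0, 29.5, 0.8)], 0.0) + to_global([(2.1, 4.6, 0.9)], 25.0)
print(stitch(found))
```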

Citation

@misc{vocalburst-locator-2025,
    title={Vocal Burst Locator: Whisper-based Vocal Burst Segmentation},
    author={LAION},
    year={2025},
    publisher={HuggingFace},
    url={https://huggingface.co/laion/vocalburst-locator}
}

License

Apache 2.0
