Silero VAD v6 (MLX)
MLX-format weights for Silero VAD v6 (16 kHz branch), converted from the
official silero_vad PyPI package.
This is the v6 companion to mlx-community/silero-vad, which contains the older v5 weights.
Both are independent ports of different Silero release lines.
TL;DR
| | This repo | mlx-community/silero-vad |
|---|---|---|
| Silero version | v6 (latest) | v5 (previous) |
| Source | silero_vad PyPI v6 (torch.hub snakers4/silero-vad) | onnx-community/silero-vad |
| Branches | 16 kHz only | 16 kHz + 8 kHz |
| File size | 1.2 MB | 2.1 MB |
| Layout | vad_16k.* (mlx-audio convention) | vad_16k.* + vad_8k.* |
| Parity vs upstream PyTorch | bit-exact (max\|Δ\| = 0.0) | bit-exact (max\|Δ\| = 0.0) |
The two ports differ only in which Silero checkpoint they wrap; the architecture is identical (STFT → 4× Conv1d+ReLU → LSTM(128) → Conv1d → Sigmoid). Quality and per-chunk latency are essentially equivalent on long-form English meeting audio (see Quality below).
Architecture
Input: audio (1, 576) = 64-sample context + 512-sample chunk
+ LSTM state (h: 1×128, c: 1×128)
Pre-process: Reflection pad right (+64) → 640 samples
Learned STFT: Conv1d(1, 258, k=256, s=128) → magnitude → (1, 4, 129)
Encoder: Conv1d(129→128) → ReLU
Conv1d(128→64, s=2) → ReLU
Conv1d(64→64, s=2) → ReLU
Conv1d(64→128) → ReLU → (1, 1, 128)
LSTM: LSTMCell(128, 128), state carried across chunks for streaming
Decoder: ReLU → Conv1d(128→1, k=1) → Sigmoid → probability
Total parameters: ~309K, ~1.2 MB on disk
Streaming: carry (h, c) across calls; per-chunk decision in <1 ms
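The sequence lengths in the diagram above can be checked with the standard Conv1d output-length formula. This is a minimal sketch, not code from this repo: the helper name `conv1d_out_len` is hypothetical, the encoder kernel size of 3 is read from the tensor shapes listed under Tensor inventory, and `padding=1` for the encoder convolutions is an assumption chosen because it reproduces the documented shapes.

```python
def conv1d_out_len(length: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Standard 1-D convolution output length."""
    return (length + 2 * padding - kernel) // stride + 1


CONTEXT, CHUNK, PAD_RIGHT = 64, 512, 64

# 576 input samples plus 64 samples of right reflection padding.
samples = CONTEXT + CHUNK + PAD_RIGHT

# Learned STFT: kernel 256, stride 128, no padding.
frames = conv1d_out_len(samples, kernel=256, stride=128)

# Encoder strides from the diagram; kernel 3 with padding 1 is assumed.
trace = [frames]
for stride in (1, 2, 2, 1):
    trace.append(conv1d_out_len(trace[-1], kernel=3, stride=stride, padding=1))

print(samples, trace)
```

Under these assumptions the time axis shrinks from 4 STFT frames to a single step per chunk, which is why each call yields one probability.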
Tensor inventory
vad_16k.stft_conv.weight [258, 256, 1] - frozen learned DFT basis
vad_16k.conv1.weight [128, 3, 129] - encoder 0 (BN-fused reparameterized)
vad_16k.conv1.bias [128]
vad_16k.conv2.weight [64, 3, 128] - encoder 1 (stride 2)
vad_16k.conv2.bias [64]
vad_16k.conv3.weight [64, 3, 64] - encoder 2 (stride 2)
vad_16k.conv3.bias [64]
vad_16k.conv4.weight [128, 3, 64] - encoder 3
vad_16k.conv4.bias [128]
vad_16k.lstm.Wx [512, 128] - input gate weights, [4·H, D] gate order i,f,g,o
vad_16k.lstm.Wh [512, 128] - hidden gate weights
vad_16k.lstm.bias [512] - fused bias_ih + bias_hh
vad_16k.final_conv.weight [1, 1, 128]
vad_16k.final_conv.bias [1]
All Conv1d weights are stored in MLX channels-last layout [O, K, I].
LSTM bias is the sum of PyTorch's bias_ih + bias_hh (single-tensor MLX convention).
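The parameter count quoted in the Architecture section can be reproduced from the shapes in the inventory above. The sketch below only multiplies out those shapes; the on-disk estimate assumes 4-byte float32 storage, which this card does not state explicitly.

```python
from math import prod

# Shapes copied from the tensor inventory above.
SHAPES = {
    "vad_16k.stft_conv.weight": (258, 256, 1),
    "vad_16k.conv1.weight": (128, 3, 129),
    "vad_16k.conv1.bias": (128,),
    "vad_16k.conv2.weight": (64, 3, 128),
    "vad_16k.conv2.bias": (64,),
    "vad_16k.conv3.weight": (64, 3, 64),
    "vad_16k.conv3.bias": (64,),
    "vad_16k.conv4.weight": (128, 3, 64),
    "vad_16k.conv4.bias": (128,),
    "vad_16k.lstm.Wx": (512, 128),
    "vad_16k.lstm.Wh": (512, 128),
    "vad_16k.lstm.bias": (512,),
    "vad_16k.final_conv.weight": (1, 1, 128),
    "vad_16k.final_conv.bias": (1,),
}

total_params = sum(prod(shape) for shape in SHAPES.values())
size_mb = total_params * 4 / 1e6  # assumption: float32, 4 bytes per parameter

print(f"{total_params} parameters, about {size_mb:.2f} MB")
```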
Files
- model.safetensors: MLX-format weights (16 kHz branch, vad_16k.* layout). Same weights serve both inference modes.
- config.json: model metadata + branch parameters
- convert.py: reproducible PyTorch → MLX conversion script
- example.py: 32ms streaming inference example (per-chunk decisions; live mic / streaming use cases)
- example_256ms.py: 256ms unified inference example (8 internal chunks per call with noisy-OR aggregation; faster wall time for offline ASR preprocessing)
Conversion
The bundled convert.py produces this repo's model.safetensors from the
upstream PyPI silero_vad package:
uv pip install silero-vad safetensors numpy
python convert.py --output model.safetensors
What it does:
- Loads PyTorch state_dict via silero_vad.load_silero_vad()
- Drops the 8 kHz branch (this repo ships 16 kHz only)
- Transposes Conv1d weights [O, I, K] → [O, K, I] (MLX channels-last)
- Sums LSTM bias_ih + bias_hh → single [4H] bias
- Maps PyTorch keys to mlx-audio's vad_16k.* convention
- Saves as safetensors with metadata
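The two tensor transformations in the list above can be sketched with NumPy. This is an illustration rather than an excerpt from convert.py: the helper names `convert_conv1d` and `fuse_lstm_bias` are hypothetical, and the arrays are random stand-ins with the shapes of the first encoder convolution and the LSTM biases, not real checkpoint data.

```python
import numpy as np


def convert_conv1d(weight_oik: np.ndarray) -> np.ndarray:
    """PyTorch Conv1d weight [O, I, K] -> MLX channels-last [O, K, I]."""
    return np.ascontiguousarray(weight_oik.transpose(0, 2, 1))


def fuse_lstm_bias(bias_ih: np.ndarray, bias_hh: np.ndarray) -> np.ndarray:
    """Collapse PyTorch's two LSTM biases into a single [4H] bias."""
    return bias_ih + bias_hh


rng = np.random.default_rng(0)

# Stand-in for a PyTorch-layout weight: 128 outputs, 129 inputs, kernel 3.
torch_conv = rng.standard_normal((128, 129, 3)).astype(np.float32)
mlx_conv = convert_conv1d(torch_conv)

# Stand-in biases for hidden size 128 (4 * 128 = 512 gate entries).
fused = fuse_lstm_bias(np.ones(512, np.float32), np.full(512, 2.0, np.float32))

print(mlx_conv.shape, fused.shape)
```

The transpose only reorders axes, so every value survives unchanged; that is consistent with the bit-exact parity reported for this port.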
LSTM gate ordering is i, f, g, o along the [4H] axis. It is the same in both
PyTorch and MLX, so the gate blocks need no reordering during conversion.
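To make the gate ordering concrete, here is a NumPy sketch of one LSTM cell step using the packed [4H, D] / [4H, H] layout and fused bias described in the tensor inventory. It uses the standard LSTM equations and is not taken from example.py; the function name `lstm_cell_step` and the random weights are placeholders.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def lstm_cell_step(x, h, c, Wx, Wh, bias):
    """One LSTM step with gates packed in i, f, g, o order along the 4H axis."""
    gates = x @ Wx.T + h @ Wh.T + bias        # shape (1, 4H)
    i, f, g, o = np.split(gates, 4, axis=-1)  # each (1, H)
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_next = sigmoid(o) * np.tanh(c_next)
    return h_next, c_next


H = D = 128
rng = np.random.default_rng(0)

# Random placeholder weights with the shapes of vad_16k.lstm.Wx / Wh / bias.
Wx = 0.1 * rng.standard_normal((4 * H, D))
Wh = 0.1 * rng.standard_normal((4 * H, H))
bias = np.zeros(4 * H)

x = rng.standard_normal((1, D))
h0 = np.zeros((1, H))
c0 = np.zeros((1, H))

h_next, c_next = lstm_cell_step(x, h0, c0, Wx, Wh, bias)
print(h_next.shape, c_next.shape)
```

For streaming, the returned pair would be fed back in as `h` and `c` on the next chunk.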
Usage
Quick start (Python + MLX)
uv pip install mlx safetensors numpy huggingface_hub
# 32ms streaming (live mic, per-chunk decision)
uv run python example.py /path/to/audio_16k_mono.wav
# 256ms unified (offline batch / ASR preprocessing; ~1.7× faster wall time)
uv run python example_256ms.py /path/to/audio_16k_mono.wav
Both examples read weights directly from this repo via huggingface_hub. The
same model.safetensors serves both modes; only the inference loop differs.
Choosing a mode
| | 32ms streaming (example.py) | 256ms unified (example_256ms.py) |
|---|---|---|
| Per-call output | 1 probability per 32 ms | 1 probability per 256 ms (8 internal chunks aggregated via noisy-OR) |
| Decision latency | 32 ms | 256 ms |
| MLX wall throughput | 1× | ~1.7× faster (fewer mx.eval barriers) |
| Best for | Live microphone, real-time gating | Offline ASR preprocessing, batch / file-based VAD |
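Noisy-OR aggregation combines the eight per-chunk probabilities into one value that is high when any single chunk is likely speech. The sketch below uses the textbook definition, 1 - Π(1 - p_i); whether example_256ms.py applies exactly this form is an assumption, and the function name `noisy_or` is illustrative.

```python
from math import prod
from typing import Iterable


def noisy_or(probs: Iterable[float]) -> float:
    """Probability that at least one of several independent events fires."""
    return 1.0 - prod(1.0 - p for p in probs)


# Eight 32 ms chunks, each only weakly speech-like.
weak = noisy_or([0.1] * 8)

# Seven silent chunks and one confident speech chunk.
one_hit = noisy_or([0.0] * 7 + [0.9])

print(round(weak, 4), round(one_hit, 4))
```

Because several weak detections accumulate under noisy-OR, a threshold tuned for 32 ms chunks may need re-checking when applied to 256 ms outputs.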
Manual loading
import mlx.core as mx
from huggingface_hub import hf_hub_download
from safetensors.numpy import load_file
path = hf_hub_download("mlx-community/silero-vad-v6", "model.safetensors")
weights = load_file(path)
weights = {k: mx.array(v) for k, v in weights.items()}
# weights["vad_16k.stft_conv.weight"].shape == (258, 256, 1)
For the full forward pass, see example.py (≈ 80 lines, no external dependencies
beyond mlx, numpy, safetensors, huggingface_hub).
Streaming protocol
Per-chunk inputs (32 ms at 16 kHz):
- 64 context samples carried from the previous chunk's tail
- 512 new audio samples
- LSTM h, c state ([1, 128] each) carried across chunks
Output: scalar speech probability ∈ [0, 1]. A standard threshold of 0.5 works
well on most material; tune via config.json::threshold for your use case.
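The windowing part of this protocol can be sketched in plain Python. This is not the code from example.py: the names `iter_model_inputs` and `is_speech` are hypothetical, zero-filled context before the first chunk is an assumption, and a trailing partial chunk is simply dropped here.

```python
from typing import Iterator, List, Sequence

CONTEXT, CHUNK = 64, 512  # samples at 16 kHz; 512 samples = 32 ms


def iter_model_inputs(audio: Sequence[float]) -> Iterator[List[float]]:
    """Yield 576-sample windows: 64 context samples followed by 512 new ones."""
    context = [0.0] * CONTEXT  # assumption: silence before the first chunk
    for start in range(0, len(audio) - CHUNK + 1, CHUNK):
        chunk = list(audio[start:start + CHUNK])
        yield context + chunk
        context = chunk[-CONTEXT:]  # tail of this chunk feeds the next call


def is_speech(probability: float, threshold: float = 0.5) -> bool:
    """Apply the decision threshold to one model output."""
    return probability >= threshold


# Synthetic ramp signal so that window contents are easy to inspect.
audio = [float(i) for i in range(3 * CHUNK + 100)]
windows = list(iter_model_inputs(audio))
print(len(windows), len(windows[0]))
```

In a real loop each window would be passed to the model together with the current (h, c) state, and the returned state would be carried into the next iteration.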
Quality
Frame-level F1 against VibeVoice ASR
segment-derived speech labels on a 44-minute English meeting clip
(playback-eng-16k.wav, 83,190 chunks at 32 ms resolution, GT speech ratio 98.9%):
| Threshold | v6 (this repo) F1 | v5 (mlx-community/silero-vad) F1 | Δ |
|---|---|---|---|
| 0.30 | 0.8656 | 0.8692 | +0.004 v5 |
| 0.40 | 0.8612 | 0.8649 | +0.004 v5 |
| 0.50 | 0.8572 | 0.8607 | +0.003 v5 |
| 0.60 | 0.8534 | 0.8561 | +0.003 v5 |
| 0.70 | 0.8492 | 0.8510 | +0.002 v5 |
At threshold 0.5, precision is ≈ 0.998 for both versions and recall is 0.751 (v6) vs 0.757 (v5). The two versions are essentially equivalent within measurement noise on this sample. The v5 edge of ~0.4% F1 is too small relative to single-sample variance, GT labeling granularity (segment-level rather than word-level), and the high class imbalance to support a general "v5 is better" conclusion.
A broader multi-domain quality comparison (clean / noisy / far-field / multilingual) would be needed for a definitive ranking.
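The F1 values in the table follow from precision and recall through the usual harmonic mean. As a quick consistency check, the sketch below plugs in the rounded threshold-0.5 figures quoted above; because those inputs are rounded to three decimals, the results should agree with the table only to about the third decimal place.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Rounded figures quoted for threshold 0.5.
f1_v6 = f1_score(0.998, 0.751)
f1_v5 = f1_score(0.998, 0.757)

print(f"v6: {f1_v6:.4f}  v5: {f1_v5:.4f}  gap: {f1_v5 - f1_v6:.4f}")
```

With precision near 1.0 for both versions, the F1 gap is driven almost entirely by the small recall difference.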
Performance
Bit-exact parity with the upstream PyTorch JIT model (max|Δ| = 0.0 across 83K chunks of test audio).
32ms streaming, per-chunk single-call latency (M1 Max):
| Backend | p50 / chunk |
|---|---|
| MLX (CPU stream + mx.compile) | 0.75 ms |
| MLX (GPU naive, no compile) | 1.46 ms |
On newer hardware (M5 Max), async-batched per-chunk time drops to ~0.17 ms/chunk in CPU stream mode (≈ 187× real-time), per lucasnewman's PR701 benchmark.
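The real-time multiples follow directly from the 32 ms chunk duration. A small sketch, assuming "N× real-time" means chunk duration divided by per-chunk processing time; the helper name `realtime_factor` is illustrative, and because the inputs are the rounded latencies quoted above, the outputs are approximate.

```python
CHUNK_MS = 32.0  # 512 samples at 16 kHz


def realtime_factor(latency_ms: float, chunk_ms: float = CHUNK_MS) -> float:
    """How many times faster than real time a per-chunk latency is."""
    return chunk_ms / latency_ms


latencies_ms = {
    "M1 Max, CPU stream + mx.compile": 0.75,
    "M1 Max, GPU naive": 1.46,
    "M5 Max, CPU stream (async-batched)": 0.17,
}

factors = {name: realtime_factor(ms) for name, ms in latencies_ms.items()}
for name, factor in factors.items():
    print(f"{name}: about {factor:.0f}x real-time")
```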
256ms unified vs 32ms streaming, offline VAD on 10-min English meeting (M1 Max):
| Mode | VAD wall time | Speedup vs 32ms |
|---|---|---|
| 32ms streaming MLX | ~36 s | 1.0× |
| 256ms unified MLX | ~16 s | ~2.3× faster |
| CoreML 256ms (FluidInference, native Apple HW) | ~12 s | ~3.0× faster |
The 256ms unified mode is the right choice for offline ASR preprocessing;
its eight internal 32ms chunks per outer call amortise MLX dispatch overhead
through mx.eval barrier reduction. CoreML 256ms remains the absolute fastest
on Apple Silicon (ANE/BNNS-tuned), but pure-MLX 256ms closes ~50% of the gap.
License
MIT (matching the upstream Silero VAD license).
Citation
@misc{silero_vad_2024,
author = {Silero Team},
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad}}
}
Acknowledgments
- Silero Team: original model
- Apple MLX: runtime
- mlx-community/silero-vad: v5 port that established the layout convention used here