Audar-Diarization-V1
Real-time streaming speaker diarization for up to 8 speakers.
Audar-Diarization-V1 is a fine-tuned NVIDIA Streaming Sortformer model that performs frame-level speaker diarization in streaming mode. It achieves 22.03% macro DER across 8 standard corpora (collar=0.25s), ranking #1 on all 8 against pyannote 3.1 and stock Sortformer v2.1.
Built by Audar AI as part of the Audar Unified Speech Platform.
Key Features
- 8-speaker streaming diarization via surgical head expansion (4 → 8 speakers, +2,312 parameters)
- Arrival-Order Speaker Cache (AOSC) for identity persistence across sessions (>74 min tested)
- 1.04s algorithmic latency with RTF 0.003–0.004
- 93% frozen encoder — only 8.15M of 117.7M parameters are trainable
- Zero WavLM dependency — runs entirely on log-mel + Conformer, no external embeddings
Architecture
| Component | Details |
|---|---|
| Encoder | 17-layer FastConformer (109.55M params, frozen) |
| Refinement | 2-layer Transformer |
| Output | 8 speaker-assignment heads (SortformerModules) |
| Speaker Cache | AOSC (arrival-order, margin-based assignment) |
| Loss | Binary cross-entropy (hybrid ATS + PIL) |
| Total Parameters | 117.7M (8.15M trainable, 6.9%) |
| Head Expansion | +2,312 params (2 Linear layers: 384→8, 192→8) |
| Input | 16kHz mono audio → 80-dim log-mel spectrogram |
| Frame Rate | 12.5 Hz (80ms per frame) |
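The Loss row names a hybrid of ATS and PIL. A minimal sketch follows, assuming ATS means binary cross-entropy against targets sorted by speaker arrival time and PIL means the minimum binary cross-entropy over all speaker permutations, combined with the 0.5/0.5 weights given in the training recipe. The function names are hypothetical, and the brute-force permutation search is only practical for a handful of speakers; this is not the NeMo implementation.

```python
import math
from itertools import permutations


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a [frames][speakers] grid."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def reorder(target, order):
    """Reorder the speaker columns of a [frames][speakers] target."""
    return [[row[s] for s in order] for row in target]


def arrival_order(target):
    """Speaker indices sorted by first active frame (silent speakers last)."""
    n_frames, n_spk = len(target), len(target[0])

    def first_active(spk):
        for f in range(n_frames):
            if target[f][spk]:
                return f
        return n_frames

    return sorted(range(n_spk), key=first_active)


def hybrid_loss(pred, target, w_ats=0.5, w_pil=0.5):
    """Weighted sum of the arrival-order loss and the best-permutation loss."""
    n_spk = len(target[0])
    ats = bce(pred, reorder(target, arrival_order(target)))
    pil = min(bce(pred, reorder(target, perm))
              for perm in permutations(range(n_spk)))
    return w_ats * ats + w_pil * pil


# Toy example: reference column 1 speaks first, column 0 second.
target = [[0, 1], [0, 1], [1, 0], [1, 0]]
# Prediction already laid out in arrival order (slot 0 = first speaker).
pred = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
print(arrival_order(target), hybrid_loss(pred, target))
```

The PIL term can never exceed the ATS term, because the arrival order is itself one of the permutations being searched.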
Benchmark Results
8-corpus macro DER: 22.03% (collar=0.25s, onset=0.5, dscore)
| Corpus | Audar-Diar-V1 | pyannote 3.1 | Sortformer v2.1 |
|---|---|---|---|
| AMI (Headset Mix) | 15.24 | 28.60 | 24.84 |
| AliMeeting (Far) | 18.70 | 27.38 | 25.94 |
| DiPCo | 23.77 | 30.72 | 33.80 |
| ICSI | 14.46 | 22.48 | 23.22 |
| MSDWild (Few) | 21.09 | 27.12 | 36.92 |
| MSDWild (Many) | 29.41 | 34.83 | 50.77 |
| VoxConverse | 8.55 | 12.92 | 17.06 |
| CHiME-6 | 45.00 | 53.19 | 60.97 |
| Macro Average | 22.03 | 29.66 | 34.19 |
Out-of-domain: CALLHOME 10.29% DER vs pyannote 3.1 at 18.51% (no in-domain fine-tuning).
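The macro average is the unweighted mean of the eight per-corpus DER values. The short check below recomputes it from the table and verifies the per-corpus ranking; the variable names are illustrative only.

```python
from statistics import mean

# Per-corpus DER (%) copied from the benchmark table:
# (Audar-Diar-V1, pyannote 3.1, Sortformer v2.1)
der = {
    "AMI (Headset Mix)": (15.24, 28.60, 24.84),
    "AliMeeting (Far)": (18.70, 27.38, 25.94),
    "DiPCo": (23.77, 30.72, 33.80),
    "ICSI": (14.46, 22.48, 23.22),
    "MSDWild (Few)": (21.09, 27.12, 36.92),
    "MSDWild (Many)": (29.41, 34.83, 50.77),
    "VoxConverse": (8.55, 12.92, 17.06),
    "CHiME-6": (45.00, 53.19, 60.97),
}
systems = ("Audar-Diar-V1", "pyannote 3.1", "Sortformer v2.1")

# Macro DER: every corpus counts equally, regardless of its duration.
macro = {
    name: mean(row[i] for row in der.values())
    for i, name in enumerate(systems)
}

# True when the first column has the lowest DER on every corpus.
wins_everywhere = all(row[0] < min(row[1:]) for row in der.values())

for name, value in macro.items():
    print(f"{name}: {value:.3f}")
print("lowest DER on all corpora:", wins_everywhere)
```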
Streaming Configuration
| Parameter | Value |
|---|---|
| Chunk length | 13 frames (1.04s) |
| Right context | 7 frames |
| FIFO length | 40 frames |
| Onset threshold | 0.5 |
| Offset threshold | 0.5 |
| Latency | 1.04s algorithmic |
| RTF | 0.003–0.004 |
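The frame counts in this table convert to durations through the 80 ms frame length, and the RTF converts to wall-clock time under the usual definition (processing time divided by audio duration). A small sketch of that arithmetic, using integer milliseconds to avoid rounding noise; the helper name `processing_sec` is made up for illustration.

```python
SAMPLE_RATE_HZ = 16_000
FRAME_MS = 80            # 12.5 Hz frame rate
CHUNK_FRAMES = 13
RIGHT_CONTEXT_FRAMES = 7
FIFO_FRAMES = 40

chunk_ms = CHUNK_FRAMES * FRAME_MS                  # chunk duration
right_context_ms = RIGHT_CONTEXT_FRAMES * FRAME_MS  # look-ahead duration
fifo_ms = FIFO_FRAMES * FRAME_MS                    # FIFO buffer duration
chunk_samples = chunk_ms * SAMPLE_RATE_HZ // 1000   # samples per chunk


def processing_sec(audio_sec, rtf):
    """Wall-clock compute time implied by a real-time factor."""
    return audio_sec * rtf


print(f"chunk: {chunk_ms} ms = {chunk_samples} samples")
print(f"right context: {right_context_ms} ms, FIFO: {fifo_ms} ms")
for rtf in (0.003, 0.004):
    print(f"RTF {rtf}: one hour of audio takes about "
          f"{processing_sec(3600, rtf):.1f} s")
```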
Inference
Requirements
pip install "nemo_toolkit[asr]>=2.6.0" "torch>=2.1.0"
Offline Diarization
import nemo.collections.asr as nemo_asr

# Load the model; restore_from expects a local path to the ft8_best.nemo checkpoint
model = nemo_asr.models.SortformerEncLabelModel.restore_from(
    "audarai/Audar-Diarization-V1/ft8_best.nemo"
)
model.eval()
model.cuda()

# Simple offline inference on a single audio file.
# The threshold and padding arguments are kept as given here; if your NeMo
# release rejects them, pass them via diarize()'s postprocessing_yaml option.
segments = model.diarize(
    audio="meeting.wav",
    batch_size=1,
    num_workers=0,
    onset=0.5,
    offset=0.5,
    pad_onset=0.1,
    pad_offset=0.1,
)
Streaming Diarization
import soundfile as sf
import torch
import nemo.collections.asr as nemo_asr

# restore_from expects a local path to the ft8_best.nemo checkpoint
model = nemo_asr.models.SortformerEncLabelModel.restore_from(
    "audarai/Audar-Diarization-V1/ft8_best.nemo"
)
model.eval()
model.cuda()

# Configure streaming
model.cfg.streaming_config = {
    "chunk_len": 13,  # 13 frames = 1.04s chunks
    "right_context": 7,
    "fifo_len": 40,
    "cache_update": 300,
}

# Process audio in chunks (read as float32 to match the model weights)
audio, sr = sf.read("meeting.wav", dtype="float32")
assert sr == 16000, "Resample to 16kHz"
assert audio.ndim == 1, "Downmix to mono"
chunk_samples = int(1.04 * sr)  # 1.04s chunks

for i in range(0, len(audio), chunk_samples):
    chunk = torch.tensor(audio[i:i + chunk_samples]).unsqueeze(0).cuda()
    chunk_len = torch.tensor([chunk.shape[1]]).cuda()
    with torch.no_grad():
        preds = model.forward(
            input_signal=chunk,
            input_signal_length=chunk_len,
        )
    # preds: [batch, frames, num_speakers] probabilities
    # Apply onset/offset thresholds to get binary speaker activity
    speaker_activity = (preds > 0.5).int()
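Turning frame-level probabilities into timed segments takes one more step. The sketch below applies onset/offset hysteresis per speaker and emits `(speaker, start, end)` tuples at the 80 ms frame step. It is a dependency-free illustration working on nested lists (for example `preds[0].tolist()`), not NeMo's own post-processing, and `probs_to_segments` is a hypothetical helper name.

```python
def probs_to_segments(probs, onset=0.5, offset=0.5, frame_sec=0.08):
    """Convert [frames][speakers] probabilities into (speaker, start, end).

    A speaker becomes active when its probability rises above `onset`
    and stays active until the probability falls below `offset`.
    """
    n_spk = len(probs[0]) if probs else 0
    segments = []
    for spk in range(n_spk):
        active, start = False, 0
        for f, row in enumerate(probs):
            p = row[spk]
            if not active and p > onset:
                active, start = True, f
            elif active and p < offset:
                segments.append((spk, start * frame_sec, f * frame_sec))
                active = False
        if active:  # close a segment still open at the end of the chunk
            segments.append((spk, start * frame_sec, len(probs) * frame_sec))
    # Sort by start time, then speaker index
    return sorted(segments, key=lambda seg: (seg[1], seg[0]))


# Five frames, two speakers: speaker 0 talks first, then speaker 1.
example = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.9], [0.1, 0.8]]
for spk, start, end in probs_to_segments(example):
    print(f"speaker_{spk}: {start:.2f}s to {end:.2f}s")
```

With equal onset and offset thresholds this reduces to plain thresholding; setting `onset` above `offset` suppresses rapid flicker around the decision boundary.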
Integration with Audar-RT Gateway
The model is served as part of the audar-rt gateway via a diarization sidecar:
# POST endpoint (batch)
curl -X POST http://localhost:8001/v1/audio/transcriptions \
  -F "file=@meeting.wav" \
  -F "diarize=true"
# WebSocket endpoint (streaming)
wscat -c "ws://localhost:8001/v1/realtime?diarize=true"
Response includes speaker-attributed segments:
{
  "segments": [
    {"speaker": "speaker_0", "start": 0.0, "end": 3.2, "text": "..."},
    {"speaker": "speaker_1", "start": 2.8, "end": 5.1, "text": "..."}
  ]
}
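A client can aggregate these segments directly. The sketch below parses a response shaped like the example, sums talk time per speaker, and measures overlap between consecutive segments; `talk_time` and `consecutive_overlap` are illustrative helper names, not part of the gateway.

```python
import json
from collections import defaultdict

# Response body shaped like the example above.
response_text = """
{
  "segments": [
    {"speaker": "speaker_0", "start": 0.0, "end": 3.2, "text": "..."},
    {"speaker": "speaker_1", "start": 2.8, "end": 5.1, "text": "..."}
  ]
}
"""


def talk_time(response):
    """Total speaking time in seconds for each speaker label."""
    totals = defaultdict(float)
    for seg in response["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def consecutive_overlap(response):
    """Seconds of overlap between each pair of neighbouring segments."""
    segs = sorted(response["segments"], key=lambda s: s["start"])
    return sum(
        max(0.0, min(a["end"], b["end"]) - b["start"])
        for a, b in zip(segs, segs[1:])
    )


response = json.loads(response_text)
for speaker, seconds in talk_time(response).items():
    print(f"{speaker}: {seconds:.1f} s")
print(f"overlap: {consecutive_overlap(response):.1f} s")
```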
Training
| Detail | Value |
|---|---|
| Base model | nvidia/diar_streaming_sortformer_4spk-v2.1 |
| Head expansion | 4 → 8 speakers (+2,312 params) |
| Training data | 486h (275h real far-field + 200h synthetic multi-speaker) |
| Hardware | 32× NVIDIA H100 80GB (4 nodes, InfiniBand) |
| Optimizer | AdamW, cosine schedule |
| Phase 1 (4-spk) | 10,000 steps, LR 1e-5, batch 48 |
| Phase 2 (8-spk) | 5,000 steps, LR 2e-5, 500-step warmup |
| Training time | <12 hours |
| Frozen | ConformerEncoder (109.55M params, 93%) |
| Framework | NVIDIA NeMo 2.7.3 |
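For orientation, the Phase 2 row (5,000 steps, LR 2e-5, 500-step warmup, cosine schedule) corresponds to a learning-rate curve like the one sketched below. The table does not state the warmup shape or the final learning rate, so linear warmup and decay to zero are assumptions here, and `lr_at` is a hypothetical helper rather than the training code.

```python
import math


def lr_at(step, peak_lr=2e-5, warmup_steps=500, total_steps=5_000, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Assumed linear ramp that reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine from peak_lr down to min_lr (assumed to be 0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 499, 500, 2750, 5000):
    print(f"step {step:>4}: lr = {lr_at(step):.3e}")
```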
Training Recipe
- Phase 1 (4-spk fine-tune): Binary cross-entropy training (0.5×ATS + 0.5×PIL loss) on 275h real far-field data with encoder frozen, 10,000 steps
- Head expansion: Clone pretrained 4-speaker slot weights into 8 slots with scaled Gaussian noise for symmetry breaking (+2,312 params)
- Phase 2 (8-spk training): Continue with 486h mixed data (real + 200h synthetic 5-8 speaker), 5,000 steps
- Threshold tuning: Correcting the onset threshold from 0.3 to 0.5 improves DER by 5.44 percentage points
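The head-expansion step can be illustrated without any framework. The sketch below grows a 4-output linear head to 8 outputs by cloning the existing rows and perturbing the copies with Gaussian noise scaled to the weight standard deviation. It then confirms that doing this for the 384-input and 192-input layers from the Architecture table adds 2,312 parameters, assuming both layers carry biases. The noise scale, the unperturbed bias copy, and the random stand-in weights are all assumptions; the real procedure operates on the pretrained NeMo weights.

```python
import random


def expand_head(weight, bias, noise_scale=0.01, seed=0):
    """Double a linear head's outputs by cloning rows and adding noise.

    weight: list of N rows (one per speaker slot), each of length in_features.
    bias:   list of N floats.
    """
    rng = random.Random(seed)
    flat = [w for row in weight for w in row]
    mean = sum(flat) / len(flat)
    std = (sum((w - mean) ** 2 for w in flat) / len(flat)) ** 0.5
    sigma = noise_scale * std  # noise scaled to the existing weights
    # Noisy copies break the symmetry between original and cloned slots
    cloned_rows = [[w + rng.gauss(0.0, sigma) for w in row] for row in weight]
    return weight + cloned_rows, bias + list(bias)


def count_params(weight, bias):
    return sum(len(row) for row in weight) + len(bias)


added = 0
for in_features in (384, 192):
    rng = random.Random(in_features)
    # Random stand-ins for the pretrained 4-speaker weights
    w4 = [[rng.uniform(-0.1, 0.1) for _ in range(in_features)] for _ in range(4)]
    b4 = [0.0] * 4
    w8, b8 = expand_head(w4, b4)
    added += count_params(w8, b8) - count_params(w4, b4)

print(added)  # 2312 = (4*384 + 4) + (4*192 + 4)
```

Starting the new slots from trained weights, rather than from random initialization, means the expanded head initially reproduces the 4-speaker behavior, while the noise lets the cloned slots diverge during Phase 2.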
Limitations
- Concurrent speaker wall: Reliably handles 4–5 simultaneously active speakers; up to 8 sequential speakers tracked without degradation
- Noise sensitivity: CHiME-6 (extreme far-field noise) remains challenging at 45.00% DER
- Input requirements: 16kHz mono audio; no built-in resampling
Citation
@techreport{audar-diar-2026,
  title  = {{Audar-RT-Diar-v1}: Real-time Streaming Speaker Diarization
            for 8 Speakers via Surgical Head Expansion and
            Arrival-Order Speaker Cache},
  author = {{Audar AI}},
  year   = {2026},
  note   = {arXiv preprint (forthcoming)},
}
License
Apache 2.0. See LICENSE for details.