Audar-Diarization-V1

Real-time streaming speaker diarization for up to 8 speakers.

Audar-Diarization-V1 is a fine-tuned NVIDIA Streaming Sortformer model that performs frame-level speaker diarization in streaming mode. It achieves a 22.03% macro-averaged diarization error rate (DER) across 8 standard corpora (collar=0.25s), with the lowest DER on all 8 when compared against pyannote 3.1 and stock Sortformer v2.1.

Built by Audar AI as part of the Audar Unified Speech Platform.

Key Features

  • 8-speaker streaming diarization via surgical head expansion (4 → 8 speakers, +2,312 parameters)
  • Arrival-Order Speaker Cache (AOSC) for identity persistence across sessions (>74 min tested)
  • 1.04s algorithmic latency with RTF 0.003–0.004
  • 93% frozen encoder — only 8.15M of 117.7M parameters are trainable
  • Zero WavLM dependency — runs entirely on log-mel + Conformer, no external embeddings

Architecture

| Component | Details |
|---|---|
| Encoder | 17-layer FastConformer (109.55M params, frozen) |
| Refinement | 2-layer Transformer |
| Output | 8 speaker-assignment heads (SortformerModules) |
| Speaker Cache | AOSC (arrival-order, margin-based assignment) |
| Loss | Binary cross-entropy (hybrid ATS + PIL) |
| Total Parameters | 117.7M (8.15M trainable, 6.9%) |
| Head Expansion | +2,312 params (2 Linear layers: 384→8, 192→8) |
| Input | 16kHz mono audio → 80-dim log-mel spectrogram |
| Frame Rate | 12.5 Hz (80ms per frame) |
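
The +2,312 figure follows from widening both output layers from 4 to 8 slots. The sketch below is a minimal illustration, not the training code: `expand_head`, `added_params`, the noise scale, and the choice to clone slot k into slot k+4 are assumptions made for illustration. Only the input widths (384 and 192) and the 4 → 8 expansion come from the table.

```python
import random

def expand_head(weight, bias, new_out, noise_scale=0.01, seed=0):
    """Grow a linear head from len(weight) output slots to new_out slots.

    weight: list of rows, one row of in_features floats per output slot.
    bias:   list with one float per output slot.
    New slots are copies of existing slots plus Gaussian noise, so the
    copies do not stay identical during training. noise_scale is a
    hypothetical value, not one taken from the model card.
    """
    rng = random.Random(seed)
    old_out = len(weight)
    new_weight = [row[:] for row in weight]   # original slots stay untouched
    new_bias = bias[:]
    for slot in range(old_out, new_out):
        src = slot % old_out                  # slot 4 clones slot 0, slot 5 clones slot 1, ...
        new_weight.append([w + rng.gauss(0.0, noise_scale) for w in weight[src]])
        new_bias.append(bias[src] + rng.gauss(0.0, noise_scale))
    return new_weight, new_bias

def added_params(in_features, old_out, new_out):
    """Extra weights plus extra biases when widening a linear layer's output."""
    return (new_out - old_out) * (in_features + 1)

# Stand-in 4-slot heads with the two input widths listed in the table.
init_rng = random.Random(1)
heads = {}
for in_features in (384, 192):
    w4 = [[init_rng.uniform(-0.1, 0.1) for _ in range(in_features)] for _ in range(4)]
    b4 = [0.0] * 4
    heads[in_features] = expand_head(w4, b4, new_out=8)

total_added = sum(added_params(f, 4, 8) for f in (384, 192))
print(total_added)  # 2312 = 4 * (384 + 1) + 4 * (192 + 1)
```

The same accounting explains the trainable share: 8.15M of 117.7M parameters is about 6.9%, leaving the 109.55M-parameter encoder (about 93%) frozen.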

Benchmark Results

8-corpus macro DER: 22.03% (collar=0.25s, onset=0.5, dscore)

| Corpus | Audar-Diar-V1 DER (%) | pyannote 3.1 DER (%) | Sortformer v2.1 DER (%) |
|---|---|---|---|
| AMI (Headset Mix) | 15.24 | 28.60 | 24.84 |
| AliMeeting (Far) | 18.70 | 27.38 | 25.94 |
| DiPCo | 23.77 | 30.72 | 33.80 |
| ICSI | 14.46 | 22.48 | 23.22 |
| MSDWild (Few) | 21.09 | 27.12 | 36.92 |
| MSDWild (Many) | 29.41 | 34.83 | 50.77 |
| VoxConverse | 8.55 | 12.92 | 17.06 |
| CHiME-6 | 45.00 | 53.19 | 60.97 |
| Macro Average | 22.03 | 29.66 | 34.19 |

Out-of-domain: CALLHOME 10.29% DER vs pyannote 3.1 at 18.51% (no in-domain fine-tuning).
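
The macro average is the unweighted mean of the eight per-corpus DER values, so it can be checked directly from the table. The snippet below is a small sanity check, assuming half-up rounding to two decimals (which reproduces the 29.66 reported for pyannote 3.1); the `DER` dictionary and `macro_average` helper are illustrative names.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-corpus DER (%) copied from the table:
# (Audar-Diar-V1, pyannote 3.1, Sortformer v2.1)
DER = {
    "AMI (Headset Mix)": ("15.24", "28.60", "24.84"),
    "AliMeeting (Far)":  ("18.70", "27.38", "25.94"),
    "DiPCo":             ("23.77", "30.72", "33.80"),
    "ICSI":              ("14.46", "22.48", "23.22"),
    "MSDWild (Few)":     ("21.09", "27.12", "36.92"),
    "MSDWild (Many)":    ("29.41", "34.83", "50.77"),
    "VoxConverse":       ("8.55",  "12.92", "17.06"),
    "CHiME-6":           ("45.00", "53.19", "60.97"),
}

def macro_average(column):
    """Unweighted mean over corpora, rounded half-up to two decimals."""
    values = [Decimal(row[column]) for row in DER.values()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

macro = [macro_average(i) for i in range(3)]
print([str(m) for m in macro])  # ['22.03', '29.66', '34.19']

# Number of corpora where Audar-Diar-V1 has the lowest DER of the three systems.
wins = sum(
    1 for a, p, s in DER.values()
    if Decimal(a) < min(Decimal(p), Decimal(s))
)
print(wins)  # 8
```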

Streaming Configuration

| Parameter | Value |
|---|---|
| Chunk length | 13 frames (1.04s) |
| Right context | 7 frames |
| FIFO length | 40 frames |
| Onset threshold | 0.5 |
| Offset threshold | 0.5 |
| Latency | 1.04s algorithmic |
| RTF | 0.003–0.004 |
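
The frame counts above convert to time through the 12.5 Hz frame rate, and the real-time factor (RTF) converts audio duration into compute time. The following sketch only redoes that arithmetic; the helper names are illustrative, and the per-chunk compute estimate simply multiplies RTF by chunk duration.

```python
FRAME_RATE_HZ = 12.5      # 80 ms per frame
SAMPLE_RATE_HZ = 16_000   # 16 kHz mono input

def frames_to_sec(frames):
    """Duration in seconds of a given number of output frames."""
    return frames / FRAME_RATE_HZ

def compute_time_ms(rtf, audio_sec):
    """Processing time implied by a real-time factor, in milliseconds."""
    return rtf * audio_sec * 1000.0

chunk_sec = frames_to_sec(13)    # chunk length
right_ctx_sec = frames_to_sec(7)  # right context
fifo_sec = frames_to_sec(40)     # FIFO length
chunk_samples = round(chunk_sec * SAMPLE_RATE_HZ)

print(chunk_sec, right_ctx_sec, fifo_sec)  # 1.04 0.56 3.2
print(chunk_samples)                       # 16640

# With RTF 0.003 to 0.004, one 1.04 s chunk needs roughly 3 to 4 ms of compute.
low_ms = compute_time_ms(0.003, chunk_sec)
high_ms = compute_time_ms(0.004, chunk_sec)
print(f"{low_ms:.2f} ms to {high_ms:.2f} ms")
```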

Inference

Requirements

pip install "nemo_toolkit[asr]>=2.6.0" "torch>=2.1.0"

Offline Diarization

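import nemo.collections.asr as nemo_asr
from huggingface_hub import hf_hub_download

# restore_from() expects a local .nemo file, so fetch the checkpoint first
ckpt_path = hf_hub_download(
    repo_id="audarai/Audar-Diarization-V1",
    filename="ft8_best.nemo",
)

# Load model
model = nemo_asr.models.SortformerEncLabelModel.restore_from(ckpt_path)
model.eval()
model.cuda()

# Simple offline inference; the result holds the predicted speaker segments.
# If your NeMo version rejects the threshold keyword arguments below,
# supply them through diarize()'s postprocessing_yaml file instead.
segments = model.diarize(
    audio="meeting.wav",
    batch_size=1,
    num_workers=0,
    onset=0.5,
    offset=0.5,
    pad_onset=0.1,
    pad_offset=0.1,
)
print(segments)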

Streaming Diarization

import soundfile as sf
import torch
import nemo.collections.asr as nemo_asr
from huggingface_hub import hf_hub_download

# restore_from() expects a local .nemo file, so fetch the checkpoint first
ckpt_path = hf_hub_download(
    repo_id="audarai/Audar-Diarization-V1",
    filename="ft8_best.nemo",
)
model = nemo_asr.models.SortformerEncLabelModel.restore_from(ckpt_path)
model.eval()
model.cuda()

# Configure streaming
model.cfg.streaming_config = {
    "chunk_len": 13,       # 13 frames x 80ms = 1.04s chunks
    "right_context": 7,
    "fifo_len": 40,
    "cache_update": 300,
}

# Load audio as float32 to match the model weights (soundfile defaults to float64)
audio, sr = sf.read("meeting.wav", dtype="float32")
assert sr == 16000, "Resample to 16kHz"
assert audio.ndim == 1, "Downmix to mono"

# Process audio in chunks
chunk_samples = int(1.04 * sr)  # 1.04s chunks (16,640 samples)
for i in range(0, len(audio), chunk_samples):
    chunk = torch.tensor(audio[i:i + chunk_samples]).unsqueeze(0).cuda()
    chunk_len = torch.tensor([chunk.shape[1]]).cuda()

    with torch.no_grad():
        # SortformerEncLabelModel.forward takes audio_signal / audio_signal_length
        preds = model.forward(
            audio_signal=chunk,
            audio_signal_length=chunk_len,
        )
    # preds: [batch, frames, num_speakers] probabilities
    # Apply onset/offset thresholds to get binary speaker activity
    speaker_activity = (preds > 0.5).int()
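
To turn per-frame probabilities into timed segments, one common approach is hysteresis thresholding (a speaker turns on above the onset threshold and off below the offset threshold) followed by run-length grouping at 80ms per frame. The sketch below shows this for a single speaker in plain Python; `binarize`, `to_segments`, and the `t0` chunk offset are hypothetical helpers, not part of NeMo or of this model's post-processing.

```python
FRAME_SEC = 0.08  # 12.5 Hz frame rate

def binarize(probs, onset=0.5, offset=0.5):
    """Hysteresis thresholding of one speaker's per-frame probabilities."""
    active = False
    out = []
    for p in probs:
        if active:
            active = p >= offset   # stay on until the score drops below offset
        else:
            active = p > onset     # switch on once the score exceeds onset
        out.append(active)
    return out

def to_segments(active, frame_sec=FRAME_SEC, t0=0.0):
    """Group consecutive active frames into (start, end) times in seconds.

    t0 is the start time of the chunk within the full recording.
    """
    segments = []
    start = None
    for i, is_on in enumerate(active):
        if is_on and start is None:
            start = i
        elif not is_on and start is not None:
            segments.append((round(t0 + start * frame_sec, 3),
                             round(t0 + i * frame_sec, 3)))
            start = None
    if start is not None:  # speaker still active at the end of the chunk
        segments.append((round(t0 + start * frame_sec, 3),
                         round(t0 + len(active) * frame_sec, 3)))
    return segments

probs = [0.1, 0.7, 0.9, 0.6, 0.2, 0.1, 0.8, 0.8]
print(to_segments(binarize(probs)))  # [(0.08, 0.32), (0.48, 0.64)]
```

With onset and offset both at 0.5, as in the streaming configuration, this reduces to a plain 0.5 threshold; the two values only behave differently when they are set apart.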

Integration with Audar-RT Gateway

The model is served as part of the audar-rt gateway via a diarization sidecar:

# POST endpoint (batch)
curl -X POST http://localhost:8001/v1/audio/transcriptions \
  -F "file=@meeting.wav" \
  -F "diarize=true"

# WebSocket endpoint (streaming)
wscat -c "ws://localhost:8001/v1/realtime?diarize=true"

Response includes speaker-attributed segments:

{
  "segments": [
    {"speaker": "speaker_0", "start": 0.0, "end": 3.2, "text": "..."},
    {"speaker": "speaker_1", "start": 2.8, "end": 5.1, "text": "..."}
  ]
}
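
Because segments from different speakers may overlap in time (as in the example, where speaker_1 starts before speaker_0 ends), a client usually wants per-speaker totals and the overlapping regions. The following is a minimal sketch that works on the JSON shape shown above; it uses only the fields in that example and makes no assumption about other fields the gateway may return.

```python
import json
from collections import defaultdict
from itertools import combinations

response_text = """
{
  "segments": [
    {"speaker": "speaker_0", "start": 0.0, "end": 3.2, "text": "..."},
    {"speaker": "speaker_1", "start": 2.8, "end": 5.1, "text": "..."}
  ]
}
"""
segments = json.loads(response_text)["segments"]

# Total speaking time per speaker, in seconds.
totals = defaultdict(float)
for seg in segments:
    totals[seg["speaker"]] += seg["end"] - seg["start"]
talk_time = {spk: round(sec, 3) for spk, sec in totals.items()}

# Regions where two different speakers are active at the same time.
overlaps = []
for a, b in combinations(segments, 2):
    if a["speaker"] == b["speaker"]:
        continue
    start = max(a["start"], b["start"])
    end = min(a["end"], b["end"])
    if end > start:
        overlaps.append((a["speaker"], b["speaker"], start, end))

print(talk_time)  # {'speaker_0': 3.2, 'speaker_1': 2.3}
print(overlaps)   # [('speaker_0', 'speaker_1', 2.8, 3.2)]
```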

Training

| Detail | Value |
|---|---|
| Base model | nvidia/diar_streaming_sortformer_4spk-v2.1 |
| Head expansion | 4 → 8 speakers (+2,312 params) |
| Training data | 486h (275h real far-field + 200h synthetic multi-speaker) |
| Hardware | 32× NVIDIA H100 80GB (4 nodes, InfiniBand) |
| Optimizer | AdamW, cosine schedule |
| Phase 1 (4-spk) | 10,000 steps, LR 1e-5, batch 48 |
| Phase 2 (8-spk) | 5,000 steps, LR 2e-5, 500-step warmup |
| Training time | <12 hours |
| Frozen | ConformerEncoder (109.55M params, 93%) |
| Framework | NVIDIA NeMo 2.7.3 |

Training Recipe

  1. Phase 1 (4-spk fine-tune): Binary cross-entropy training (0.5×ATS + 0.5×PIL loss) on 275h real far-field data with encoder frozen, 10,000 steps
  2. Head expansion: Clone pretrained 4-speaker slot weights into 8 slots with scaled Gaussian noise for symmetry breaking (+2,312 params)
  3. Phase 2 (8-spk training): Continue with 486h mixed data (real + 200h synthetic 5-8 speaker), 5,000 steps
  4. Threshold tuning: Raising the onset threshold from 0.3 to 0.5 improves DER by 5.44 percentage points
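
The 0.5×ATS + 0.5×PIL objective in step 1 can be illustrated with the usual definitions of the two terms: the arrival-time-sort (ATS) loss scores predictions against targets whose speaker columns are sorted by first speech onset, while the permutation-invariant loss (PIL) takes the best score over all column permutations. The sketch below is a pure-Python illustration under those assumed definitions; the function names are hypothetical and the actual NeMo training code may differ in detail.

```python
import math
from itertools import permutations

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over all frames and speaker slots."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count

def reorder(target, order):
    """Rearrange speaker columns of a frames x speakers label matrix."""
    return [[row[s] for s in order] for row in target]

def arrival_order(target):
    """Speaker indices sorted by the first frame in which each one speaks."""
    def first_active(s):
        for t, row in enumerate(target):
            if row[s]:
                return t
        return len(target)  # silent speakers go last
    return sorted(range(len(target[0])), key=first_active)

def ats_loss(pred, target):
    return bce(pred, reorder(target, arrival_order(target)))

def pil_loss(pred, target):
    # Exhaustive search; fine for a demo, costly for 8 speakers (40,320 permutations).
    n_spk = len(target[0])
    return min(bce(pred, reorder(target, perm)) for perm in permutations(range(n_spk)))

def hybrid_loss(pred, target, w_ats=0.5, w_pil=0.5):
    return w_ats * ats_loss(pred, target) + w_pil * pil_loss(pred, target)

# Labels list the later speaker in column 0; predictions follow arrival order.
target = [[0, 1], [0, 1], [1, 0], [1, 0]]
pred = [[0.9, 0.1], [0.8, 0.1], [0.2, 0.9], [0.1, 0.8]]
print(round(hybrid_loss(pred, target), 4))
```

When predictions already follow arrival order, the two terms coincide; when they do not, the ATS term is larger than the PIL term, which is what pushes the output slots toward arrival order.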

Limitations

  • Concurrent speaker wall: Reliably handles 4–5 simultaneously active speakers; up to 8 sequential speakers tracked without degradation
  • Noise sensitivity: CHiME-6 (extreme far-field noise) remains challenging at 45.00% DER
  • Input requirements: 16kHz mono audio; no built-in resampling

Citation

@techreport{audar-diar-2026,
  title   = {{Audar-RT-Diar-v1}: Real-time Streaming Speaker Diarization
             for 8 Speakers via Surgical Head Expansion and
             Arrival-Order Speaker Cache},
  author  = {{Audar AI}},
  year    = {2026},
  note    = {arXiv preprint (forthcoming)},
}

License

Apache 2.0. See LICENSE for details.
