Audar-Diarization-V1

Real-time streaming speaker diarization for up to 8 speakers.

Audar-Diarization-V1 is a fine-tuned NVIDIA Streaming Sortformer model that performs frame-level speaker diarization in streaming mode. It achieves 22.03% macro DER across 8 standard corpora (collar=0.25s), ranking #1 on all 8 against pyannote 3.1 and stock Sortformer v2.1.

Built by Audar AI as part of the Audar Unified Speech Platform.

Key Features

8-speaker streaming diarization via surgical head expansion (4 → 8 speakers, +2,312 parameters)
Arrival-Order Speaker Cache (AOSC) for identity persistence across sessions (>74 min tested)
1.04s algorithmic latency with RTF 0.003–0.004
93% frozen encoder — only 8.15M of 117.7M parameters are trainable
Zero WavLM dependency — runs entirely on log-mel + Conformer, no external embeddings

Architecture

Component	Details
Encoder	17-layer FastConformer (109.55M params, frozen)
Refinement	2-layer Transformer
Output	8 speaker-assignment heads (SortformerModules)
Speaker Cache	AOSC (arrival-order, margin-based assignment)
Loss	Binary cross-entropy (hybrid ATS + PIL)
Total Parameters	117.7M (8.15M trainable, 6.9%)
Head Expansion	+2,312 params (2 Linear layers: 384→8, 192→8)
Input	16kHz mono audio → 80-dim log-mel spectrogram
Frame Rate	12.5 Hz (80ms per frame)
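
The parameter counts in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes each expanded layer is a plain Linear layer with a bias term; that is an assumption, since the card only gives the layer shapes. The helper name linear_params is illustrative.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Parameter count of a Linear layer with bias: weights plus biases."""
    return in_features * out_features + out_features

# Input widths from the table (384 and 192); outputs grow from 4 to 8.
added = sum(
    linear_params(width, 8) - linear_params(width, 4)
    for width in (384, 192)
)
print(added)  # 2312

# Trainable and frozen shares, using the counts (in millions) from the table.
total_m, trainable_m, frozen_m = 117.7, 8.15, 109.55
print(f"trainable: {trainable_m / total_m:.1%}, frozen: {frozen_m / total_m:.1%}")
```

Under that assumption the two layers add 1,540 and 772 parameters respectively, which matches the reported +2,312.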

Benchmark Results

8-corpus macro DER: 22.03% (collar=0.25s, onset=0.5, dscore). All table values are DER in percent; lower is better.

Corpus	Audar-Diar-V1	pyannote 3.1	Sortformer v2.1
AMI (Headset Mix)	15.24	28.60	24.84
AliMeeting (Far)	18.70	27.38	25.94
DiPCo	23.77	30.72	33.80
ICSI	14.46	22.48	23.22
MSDWild (Few)	21.09	27.12	36.92
MSDWild (Many)	29.41	34.83	50.77
VoxConverse	8.55	12.92	17.06
CHiME-6	45.00	53.19	60.97
Macro Average	22.03	29.66	34.19

Out-of-domain: CALLHOME 10.29% DER vs pyannote 3.1 at 18.51% (no in-domain fine-tuning).
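
The macro average is the unweighted mean of the per-corpus DER values, so it can be reproduced directly from the table. The snippet below is a small check that copies the table values and recomputes each column's mean; the variable names are illustrative.

```python
# DER (%) per corpus, copied from the benchmark table:
# (Audar-Diar-V1, pyannote 3.1, Sortformer v2.1)
der = {
    "AMI (Headset Mix)": (15.24, 28.60, 24.84),
    "AliMeeting (Far)": (18.70, 27.38, 25.94),
    "DiPCo": (23.77, 30.72, 33.80),
    "ICSI": (14.46, 22.48, 23.22),
    "MSDWild (Few)": (21.09, 27.12, 36.92),
    "MSDWild (Many)": (29.41, 34.83, 50.77),
    "VoxConverse": (8.55, 12.92, 17.06),
    "CHiME-6": (45.00, 53.19, 60.97),
}
systems = ("Audar-Diar-V1", "pyannote 3.1", "Sortformer v2.1")

# Unweighted mean over corpora for each system.
macro = {
    name: sum(row[i] for row in der.values()) / len(der)
    for i, name in enumerate(systems)
}
for name, value in macro.items():
    print(f"{name}: {value:.2f}")

# Number of corpora on which the first column has the lowest DER.
wins = sum(row[0] == min(row) for row in der.values())
print(f"lowest DER on {wins} of {len(der)} corpora")
```

Because every corpus counts equally, a large corpus such as CHiME-6 does not dominate the average the way it would in a duration-weighted score.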

Streaming Configuration

Parameter	Value
Chunk length	13 frames (1.04s)
Right context	7 frames
FIFO length	40 frames
Onset threshold	0.5
Offset threshold	0.5
Latency	1.04s algorithmic
RTF	0.003–0.004
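
The durations behind these settings follow from the 80 ms frame and the 16 kHz sample rate. The sketch below converts the frame counts in the table to seconds and samples, and shows what the quoted RTF range means for one hour of audio. It assumes RTF is defined as processing time divided by audio duration, which is the usual convention; the card does not spell this out.

```python
FRAME_SEC = 0.08      # 12.5 Hz frame rate, i.e. 80 ms per frame
SAMPLE_RATE = 16_000  # Hz

def frames_to_seconds(frames: int) -> float:
    """Duration in seconds covered by a number of output frames."""
    return frames * FRAME_SEC

chunk_sec = frames_to_seconds(13)
chunk_samples = round(chunk_sec * SAMPLE_RATE)
print(f"chunk: {chunk_sec:.2f} s = {chunk_samples} samples")
print(f"right context: {frames_to_seconds(7):.2f} s")
print(f"FIFO: {frames_to_seconds(40):.2f} s")

# Assumed definition: RTF = processing time / audio duration.
for rtf in (0.003, 0.004):
    print(f"RTF {rtf}: {rtf * 3600:.1f} s of compute per hour of audio")
```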

Inference

Requirements

pip install "nemo_toolkit[asr]>=2.6.0" "torch>=2.1.0"

Offline Diarization

import nemo.collections.asr as nemo_asr

# Load model (restore_from expects a path to a local .nemo checkpoint)
model = nemo_asr.models.SortformerEncLabelModel.restore_from(
    "audarai/Audar-Diarization-V1/ft8_best.nemo"
)
model.eval()
model.cuda()

# Run offline inference on an audio file and keep the returned segments
segments = model.diarize(
    audio="meeting.wav",
    batch_size=1,
    num_workers=0,
    onset=0.5,
    offset=0.5,
    pad_onset=0.1,
    pad_offset=0.1,
)

Streaming Diarization

import torch
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.SortformerEncLabelModel.restore_from(
    "audarai/Audar-Diarization-V1/ft8_best.nemo"
)
model.eval()
model.cuda()

# Configure streaming
model.cfg.streaming_config = {
    "chunk_len": 13,       # 1.04s chunks
    "right_context": 7,
    "fifo_len": 40,
    "cache_update": 300,
}

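# Process audio in chunks
import soundfile as sf

# Read as float32 to match the model weights (soundfile defaults to float64)
audio, sr = sf.read("meeting.wav", dtype="float32")
assert sr == 16000, "Resample to 16kHz"
assert audio.ndim == 1, "Downmix to mono"

chunk_samples = int(1.04 * sr)  # 1.04s chunks
for i in range(0, len(audio), chunk_samples):
    # Shape [1, samples]: a batch containing a single chunk
    chunk = torch.tensor(audio[i:i + chunk_samples]).unsqueeze(0).cuda()
    chunk_len = torch.tensor([chunk.shape[1]]).cuda()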

    with torch.no_grad():
        preds = model.forward(
            input_signal=chunk,
            input_signal_length=chunk_len,
        )
    # preds: [batch, frames, num_speakers] probabilities
    # Apply onset/offset thresholds to get binary speaker activity
    speaker_activity = (preds > 0.5).int()
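
Frame-level activity usually needs to be turned into time-stamped segments before it is useful downstream. The following is a minimal sketch of onset/offset thresholding for a single speaker, assuming 80 ms frames as stated in the Architecture table. It is not NeMo's post-processing; the function name probs_to_segments is hypothetical, and padding (pad_onset, pad_offset) is left out.

```python
from typing import List, Tuple

FRAME_SEC = 0.08  # 80 ms per frame

def probs_to_segments(
    probs: List[float], onset: float = 0.5, offset: float = 0.5
) -> List[Tuple[float, float]]:
    """Turn one speaker's per-frame probabilities into (start, end) segments.

    A segment opens when the probability rises above `onset` and closes
    when it falls below `offset`.
    """
    segments = []
    start = None  # index of the frame where the open segment began
    for i, p in enumerate(probs):
        if start is None and p > onset:
            start = i
        elif start is not None and p < offset:
            segments.append((round(start * FRAME_SEC, 2), round(i * FRAME_SEC, 2)))
            start = None
    if start is not None:  # speech still active at the end of the input
        segments.append((round(start * FRAME_SEC, 2), round(len(probs) * FRAME_SEC, 2)))
    return segments

probs = [0.1, 0.7, 0.9, 0.8, 0.2, 0.1, 0.6, 0.6]
print(probs_to_segments(probs))  # [(0.08, 0.32), (0.48, 0.64)]
```

For a prediction tensor shaped [batch, frames, num_speakers], the same function could be applied to each speaker column in turn, for example on preds[0, :, k].tolist().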

Integration with Audar-RT Gateway

The model is served as part of the audar-rt gateway via a diarization sidecar:

# POST endpoint (batch)
curl -X POST http://localhost:8001/v1/audio/transcriptions \
  -F "file=@meeting.wav" \
  -F "diarize=true"

# WebSocket endpoint (streaming)
wscat -c "ws://localhost:8001/v1/realtime?diarize=true"

Response includes speaker-attributed segments:

{
  "segments": [
    {"speaker": "speaker_0", "start": 0.0, "end": 3.2, "text": "..."},
    {"speaker": "speaker_1", "start": 2.8, "end": 5.1, "text": "..."}
  ]
}
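
A client can aggregate this response in a few lines. The sketch below assumes the response has exactly the shape shown above and computes per-speaker talk time plus the overlap between consecutive segments. It only compares neighbouring segments, so it is a simplification for illustration rather than a full overlap analysis.

```python
import json
from collections import defaultdict

# Example payload, copied from the response shown above.
response_text = """
{
  "segments": [
    {"speaker": "speaker_0", "start": 0.0, "end": 3.2, "text": "..."},
    {"speaker": "speaker_1", "start": 2.8, "end": 5.1, "text": "..."}
  ]
}
"""

segments = json.loads(response_text)["segments"]

# Total speaking time per speaker label.
talk_time = defaultdict(float)
for seg in segments:
    talk_time[seg["speaker"]] += seg["end"] - seg["start"]

# Overlap between each segment and the one that starts next.
ordered = sorted(segments, key=lambda s: s["start"])
overlap = 0.0
for prev, cur in zip(ordered, ordered[1:]):
    overlap += max(0.0, min(prev["end"], cur["end"]) - cur["start"])

for speaker, seconds in talk_time.items():
    print(f"{speaker}: {seconds:.1f} s")
print(f"overlap: {overlap:.1f} s")
```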

Training

Detail	Value
Base model	nvidia/diar_streaming_sortformer_4spk-v2.1
Head expansion	4 → 8 speakers (+2,312 params)
Training data	486h (275h real far-field + 200h synthetic multi-speaker)
Hardware	32× NVIDIA H100 80GB (4 nodes, InfiniBand)
Optimizer	AdamW, cosine schedule
Phase 1 (4-spk)	10,000 steps, LR 1e-5, batch 48
Phase 2 (8-spk)	5,000 steps, LR 2e-5, 500-step warmup
Training time	<12 hours
Frozen	ConformerEncoder (109.55M params, 93%)
Framework	NVIDIA NeMo 2.7.3

Training Recipe

Phase 1 (4-spk fine-tune): Binary cross-entropy training with a hybrid loss (0.5×ATS + 0.5×PIL, where ATS is the arrival-time-sort loss and PIL is the permutation-invariant loss) on 275h real far-field data with encoder frozen, 10,000 steps
Head expansion: Clone pretrained 4-speaker slot weights into 8 slots with scaled Gaussian noise for symmetry breaking (+2,312 params)
Phase 2 (8-spk training): Continue with 486h mixed data (real + 200h synthetic 5-8 speaker), 5,000 steps
Threshold tuning: Correcting the onset threshold from 0.3 to 0.5 improves DER by 5.44 pp
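
The head expansion step can be illustrated with a small PyTorch sketch. This is not the training code: the function name expand_head, the noise_scale value, and the choice to keep the first four slots unchanged and add noise only to the cloned slots are all assumptions made for illustration. The recipe above states only that pretrained slot weights are cloned with scaled Gaussian noise.

```python
import torch
import torch.nn as nn

def expand_head(old: nn.Linear, new_out: int, noise_scale: float = 0.01,
                seed: int = 0) -> nn.Linear:
    """Return a Linear layer with `new_out` outputs, seeded from `old`."""
    old_out, in_features = old.weight.shape
    new = nn.Linear(in_features, new_out)
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        # Slots 0..old_out-1 keep the pretrained weights (assumption).
        new.weight[:old_out] = old.weight
        new.bias[:old_out] = old.bias
        # New slots clone pretrained rows; small noise breaks the symmetry
        # so the clones can diverge during training.
        src = torch.arange(new_out - old_out) % old_out
        noise = torch.randn(new_out - old_out, in_features, generator=gen)
        new.weight[old_out:] = old.weight[src] + noise_scale * old.weight.std() * noise
        new.bias[old_out:] = old.bias[src]
    return new

def count_params(layers) -> int:
    """Total number of parameters across a list of modules."""
    return sum(p.numel() for layer in layers for p in layer.parameters())

# Stand-ins for the two pretrained 4-speaker layers (384->4 and 192->4).
old_heads = [nn.Linear(384, 4), nn.Linear(192, 4)]
new_heads = [expand_head(h, 8) for h in old_heads]
added = count_params(new_heads) - count_params(old_heads)
print(new_heads[0].weight.shape, added)
```

Without the noise term, each cloned slot would receive the same gradients as its source slot and the two would stay identical, which is why the recipe mentions symmetry breaking.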

Limitations

Concurrent speaker wall: Reliably handles 4–5 simultaneously active speakers; up to 8 sequential speakers tracked without degradation
Noise sensitivity: CHiME-6 (extreme far-field noise) remains challenging at 45.00% DER
Input requirements: 16kHz mono audio; no built-in resampling
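
Because the model does not resample, audio at other rates or with multiple channels has to be converted first. The sketch below shows one way to do this with SciPy's polyphase resampler; the helper name to_mono_16k is hypothetical, and it assumes audio arrives as a NumPy array shaped (samples,) or (samples, channels), which is the layout soundfile returns.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000

def to_mono_16k(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix (samples, channels) audio to mono and resample to 16 kHz."""
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # average the channels
    if sr != TARGET_SR:
        g = gcd(TARGET_SR, sr)
        # Polyphase resampling by the rational factor TARGET_SR / sr.
        audio = resample_poly(audio, TARGET_SR // g, sr // g)
    return audio.astype(np.float32)

# Synthetic check: one second of 44.1 kHz stereo noise.
stereo = np.random.default_rng(0).standard_normal((44_100, 2))
mono = to_mono_16k(stereo, 44_100)
print(mono.shape, mono.dtype)  # (16000,) float32
```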

Citation

@techreport{audar-diar-2026,
  title   = {{Audar-RT-Diar-v1}: Real-time Streaming Speaker Diarization
             for 8 Speakers via Surgical Head Expansion and
             Arrival-Order Speaker Cache},
  author  = {{Audar AI}},
  year    = {2026},
  note    = {arXiv preprint (forthcoming)},
}

License

Apache 2.0. See LICENSE for details.

Paper for audarai/Audar-Diarization-V1

Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens (arXiv:2409.06656, published Sep 10, 2024)

