Echo Dia (V4)

Echo Dia is a fine-tune of DiariZen-v2 (BUT-FIT/diarizen-wavlm-large-s80-md-v2) on a multi-domain compound of meeting recordings.

Training

Base model: BUT-FIT/diarizen-wavlm-large-s80-md-v2
Training data: 9.1 h compound (AMI 3.5h + AliMeeting 2.6h + NOTSOFAR 3.0h)
Strategy: WavLM layer 23 unfrozen, lr_wavlm=2.5e-6, lr_head=1e-4
Augmentation: SpecAugment (time + freq mask) + audio noise injection
Training time: 60 minutes on an RTX A6000 (V4 was the Phase 3 winner)
Best validation DER (ES2011a, 18 min): 17.69%
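
The strategy line above can be sketched as two optimizer parameter groups with different learning rates. This is a minimal sketch on a hypothetical toy module (`ToySegmenter`), not the DiariZen training code: the module layout, the choice of AdamW, and the reading of "layer 23" as a zero-based index into 24 encoder layers are all assumptions.

```python
import torch
from torch import nn


class ToySegmenter(nn.Module):
    """Hypothetical stand-in: 24 encoder layers followed by a small head."""

    def __init__(self, dim=8, n_layers=24, n_classes=4):
        super().__init__()
        self.encoder = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        for layer in self.encoder:
            x = torch.relu(layer(x))
        return self.head(x)


model = ToySegmenter()

# Freeze the whole encoder, then unfreeze only layer 23 (assumed zero-based).
for p in model.encoder.parameters():
    p.requires_grad = False
for p in model.encoder[23].parameters():
    p.requires_grad = True

# One parameter group per learning rate; AdamW is an assumed optimizer choice.
optimizer = torch.optim.AdamW(
    [
        {"params": model.encoder[23].parameters(), "lr": 2.5e-6},  # lr_wavlm
        {"params": model.head.parameters(), "lr": 1e-4},  # lr_head
    ]
)
```

Keeping the encoder rate far below the head rate lets the pretrained layer adapt slowly while the head moves faster.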

Test set DER (with overlap; collar=0 and collar=0.25)

Dataset	DER strict (collar=0)	DER (collar=0.25)	Meetings
AMI test	17.34%	13.95%	2
AliMeeting test	14.14%	8.66%	5
NOTSOFAR test	13.49%	8.38%	5
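
For readers unfamiliar with the metric, DER is the sum of missed speech, false alarm, and speaker confusion divided by the total reference speech, and a collar excludes a window around each reference boundary from scoring. The following is a simplified frame-level sketch, not the scoring tool used for the table: `frame_der` and `collar_frames` are hypothetical names, and it assumes hypothesis labels are already mapped to reference labels (real scorers search for the best mapping).

```python
def frame_der(ref, hyp, collar_frames=0):
    """Frame-level DER; ref and hyp are lists of speaker-label sets per frame."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    n = len(ref)
    # A boundary is any frame where the set of reference speakers changes.
    boundaries = [i for i in range(1, n) if ref[i] != ref[i - 1]]
    skip = set()
    for b in boundaries:
        # Exclude collar_frames frames on each side of the boundary.
        skip.update(range(max(0, b - collar_frames), min(n, b + collar_frames)))
    miss = false_alarm = confusion = total = 0
    for i in range(n):
        if i in skip:
            continue
        n_ref, n_hyp = len(ref[i]), len(hyp[i])
        n_correct = len(ref[i] & hyp[i])
        miss += max(0, n_ref - n_hyp)
        false_alarm += max(0, n_hyp - n_ref)
        confusion += min(n_ref, n_hyp) - n_correct
        total += n_ref  # overlapped frames count once per reference speaker
    if total == 0:
        raise ValueError("no scored reference speech")
    return (miss + false_alarm + confusion) / total


# Toy example: the hypothesis misses one speaker in each of two overlapped frames.
ref = [{"A"}] * 4 + [{"A", "B"}] * 2 + [{"B"}] * 4
hyp = [{"A"}] * 5 + [{"B"}] * 5
der_strict = frame_der(ref, hyp)                   # 2 errors / 12 reference frames
der_collar = frame_der(ref, hyp, collar_frames=1)  # boundary frames excluded
```

In this toy case every error sits next to a boundary, so the collar removes all of them. That is the same effect that makes the collar=0.25 column lower than the strict column in the table.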

Usage

import torch
from diarizen.pipelines.inference import DiariZenPipeline

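# Load the v2 base pipeline, then inject the Echo Dia segmentation weights
pipe = DiariZenPipeline.from_pretrained("BUT-FIT/diarizen-wavlm-large-s80-md-v2")
# weights_only=False unpickles arbitrary objects, so only load a checkpoint you trust.
# Mapping to the CPU avoids requiring a GPU; load_state_dict copies to the model's device.
sd = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
# strict=False skips mismatched keys silently, so report what was skipped
load_result = pipe._segmentation.model.load_state_dict(sd, strict=False)
print("missing keys:", load_result.missing_keys)
print("unexpected keys:", load_result.unexpected_keys)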

# Run
result = pipe("audio.wav")
for seg, _, spk in result.itertracks(yield_label=True):
    print(f"{seg.start:.1f}-{seg.end:.1f}  {spk}")
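
The loop above yields one speaker turn at a time, and a common next step is a per-speaker total. The helper below is a small sketch under stated assumptions: `speaking_time` is a hypothetical name, and it works on plain `(start, end, speaker)` tuples so that it runs without the pipeline installed.

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum turn durations per speaker; overlapped time counts for each speaker."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical turns in seconds, shaped as (start, end, speaker)
turns = [(0.0, 4.5, "spk0"), (4.0, 9.0, "spk1"), (9.0, 12.5, "spk0")]
totals = speaking_time(turns)
```

With a real result, the tuples could be built as `[(seg.start, seg.end, spk) for seg, _, spk in result.itertracks(yield_label=True)]`.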

License

CC BY-NC 4.0 (inherited from base model). Non-commercial use only.
