Echo Dia (V4)

Speaker diarization model fine-tuned from DiariZen-v2 (BUT-FIT/diarizen-wavlm-large-s80-md-v2) on a multi-domain compound of meeting recordings.

Training

  • Base model: BUT-FIT/diarizen-wavlm-large-s80-md-v2
  • Training data: 9.1 h compound (AMI 3.5h + AliMeeting 2.6h + NOTSOFAR 3.0h)
  • Strategy: WavLM layer 23 unfrozen, lr_wavlm=2.5e-6, lr_head=1e-4
  • Augmentation: SpecAugment (time + freq mask) + audio noise injection
  • Duration: 60 minutes on RTX A6000 (Phase 3 winner V4)
  • Best validation DER (ES2011a, 18 min): 17.69%
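
The strategy above (one unfrozen encoder layer, separate learning rates for encoder and head) can be sketched with PyTorch parameter groups. This is a minimal illustration rather than the actual training script: `ToyEncoder`, `ToyDiarizer`, the attribute names `wavlm.layers` and `head`, the 24-layer zero-indexed layout, and the choice of AdamW are all assumptions made for the example. Only the layer index and the two learning rates come from the list above.

```python
import torch
from torch import nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the WavLM encoder: 24 layers, indexed 0..23."""

    def __init__(self, dim=8, num_layers=24):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])


class ToyDiarizer(nn.Module):
    """Hypothetical model with an encoder and a small classification head."""

    def __init__(self, dim=8, num_classes=4):
        super().__init__()
        self.wavlm = ToyEncoder(dim)
        self.head = nn.Linear(dim, num_classes)


model = ToyDiarizer()

# Freeze the whole encoder, then unfreeze only layer 23.
for p in model.wavlm.parameters():
    p.requires_grad = False
for p in model.wavlm.layers[23].parameters():
    p.requires_grad = True

# One parameter group per learning rate (optimizer choice is an assumption).
optimizer = torch.optim.AdamW(
    [
        {"params": list(model.wavlm.layers[23].parameters()), "lr": 2.5e-6},  # lr_wavlm
        {"params": list(model.head.parameters()), "lr": 1e-4},  # lr_head
    ]
)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Keeping the encoder learning rate far below the head learning rate is a common way to limit drift in pretrained features while the task head adapts.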

Test set DER (collar=0, with overlap)

Dataset          DER strict   DER col=0.25   n_meetings
AMI test         17.34%       13.95%         2
AliMeeting test  14.14%       8.66%          5
NOTSOFAR test    13.49%       8.38%          5
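
For readers new to the metric: DER is the sum of missed speech, false alarm, and speaker confusion time, divided by total reference speech time. A collar excludes a short window around each reference speaker boundary from scoring, which is why the col=0.25 figures are lower than the strict ones. The arithmetic can be sketched as follows; the durations are made up for illustration and are not taken from this evaluation, and the function name is hypothetical.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference):
    """DER = (missed + false alarm + confusion) / total reference speech time."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


# Hypothetical durations in seconds (illustrative only).
der = diarization_error_rate(
    missed=30.0, false_alarm=12.0, confusion=18.0, total_reference=600.0
)
print(f"DER = {der:.2%}")  # DER = 10.00%
```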

Usage

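import torch
from diarizen.pipelines.inference import DiariZenPipeline

# Load the v2 base pipeline, then inject the Echo Dia segmentation weights
pipe = DiariZenPipeline.from_pretrained("BUT-FIT/diarizen-wavlm-large-s80-md-v2")

# Load onto CPU so this also works without a GPU; load_state_dict copies the
# tensors onto whatever device the model already lives on.
# weights_only=False unpickles arbitrary objects: only use it on a checkpoint you trust.
sd = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)

# strict=False skips mismatched keys silently, so report them explicitly
missing, unexpected = pipe._segmentation.model.load_state_dict(sd, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

# Run diarization and print one line per speaker turn
result = pipe("audio.wav")
for seg, _, spk in result.itertracks(yield_label=True):
    print(f"{seg.start:.1f}-{seg.end:.1f}  {spk}")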
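
To score the output with standard diarization tools, the speaker turns usually need to be written as RTTM. A minimal sketch of that conversion is below. The helper `to_rttm_lines` and the sample segments are hypothetical; in practice the `(start, end, speaker)` tuples would be collected from the `result.itertracks(yield_label=True)` loop shown above.

```python
def to_rttm_lines(segments, file_id):
    """Format (start, end, speaker) tuples as RTTM SPEAKER records."""
    lines = []
    for start, end, speaker in segments:
        duration = end - start
        # RTTM fields: type, file id, channel, onset, duration, then
        # placeholders around the speaker label.
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} <NA> <NA> {speaker} <NA> <NA>"
        )
    return lines


# Hypothetical speaker turns (illustrative only).
segments = [(0.0, 2.5, "0"), (2.5, 4.0, "1")]
rttm_lines = to_rttm_lines(segments, file_id="audio")
print("\n".join(rttm_lines))
```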

License

CC BY-NC 4.0 (inherited from base model). Non-commercial use only.
