ECAPA2 Speaker Embedding Extractor

Link to paper: ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings.

ECAPA2 is a hybrid neural network architecture and training strategy for generating robust speaker embeddings. The provided pre-trained model has an easy-to-use API to extract speaker embeddings and other hierarchical features. More information can be found in our original ECAPA2 paper.

The speaker embeddings are recommended for tasks that rely directly on the identity of the speaker (e.g. speaker verification and speaker diarization). The hierarchical features are most useful for tasks capturing intra-speaker variance (e.g. emotion recognition and speaker profiling) and, in our experience, prove complementary to the speaker embedding. See our speaker profiling paper for an example usage of the hierarchical features.

Usage Guide

Download model

You need to install the huggingface_hub package to download the ECAPA2 model:

pip install --upgrade huggingface_hub

Or with Conda:

conda install -c conda-forge huggingface_hub

Download model:

from huggingface_hub import hf_hub_download

# automatically checks for cached file, optionally set `cache_dir` location
model_file = hf_hub_download(repo_id='Jenthe/ECAPA2', filename='ecapa2.pt', cache_dir=None)
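
Once cached, the model file can also be resolved without network access by passing local_files_only=True, a standard hf_hub_download parameter. A minimal sketch:

from huggingface_hub import hf_hub_download

# resolves the cached copy without contacting the Hub (raises if not cached yet)
model_file = hf_hub_download(repo_id='Jenthe/ECAPA2', filename='ecapa2.pt', local_files_only=True)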

Speaker Embedding Extraction

Extracting speaker embeddings is easy and only requires a few lines of code:

import torch
import torchaudio

ecapa2 = torch.jit.load(model_file, map_location='cpu')
audio, sr = torchaudio.load('sample.wav') # sample rate of 16 kHz expected

embedding = ecapa2(audio)
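
For speaker verification, embeddings of two utterances are typically compared with cosine similarity. A minimal sketch, assuming sample1.wav and sample2.wav are hypothetical 16 kHz recordings and that the model returns one embedding per utterance:

import torch
import torch.nn.functional as F
import torchaudio

ecapa2 = torch.jit.load(model_file, map_location='cpu')

# hypothetical file names; both expected at a 16 kHz sample rate
audio1, sr1 = torchaudio.load('sample1.wav')
audio2, sr2 = torchaudio.load('sample2.wav')

emb1 = ecapa2(audio1)
emb2 = ecapa2(audio2)

# higher cosine similarity indicates the same speaker; the decision
# threshold must be tuned on held-out verification trials
score = F.cosine_similarity(emb1.flatten(), emb2.flatten(), dim=0)
print(score.item())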

For faster, 16-bit half-precision CUDA inference (recommended):

import torch
import torchaudio

ecapa2 = torch.jit.load(model_file, map_location='cuda')
ecapa2.half() # optional, but results in faster inference
audio, sr = torchaudio.load('sample.wav') # sample rate of 16 kHz expected

# the input must live on the same device as the model;
# if dtype errors occur after .half(), cast the input with audio.half() as well
embedding = ecapa2(audio.to('cuda'))

The initial calls to the JIT model can in some cases take a long time because the compiler attempts to optimize the execution graph. If this causes issues, the JIT optimizer can be disabled as follows:

with torch.jit.optimized_execution(False):
  embedding = ecapa2(audio)
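
Alternatively, the one-time compilation cost can be paid up front with a few warm-up calls on dummy input before latency matters. A minimal sketch, assuming the CPU model from the first example and one second of single-channel 16 kHz audio:

import torch

# one second of random single-channel audio at the expected 16 kHz sample rate
dummy = torch.randn(1, 16000)

# the first few calls trigger the JIT optimization passes; later calls are fast
for _ in range(3):
    ecapa2(dummy)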

There is no need to call ecapa2.eval() or wrap the call in torch.no_grad(); this is handled automatically.

Citation

BibTeX:

@INPROCEEDINGS{ecapa2,
  author={Jenthe Thienpondt and Kris Demuynck},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  title={ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings},
  year={2023}
}

APA:

Thienpondt, J., & Demuynck, K. (2023). ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

Contact

Name: Jenthe Thienpondt
E-mail: jenthe.thienpondt@ugent.be
