license: cc-by-nc-4.0

ECAPA2 Speaker Embedding and Hierarchical Feature Extractor

ECAPA2 is a hybrid neural network architecture and training strategy for generating robust speaker embeddings. The provided pre-trained model has an easy-to-use API to extract speaker embeddings and other hierarchical features. More information can be found in our original ECAPA2 paper.

The speaker embeddings are recommended for tasks that rely directly on the identity of the speaker (e.g. speaker verification and speaker diarization). The hierarchical features are most useful for tasks capturing intra-speaker variance (e.g. emotion recognition and speaker profiling) and, in our experience, complement the speaker embedding. See our speaker profiling paper for an example usage of the hierarchical features.

Usage Guide

Download model

You need to install the huggingface_hub package to download the ECAPA2 model:

pip install --upgrade huggingface_hub

Or with Conda:

conda install -c conda-forge huggingface_hub

Then download the model file:

from huggingface_hub import hf_hub_download

# automatically checks for cached file, optionally set `cache_dir` location
model_file = hf_hub_download(repo_id='Jenthe/ECAPA2', filename='model.pt', cache_dir=None)

Speaker Embedding Extraction

Extracting speaker embeddings is easy and only requires a few lines of code:

import torch
import torchaudio

ecapa2_model = torch.jit.load(model_file, map_location='cpu')
audio, sr = torchaudio.load('sample.wav') # sample rate of 16 kHz expected

embedding = ecapa2_model(audio)
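Since the model expects 16 kHz input, it can help to check a file's sample rate before extraction. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files only. The helper names `wav_sample_rate` and `require_sample_rate` are hypothetical and not part of the ECAPA2 API.

```python
import wave

EXPECTED_HZ = 16000  # sample rate the model card says is expected


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def require_sample_rate(path: str, expected_hz: int = EXPECTED_HZ) -> int:
    """Raise ValueError if the file is not sampled at `expected_hz`."""
    rate = wav_sample_rate(path)
    if rate != expected_hz:
        raise ValueError(
            f"{path} is sampled at {rate} Hz, expected {expected_hz} Hz; "
            "resample it before extracting embeddings"
        )
    return rate
```

Calling `require_sample_rate('sample.wav')` before `torchaudio.load` fails early on mismatched files; alternatively, compare the `sr` value returned by `torchaudio.load` against 16000.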

For faster, 16-bit half-precision CUDA inference (recommended):

import torch
import torchaudio

ecapa2_model = torch.jit.load(model_file, map_location='cuda')
ecapa2_model.half() # optional, but results in faster inference
audio, sr = torchaudio.load('sample.wav') # sample rate of 16 kHz expected

embedding = ecapa2_model(audio)

There is no need to call ecapa2_model.eval() or torch.no_grad(); the model handles both automatically.
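For speaker verification, two embeddings are commonly compared with cosine similarity. The sketch below is dependency-free and works on plain lists of floats. The function names and the `threshold` parameter are hypothetical; this card does not prescribe a scoring method or a decision threshold, so the threshold has to be tuned on your own data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot score a zero-length embedding")
    return dot / (norm_a * norm_b)


def same_speaker(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    """Accept the trial when the score reaches a caller-supplied threshold."""
    return cosine_similarity(a, b) >= threshold
```

Assuming the model returns a tensor holding a single embedding, it can be converted with `embedding.flatten().tolist()` before scoring.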

Hierarchical Feature Extraction

To extract other hierarchical features, use the label argument, which accepts a string containing the feature ids separated by '|':

# default, only extract the embedding
feature = ecapa2_model(audio, label='embedding')

# concatenates the gfe_1, pool and embedding features
feature = ecapa2_model(audio, label='gfe_1|pool|embedding')

# returns the same output as the previous example; concatenation always follows the order of the network
feature = ecapa2_model(audio, label='embedding|gfe_1|pool')
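Because the order of ids in the label does not affect the output, a small helper can validate a label and rewrite it in a canonical order before it is passed to the model. This is a sketch, not part of the ECAPA2 API: `canonical_label` is a hypothetical name, the ids are the ones listed in this card, and the order used here assumes that the listing order of the feature ids matches the network order. Rejecting duplicate ids is a conservative choice of this sketch, not documented model behaviour.

```python
# Feature ids from this card, in the order they are listed.
# Assumption: this listing order matches the network order used for concatenation.
NETWORK_ORDER = ("gfe_1", "gfe_2", "pool", "attention", "embedding")


def canonical_label(label: str) -> str:
    """Validate a '|'-separated label and return it in the assumed network order."""
    requested = [part.strip() for part in label.split("|")]
    unknown = [part for part in requested if part not in NETWORK_ORDER]
    if unknown:
        raise ValueError(f"unknown feature id(s): {unknown}")
    if len(set(requested)) != len(requested):
        raise ValueError("duplicate feature id in label")
    return "|".join(fid for fid in NETWORK_ORDER if fid in requested)


print(canonical_label("embedding|gfe_1|pool"))  # gfe_1|pool|embedding
```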

The following table describes the available features. All features consist of the mean and variance of the frame-level encodings at the indicated layer, except for the speaker embedding.

Feature ID	Dimension	Description
gfe_1	2048	Mean and variance of frame-level features as indicated in Figure 1, extracted before ReLU and BatchNorm layer.
gfe_2	2048	Mean and variance of frame-level features as indicated in Figure 1, extracted before ReLU and BatchNorm layer.
pool	3072	Pooled statistics before the bottleneck speaker embedding layer, extracted before ReLU layer.
attention	3072	Same as the pooled statistics but with the attention weights applied.
embedding	192	The standard ECAPA2 speaker embedding.
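When several features are requested at once, the dimensions in the table can be used to locate each feature inside the concatenated output. The sketch below computes start and end offsets per feature id. It rests on two assumptions that this card does not confirm: that the row order of the table matches the network order, and that features are concatenated along the last dimension. The name `feature_slices` is hypothetical.

```python
from typing import Dict, Tuple

# Dimensions from the table above, in table row order.
# Assumption: this row order matches the network (concatenation) order.
FEATURE_DIMS: Dict[str, int] = {
    "gfe_1": 2048,
    "gfe_2": 2048,
    "pool": 3072,
    "attention": 3072,
    "embedding": 192,
}


def feature_slices(label: str) -> Dict[str, Tuple[int, int]]:
    """Map each requested feature id to its (start, end) offsets."""
    requested = set(label.split("|"))
    unknown = requested - FEATURE_DIMS.keys()
    if unknown:
        raise ValueError(f"unknown feature id(s): {sorted(unknown)}")
    slices: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for feature_id, dim in FEATURE_DIMS.items():
        if feature_id in requested:
            slices[feature_id] = (offset, offset + dim)
            offset += dim
    return slices


print(feature_slices("gfe_1|pool|embedding"))
# {'gfe_1': (0, 2048), 'pool': (2048, 5120), 'embedding': (5120, 5312)}
```

Under these assumptions, `feature[..., start:end]` would recover a single feature from the concatenated output, and the last end offset gives the expected total dimension (5312 for this label).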

Citation

BibTeX:

@INPROCEEDINGS{xxxxx,
  author={Jenthe Thienpondt and Kris Demuynck},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, 
  title={ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings}, 
  year={2023},
  volume={},
  number={}
}

APA:

Thienpondt, J., & Demuynck, K. (2023). ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

Contact

Name: Jenthe Thienpondt
E-mail: jenthe.thienpondt@ugent.be