W2V2-AASIST
A wav2vec 2.0 (XLS-R 300M) + AASIST anti-spoofing model, from "Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation" (Tak, Todisco, Wang, Jung, Yamagishi & Evans, Odyssey 2022). A self-supervised XLS-R front-end is fine-tuned end-to-end with an AASIST spectro-temporal graph-attention back-end. The model takes a raw speech waveform and returns a score where higher = more bona fide.
- Code: https://github.com/TakHemlata/SSL_Anti-spoofing
- Paper: https://arxiv.org/abs/2202.12233
- Parameters: 317,837,800 (317.84 M)
- Checkpoint: LA_model.pth (the LA variant)
The exact wrapper used to produce the Arena scores is in w2v2_aasist.py; the network definition is in _net.py.
Architecture
- wav2vec 2.0 XLS-R (300M) front-end: a self-supervised transformer (fairseq Wav2Vec2Model) producing 1024-d frame features, fine-tuned end-to-end with the rest of the network.
- AASIST back-end: the XLS-R features are projected to 128-d, max-pooled, passed through a RawNet2-style residual encoder, then heterogeneous stacking graph-attention layers (HS-GAL) over spectral and temporal sub-graphs with a learnable master node and graph pooling.
- The 2-logit output is read at index 1 = bona fide.
How it was trained
- Data: ASVspoof 2019 Logical Access (LA), with RawBoost data augmentation.
- Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
- Output: 2-class logits; the bona-fide logit (index 1) is the score.
See the source repository for the full training and evaluation code.
Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | 0.22 | 71,237 | 0 | in-domain (training data) |
| ASVspoof2021_LA | test | 8.11 | 181,566 | 0 | cross-dataset generalization |
| ASVspoof2021_DF | test | 8.32 | 611,829 | 0 | cross-dataset generalization |
| InTheWild | test | 11.22 | 31,779 | 0 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | 38.57 | 20,786 | 0 | out-of-domain (modern neural-TTS) |
| SONAR | test | 46.12 | 3,948 | 0 | out-of-domain (diverse deepfake sources) |
| LibriSeVoc | test | 11.21 | 18,487 | 0 | out-of-domain (LibriTTS neural vocoders) |
| CFAD | test | 17.28 | 62,999 | 0 | out-of-domain (Chinese fake-audio detection) |
| CVoiceFake_small | test | 21.79 | 138,136 | 0 | out-of-domain (multilingual vocoded TTS) |
| ASVspoof5 | test | 16.25 | 680,774 | 0 | out-of-domain (crowdsourced TTS/VC + adversarial) |
The self-supervised XLS-R front-end generalizes markedly better to unseen attacks than raw-waveform baselines, most strikingly on InTheWild (11.22 %) and CD-ADD (38.57 %), where lightweight CNN models degrade much further.
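For readers unfamiliar with the metric in the table, the following is a minimal sketch of how an equal error rate can be computed from scores where higher means more bona fide. It is an illustration only: the function name `compute_eer` and the tie-handling rule are assumptions, and the Arena's own scoring code may differ in detail (for example in how it interpolates between thresholds).

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Return (eer, threshold) for scores where higher = more bona fide.

    Hypothetical helper for illustration; not the Arena implementation.
    """
    bona = np.sort(np.asarray(bona_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    if bona.size == 0 or spoof.size == 0:
        raise ValueError("need at least one bona fide and one spoof score")

    # Candidate thresholds: every distinct score (np.unique returns them sorted).
    thresholds = np.unique(np.concatenate([bona, spoof]))

    # Decision rule: accept a trial as bona fide when score >= threshold.
    # False rejection rate: fraction of bona fide trials strictly below threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance rate: fraction of spoof trials at or above threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size

    # EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy scores, invented for the example (not Arena data).
eer, thr = compute_eer([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 3.5])
print(f"EER = {100 * eer:.2f} %, threshold = {thr}")
```

In the toy example one of four bona fide trials falls below the chosen threshold and one of four spoof trials lands at or above it, so both error rates are 25 %.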
Usage
The checkpoint is a state_dict for the Model network defined in
_net.py. Constructing the network requires the base XLS-R 300M
checkpoint xlsr2_300m.pt next to the wrapper (only used to build the
wav2vec 2.0 architecture; every weight is then overwritten by LA_model.pth):
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
The network expects exactly 64,600 samples at 16 kHz mono; window the waveform
with pad_fixed (first 64,600 samples, tile-repeat if shorter).
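The windowing described above could be sketched as follows. This is a hedged illustration written from the description alone (take the first 64,600 samples, tile-repeat shorter clips); the name `pad_fixed_sketch` is hypothetical, and the actual pad_fixed in w2v2_aasist.py may differ in details such as dtype handling.

```python
import numpy as np

TARGET_LEN = 64_600  # samples, about 4.04 s at 16 kHz

def pad_fixed_sketch(wave, target_len=TARGET_LEN):
    """Return exactly target_len samples: head crop, or tile-repeat if shorter.

    Hypothetical stand-in for the wrapper's pad_fixed, for illustration only.
    """
    wave = np.asarray(wave, dtype=np.float32).reshape(-1)
    if wave.size == 0:
        raise ValueError("cannot window an empty waveform")
    if wave.size >= target_len:
        # Deterministic crop: always the first target_len samples, no random offset.
        return wave[:target_len]
    # Repeat the clip enough times to cover target_len, then trim the excess.
    repeats = -(-target_len // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:target_len]

# A 3-second clip (48,000 samples) is tiled up to the fixed length.
clip = np.random.randn(48_000).astype(np.float32)
window = pad_fixed_sketch(clip)
print(window.shape, window.dtype)
```

Because the crop always starts at sample 0, the same file yields the same window on every run, which is what makes the scores reproducible.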
import numpy as np
from w2v2_aasist import W2V2AASIST # _net.py + w2v2_aasist.py are in this repo
m = W2V2AASIST()
m.load() # loads LA_model.pth (+ xlsr2_300m.pt)
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0]) # higher = more bona fide
m.unload()
Internally the wrapper windows the input, runs the network, and returns
logits[:, 1] (class 1 = bona fide). w2v2_aasist.py is the
exact speech_spoof_bench model that produced the Arena scores.txt.
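The score extraction step can be illustrated with plain NumPy. The sketch below assumes a batch of 2-class logits shaped (batch, 2) and uses invented values; `bona_fide_scores` and `bona_fide_probability` are hypothetical helper names, and the softmax variant is shown only for comparison, since the model card states that the raw logit at index 1 is the score.

```python
import numpy as np

BONA_FIDE_INDEX = 1  # per this model card: class 1 = bona fide

def bona_fide_scores(logits):
    """Return the raw bona fide logit for each item (higher = more bona fide)."""
    logits = np.asarray(logits, dtype=np.float64)
    if logits.ndim != 2 or logits.shape[1] != 2:
        raise ValueError("expected logits with shape (batch, 2)")
    return logits[:, BONA_FIDE_INDEX]

def bona_fide_probability(logits):
    """Softmax probability of the bona fide class (comparison only, an assumption)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp[:, BONA_FIDE_INDEX] / exp.sum(axis=1)

# Invented logits for two utterances: [spoof logit, bona fide logit].
logits = np.array([[2.0, -1.0], [0.5, 3.0]])
print(bona_fide_scores(logits))
print(bona_fide_probability(logits))
```

The softmax probability depends on both logits, so it can rank trials differently from the raw index-1 logit; reproducing the Arena numbers presumably requires the raw logit, as described above.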
Citation
This model / paper:
@inproceedings{tak2022automatic,
title={Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation},
author={Tak, Hemlata and Todisco, Massimiliano and Wang, Xin and Jung, Jee-weon and Yamagishi, Junichi and Evans, Nicholas},
booktitle={The Speaker and Language Recognition Workshop (Odyssey 2022)},
pages={112--119},
year={2022}
}
AASIST back-end:
@inproceedings{jung2022aasist,
title={{AASIST}: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks},
author={Jung, Jee-weon and Heo, Hee-Soo and Tak, Hemlata and Shim, Hye-jin and Chung, Joon Son and Lee, Bong-Jin and Yu, Ha-Jin and Evans, Nicholas},
booktitle={ICASSP 2022},
pages={6367--6371},
year={2022},
organization={IEEE}
}
License
MIT; see the source repository.