W2V2-AASIST


A wav2vec 2.0 (XLS-R 300M) + AASIST anti-spoofing model, from "Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation" (Tak, Todisco, Wang, Jung, Yamagishi & Evans, Odyssey 2022). A self-supervised XLS-R front-end is fine-tuned end-to-end with an AASIST spectro-temporal graph-attention back-end. The model takes a raw speech waveform and returns a score where higher = more bona fide.

The exact wrapper used to produce the Arena scores is in w2v2_aasist.py; the network definition is in _net.py.

Architecture

  1. wav2vec 2.0 XLS-R (300M) front-end: a self-supervised transformer (fairseq Wav2Vec2Model) producing 1024-d frame features, fine-tuned end-to-end with the rest of the network.
  2. AASIST back-end: the XLS-R features are projected to 128-d, max-pooled, passed through a RawNet2-style residual encoder, then heterogeneous stacking graph-attention layers (HS-GAL) over spectral and temporal sub-graphs with a learnable master node and graph pooling.
  3. The 2-logit output is read at index 1 = bona fide.
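To give a feel for the size of the feature sequence the back-end receives, the sketch below estimates how many 1024-d frames the front-end emits for one fixed-length input. It assumes XLS-R 300M uses the standard wav2vec 2.0 convolutional feature extractor (seven layers with the kernel sizes and strides listed in the code); the `num_frames` helper is hypothetical and is not part of _net.py.

```python
# Kernel size and stride of each convolutional layer in the standard
# wav2vec 2.0 feature extractor (assumed to apply unchanged to XLS-R 300M).
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Number of output frames for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


frames = num_frames(64_600)  # the 64,600-sample window used by this model
print(f"{frames} frames of 1024-d features, projected to 128-d by the back-end")
```

Under these assumptions the total stride is 320 samples, so each frame covers roughly 20 ms of audio at 16 kHz.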

How it was trained

  • Data: ASVspoof 2019 Logical Access (LA), with RawBoost data augmentation.
  • Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
  • Output: 2-class logits; the bona-fide logit (index 1) is the score.
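The output convention above can be illustrated with a few lines of NumPy. This is a minimal sketch with made-up logits, not the wrapper's code: `bona_fide_scores` and `bona_fide_probability` are hypothetical helper names, and the softmax step is only an optional way to read the two logits as a probability (the score described above is the raw index-1 logit).

```python
import numpy as np


def bona_fide_scores(logits: np.ndarray) -> np.ndarray:
    """Return the class-1 (bona fide) logit for each utterance."""
    logits = np.asarray(logits, dtype=np.float32)
    if logits.ndim != 2 or logits.shape[1] != 2:
        raise ValueError("expected logits of shape (batch, 2)")
    return logits[:, 1]


def bona_fide_probability(logits: np.ndarray) -> np.ndarray:
    """Optional: numerically stable softmax probability of class 1."""
    logits = np.asarray(logits, dtype=np.float32)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp[:, 1] / exp.sum(axis=1)


# Made-up logits for two utterances: columns are [spoof, bona fide].
example = np.array([[-1.0, 2.0], [3.0, -1.5]], dtype=np.float32)
print(bona_fide_scores(example))       # higher = more bona fide
print(bona_fide_probability(example))  # same ranking, squashed to [0, 1]
```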

See the source repository for the full training and evaluation code.

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file.

Dataset           Split  EER %  Trials   Skipped  Notes
ASVspoof2019_LA   test    0.22   71,237  0        in-domain (training data)
ASVspoof2021_LA   test    8.11  181,566  0        cross-dataset generalization
ASVspoof2021_DF   test    8.32  611,829  0        cross-dataset generalization
InTheWild         test   11.22   31,779  0        out-of-domain (real-world deepfakes)
CD-ADD            test   38.57   20,786  0        out-of-domain (modern neural-TTS)
SONAR             test   46.12    3,948  0        out-of-domain (diverse deepfake sources)
LibriSeVoc        test   11.21   18,487  0        out-of-domain (LibriTTS neural vocoders)
CFAD              test   17.28   62,999  0        out-of-domain (Chinese fake-audio detection)
CVoiceFake_small  test   21.79  138,136  0        out-of-domain (multilingual vocoded TTS)
ASVspoof5         test   16.25  680,774  0        out-of-domain (crowdsourced TTS/VC + adversarial)

The self-supervised XLS-R front-end generalizes markedly better to unseen attacks than raw-waveform baselines, most strikingly on InTheWild (11.22%) and CD-ADD (38.57%), where lightweight CNN models degrade much further.
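For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false-acceptance rate (spoof accepted as bona fide) equals the false-rejection rate (bona fide rejected). The sketch below is one simple way to estimate it from two score arrays. It is an illustration under stated assumptions, not the Arena's scoring code: `compute_eer` is a hypothetical helper, it ignores tied scores, and it does not interpolate between thresholds, so its result may differ slightly from the official numbers.

```python
import numpy as np


def compute_eer(bona_fide: np.ndarray, spoof: np.ndarray) -> float:
    """Approximate EER (as a fraction) when higher scores mean more bona fide."""
    bona_fide = np.asarray(bona_fide, dtype=np.float64)
    spoof = np.asarray(spoof, dtype=np.float64)
    scores = np.concatenate([bona_fide, spoof])
    labels = np.concatenate([np.ones(len(bona_fide)), np.zeros(len(spoof))])

    # Sort trials by score; sweep a threshold placed just above each score.
    labels = labels[np.argsort(scores, kind="mergesort")]

    # False-rejection rate: bona fide trials at or below the threshold.
    frr = np.concatenate([[0.0], np.cumsum(labels) / len(bona_fide)])
    # False-acceptance rate: spoof trials still above the threshold.
    far = np.concatenate([[1.0], 1.0 - np.cumsum(1.0 - labels) / len(spoof)])

    # Take the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(frr - far)))
    return float((frr[idx] + far[idx]) / 2.0)


# Toy scores (made up): one spoof trial outranks one bona fide trial.
eer = compute_eer(np.array([1.0, 3.0]), np.array([0.0, 2.0]))
print(f"EER = {100 * eer:.2f}%")
```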

Usage

The checkpoint is a state_dict for the Model network defined in _net.py. Constructing the network requires the base XLS-R 300M checkpoint xlsr2_300m.pt next to the wrapper (only used to build the wav2vec 2.0 architecture; every weight is then overwritten by LA_model.pth):

wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt

The input must be exactly 64,600 samples at 16 kHz mono: window the waveform with pad_fixed (first 64,600 samples, tile-repeat if shorter).
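The windowing rule can be sketched in NumPy as follows. This is an illustrative re-implementation of the behaviour described above, not the repository's pad_fixed: the name `pad_fixed_sketch`, its signature, and the empty-input check are assumptions.

```python
import numpy as np

TARGET_LEN = 64_600  # samples at 16 kHz (about 4.04 s)


def pad_fixed_sketch(wave: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Keep the first `target_len` samples, tiling the waveform if it is shorter."""
    wave = np.asarray(wave, dtype=np.float32).reshape(-1)
    if wave.size == 0:
        raise ValueError("empty waveform")
    if wave.size >= target_len:
        # Deterministic crop: always the first window, never a random one.
        return wave[:target_len]
    repeats = -(-target_len // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:target_len]


short = np.zeros(16_000, dtype=np.float32)   # 1 s of audio, repeated to fill
long = np.zeros(160_000, dtype=np.float32)   # 10 s of audio, cropped
print(pad_fixed_sketch(short).shape, pad_fixed_sketch(long).shape)
```

Because the crop is always the first window, scoring the same file twice gives the same input tensor, which is what makes the benchmark numbers reproducible.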

import numpy as np
from w2v2_aasist import W2V2AASIST   # _net.py + w2v2_aasist.py are in this repo

m = W2V2AASIST()
m.load()                                          # loads LA_model.pth (+ xlsr2_300m.pt)
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])         # higher = more bona fide
m.unload()

Internally the wrapper windows the input, runs the network, and returns logits[:, 1] (class 1 = bona fide). w2v2_aasist.py is the exact speech_spoof_bench model that produced the Arena scores.txt.

Citation

This model / paper:

@inproceedings{tak2022automatic,
  title={Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation},
  author={Tak, Hemlata and Todisco, Massimiliano and Wang, Xin and Jung, Jee-weon and Yamagishi, Junichi and Evans, Nicholas},
  booktitle={The Speaker and Language Recognition Workshop (Odyssey 2022)},
  pages={112--119},
  year={2022}
}

AASIST back-end:

@inproceedings{jung2022aasist,
  title={{AASIST}: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks},
  author={Jung, Jee-weon and Heo, Hee-Soo and Tak, Hemlata and Shim, Hye-jin and Chung, Joon Son and Lee, Bong-Jin and Yu, Ha-Jin and Evans, Nicholas},
  booktitle={ICASSP 2022},
  pages={6367--6371},
  year={2022},
  organization={IEEE}
}

License

MIT; see the source repository.
