rawnet2-plateau: Improved RawNet2 for ASVspoof 2019 LA

Best improved RawNet2 checkpoint from a comparative study of three neural anti-spoofing architectures. Supersedes caa-speech-detection-asvspoof2019/rawnet2.

Version: rawnet2_plateau (plateau LR scheduler)

Architecture

RawNet2 operating directly on raw 16 kHz audio waveforms.

| Component | Config |
|---|---|
| Input | Raw waveform, 16 kHz, padded/truncated to 64 000 samples (4 s) |
| Sinc filterbank | 128 filters, kernel length 129, linear scale, fixed (non-learnable) |
| Residual stack (stage 1) | 2 × ResBlock, 128 channels |
| Residual stack (stage 2) | 4 × ResBlock, 512 channels |
| GRU hidden size | 1024 |
| Embedding dim | 1024 |
| Classifier | Linear(1024 → 2) |
| Parameters | ~1.1 M |
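
The fixed 64 000-sample input can be produced with a small helper such as the sketch below. The function name `pad_or_truncate` is hypothetical, and zero-padding on the right is an assumption: the card does not say how short clips are padded (some RawNet2 implementations tile the waveform instead), so check the source repository before relying on this.

```python
import torch
import torch.nn.functional as F

TARGET_SAMPLES = 64_000  # 4 s at 16 kHz, as listed in the table above


def pad_or_truncate(waveform: torch.Tensor, target: int = TARGET_SAMPLES) -> torch.Tensor:
    """Return a (B, target) tensor from a (B, T) batch of raw waveforms.

    Assumption: long clips keep their first `target` samples and short
    clips are zero-padded on the right.
    """
    num_samples = waveform.shape[-1]
    if num_samples >= target:
        return waveform[..., :target]
    return F.pad(waveform, (0, target - num_samples))


short_clip = torch.randn(2, 16_000)  # 1 s clips
long_clip = torch.randn(2, 80_000)   # 5 s clips
print(pad_or_truncate(short_clip).shape, pad_or_truncate(long_clip).shape)
```

Both calls return a batch with 64 000 samples per clip, which is the shape the model expects.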

Each residual block uses pre-activation (BN → LeakyReLU(0.3) → Conv → BN → LeakyReLU → Conv), MaxPool(3), and Filter-wise Feature Map Scaling (FMS).
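
Filter-wise Feature Map Scaling can be sketched as follows. This is a minimal illustration, not the repository's code: the class name `FMS` is hypothetical, and the combined multiplicative-plus-additive form (`x * s + s`) is an assumption, since the card does not say which FMS variant this model uses.

```python
import torch
import torch.nn as nn


class FMS(nn.Module):
    """Filter-wise Feature Map Scaling over a (B, C, T) feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One scalar per filter: global average over time, then FC + sigmoid.
        scale = torch.sigmoid(self.fc(x.mean(dim=-1)))  # (B, C)
        scale = scale.unsqueeze(-1)                     # (B, C, 1), broadcasts over T
        # Assumed variant: scale each filter, then add the same scale as a bias.
        return x * scale + scale


fms = FMS(channels=128)
features = torch.randn(4, 128, 100)
print(fms(features).shape)  # same shape as the input
```

The scale vector has one entry per filter, so the block can emphasise or suppress whole feature maps without changing the tensor shape.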

Reference: Tak et al., "End-to-End Anti-Spoofing with RawNet2", ICASSP 2021.

Training

| Hyperparameter | Value |
|---|---|
| Epochs | 60 (stopped early at 43) |
| Batch size | 32 |
| Learning rate | 1e-4, plateau scheduler (patience=3) |
| Weight decay | 2e-4 |
| Gradient clip norm | 1.0 |
| Early stopping patience | 15 |
| Class weights | [8.837, 1.0] (spoof : bonafide) |
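
How the two patience values interact can be seen in a small simulation. This is a sketch of the general technique, not the training script: the function name `simulate_schedule`, the reduction factor of 0.5, and the synthetic loss curve are all assumptions (the card gives only the two patience values). In PyTorch, the scheduler half corresponds to `torch.optim.lr_scheduler.ReduceLROnPlateau`.

```python
def simulate_schedule(val_losses, lr=1e-4, lr_patience=3, stop_patience=15, factor=0.5):
    """Replay validation losses and return (final_lr, epochs_run)."""
    best = float("inf")
    since_best = 0       # epochs since the last improvement (early stopping)
    since_lr_event = 0   # bad epochs since the last improvement or LR cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
            since_lr_event = 0
        else:
            since_best += 1
            since_lr_event += 1
            # Cut the LR once the number of bad epochs exceeds the patience.
            if since_lr_event > lr_patience:
                lr *= factor  # factor is an assumed value
                since_lr_event = 0
        if since_best >= stop_patience:
            return lr, epoch
    return lr, len(val_losses)


# Synthetic curve: 28 improving epochs, then a plateau up to the 60-epoch budget.
losses = [1.0 - 0.01 * e for e in range(28)] + [1.0] * 32
final_lr, epochs_run = simulate_schedule(losses)
print(final_lr, epochs_run)
```

On this synthetic curve the learning rate is halved three times during the plateau and training stops at epoch 43, 15 epochs after the last improvement. Early-stopping conventions differ between codebases (some stop one epoch later), so the exact stopping epoch depends on the implementation.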

Dataset: ASVspoof 2019 LA train split (~25k utterances). No data augmentation.
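
The 8.837 class weight matches the class imbalance of the LA train split, which contains 2,580 bonafide and 22,800 spoofed utterances. The short check below reproduces the number; that the weight was derived from this count ratio is an inference, not something the card states.

```python
# ASVspoof 2019 LA train split utterance counts.
NUM_BONAFIDE = 2_580
NUM_SPOOF = 22_800

total = NUM_BONAFIDE + NUM_SPOOF
imbalance = NUM_SPOOF / NUM_BONAFIDE

print(total)                # 25380
print(round(imbalance, 3))  # 8.837
```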

Results

Baseline to beat: EER 8.09% (LFCC+GMM).

| Split | EER | Tandem min t-DCF | In-the-Wild EER |
|---|---|---|---|
| Dev | 0.0090% | n/a | n/a |
| Eval | 16.79% | 0.5065 | 38.57% |

Dev EER improved from 0.0398% (baseline RawNet2) to 0.0090%. However, the eval EER of 16.79% is 1.70 percentage points higher than the baseline RawNet2's 15.09%: the plateau scheduler substantially reduces dev EER but does not improve generalisation to the eval set.
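
For readers unfamiliar with the metric, EER is the operating point where the false acceptance rate (spoof accepted as bonafide) equals the false rejection rate (bonafide rejected). The sketch below finds it by sweeping thresholds over the observed scores. The function name `compute_eer` and the toy scores are illustrative, higher scores are assumed to mean "more likely bonafide", and this is not the official ASVspoof scoring script, which may interpolate differently.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer, best_thr = float("inf"), 1.0, None
    for thr in thresholds:
        # A trial is accepted as bonafide when its score is >= thr.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2, thr
    return eer, best_thr


# Toy scores: one bonafide trial and one spoof trial fall on the wrong side.
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.75]
eer, thr = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {thr}")  # EER = 25.00% at threshold 0.7
```

The sweep is quadratic in the number of trials, which is fine for illustration but too slow for the full eval set; a sorted cumulative count would be the usual optimisation.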

See learning_curves/rawnet2_baseline_vs_improved.png for the training trajectory.

Usage

Install dependencies from the source repository, then:

import torch
from src.models.rawnet2.model import RawNet2Model

config = {
    "target_samples": 64000,
    "sinc_filters": 128,
    "sinc_filter_length": 129,
    "sinc_scale": "linear",
    "learnable_sinc": False,
    "first_block_channels": 128,
    "second_block_channels": 512,
    "num_first_blocks": 2,
    "num_second_blocks": 4,
    "gru_hidden": 1024,
    "embedding_dim": 1024,
    "class_weights": [8.837, 1.0],
}

model = RawNet2Model(config)
state = torch.load("best.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

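# waveform: (B, T) float32 at 16 kHz, T = 64000.
# Placeholder input so the snippet runs end to end; replace with real audio.
waveform = torch.zeros(1, 64000)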
with torch.no_grad():
    logits = model({"waveform": waveform})["logits"]
    probs = torch.softmax(logits, dim=-1)  # [:, 0] = spoof, [:, 1] = bonafide

Source: github.com/sebastiaoteixeira/caa-ai-generated-speech-detector

Limitations

  • Trained on ASVspoof 2019 LA only. Not validated on physical access or other corpora.
  • In-the-Wild EER of 38.57% indicates that features learned from A01–A06 do not transfer well to real-world conditions.
  • Plateau scheduling improves convergence but does not solve the generalisation gap to unseen attacks.

Citation

@article{wang2020asvspoof,
  title     = {{ASVspoof} 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech},
  author    = {Wang, Xin and others},
  journal   = {Computer Speech \& Language},
  volume    = {64},
  year      = {2020}
}
@inproceedings{tak2021rawnet2,
  title     = {End-to-End Anti-Spoofing with {RawNet2}},
  author    = {Tak, Hemlata and Patino, Jose and Todisco, Massimiliano and Nautsch, Andreas and Evans, Nicholas and Larcher, Anthony},
  booktitle = {ICASSP},
  year      = {2021}
}
@inproceedings{jung2020rawnet,
  title     = {Improved {RawNet} with Feature Map Scaling for Text-Independent Speaker Verification using Raw Waveforms},
  author    = {Jung, Jee-weon and Kim, Seung-bin and Shim, Hye-jin and Kim, Ju-ho and Yu, Ha-Jin},
  booktitle = {Interspeech},
  year      = {2020}
}