rawnet2-plateau — Improved RawNet2 for ASVspoof 2019 LA

Best improved RawNet2 checkpoint from a comparative study of three neural anti-spoofing architectures. Supersedes caa-speech-detection-asvspoof2019/rawnet2.

Version: rawnet2_plateau (plateau LR scheduler)

Architecture

RawNet2 operating directly on raw 16 kHz audio waveforms.

Component	Config
Input	Raw waveform, 16 kHz, padded/truncated to 64 000 samples (4 s)
Sinc filterbank	128 filters, kernel length 129, linear scale, fixed (non-learnable)
Residual stack (stage 1)	2 × ResBlock, 128 channels
Residual stack (stage 2)	4 × ResBlock, 512 channels
GRU	hidden size 1024
Embedding dim	1024
Classifier	Linear(1024 → 2)
Parameters	~1.1 M
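
The Input row fixes every utterance to 64 000 samples. A minimal, framework-free sketch of that step is shown here; the helper name `fix_length` is hypothetical, and right-padding with zeros is an assumption, since the card does not say how short clips are padded (repeating the clip is another common choice).

```python
TARGET_SAMPLES = 64_000  # 4 s at 16 kHz, as in the Input row


def fix_length(samples, target=TARGET_SAMPLES):
    """Return a copy of `samples` with exactly `target` values.

    Longer inputs keep their first `target` samples; shorter inputs are
    right-padded with zeros (an assumed padding mode).
    """
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


short = fix_length([0.1, -0.2, 0.3])   # padded up to 64 000 samples
long = fix_length([0.5] * 70_000)      # truncated down to 64 000 samples
print(len(short), len(long))           # 64000 64000
```

The same logic carries over to tensors by slicing and then padding along the time axis.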

Each residual block applies pre-activation BN → LeakyReLU(0.3) → Conv → BN → LeakyReLU → Conv, followed by MaxPool(3) and filter-wise feature map scaling (FMS).
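
To make the FMS step concrete, the sketch below works on a plain channels × time list rather than a tensor. It assumes the variant that multiplies each channel by its scale and then adds the same scale; the card does not state which FMS variant the checkpoint uses, and the function name and the `weight`/`bias` parameters are hypothetical.

```python
import math


def fms(feature_map, weight, bias):
    """Apply filter-wise feature map scaling to a (channels x time) map.

    feature_map: list of C channels, each a list of T floats.
    weight, bias: parameters of a hypothetical C -> C linear layer.
    """
    # Global average pooling over time gives one value per channel.
    pooled = [sum(channel) / len(channel) for channel in feature_map]
    scales = []
    for row, b in zip(weight, bias):
        z = b + sum(w * p for w, p in zip(row, pooled))
        scales.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid keeps scales in (0, 1)
    # Assumed variant: scale each channel, then add the same scale.
    return [[v * s + s for v in channel]
            for channel, s in zip(feature_map, scales)]


x = [[1.0, 3.0], [2.0, -2.0]]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
out = fms(x, zero_w, [0.0, 0.0])  # zero parameters give a scale of 0.5 per channel
print(out)  # [[1.0, 2.0], [1.5, -0.5]]
```

Each channel gets a single scale derived from the whole utterance, so the block can emphasise or suppress individual filters.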

Reference: Tak et al., "End-to-End Anti-Spoofing with RawNet2", ICASSP 2021.

Training

Hyperparameter	Value
Epochs	60 (stopped early at 43)
Batch size	32
Learning rate	1e-4, plateau scheduler (patience=3)
Weight decay	2e-4
Gradient clip norm	1.0
Early stopping patience	15
Class weights	`[8.837, 1.0]` (spoof : bonafide)
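
The interaction between the plateau scheduler (patience 3) and early stopping (patience 15) can be sketched without any framework. The reduction factor of 0.5 and the choice of validation loss as the monitored quantity are assumptions, as neither is given in the table; `PlateauSchedule` and `run` are hypothetical names, not the repository's training loop.

```python
class PlateauSchedule:
    """Cut the learning rate when a monitored loss stops improving."""

    def __init__(self, lr=1e-4, patience=3, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor  # assumed value, not given in the table
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce once the stall exceeds `patience` epochs, then restart the count.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


def run(val_losses, stop_patience=15):
    """Return (last_epoch, final_lr) for a sequence of validation losses."""
    schedule = PlateauSchedule()
    best, stale = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        lr = schedule.step(loss)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= stop_patience:
            return epoch, lr  # early stop
    return len(val_losses), schedule.lr


# Loss improves for two epochs, then stalls: training stops at epoch 17.
last_epoch, final_lr = run([1.0, 0.8] + [0.9] * 20)
print(last_epoch, final_lr)
```

With these settings the learning rate is cut several times before early stopping fires, which is the usual reason to keep the scheduler patience well below the stopping patience.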

Dataset: ASVspoof 2019 LA train split (~25k utterances). No data augmentation.
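
The class weight of 8.837 appears to match the spoof-to-bonafide ratio of the training split. The quick check below assumes the official LA train protocol counts (22 800 spoofed and 2 580 bonafide utterances) and does not verify which class index carries the larger weight.

```python
# Official ASVspoof 2019 LA train protocol counts (assumed to be the
# split used here; they sum to the ~25k utterances quoted in the card).
N_SPOOF = 22_800
N_BONAFIDE = 2_580

ratio = N_SPOOF / N_BONAFIDE
print(f"{N_SPOOF + N_BONAFIDE} utterances, ratio {ratio:.3f}")
# 25380 utterances, ratio 8.837
```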

Results

Baseline to beat: EER 8.09% (LFCC+GMM).

Split	EER	tandem min t-DCF	In-the-Wild EER
Dev	0.0090%	—	—
Eval	16.79%	0.5065	38.57%

Relative to the baseline RawNet2 checkpoint, dev EER fell from 0.0398% to 0.0090%, but eval EER rose from 15.09% to 16.79%. The plateau scheduler therefore improves the fit to the dev set without improving generalisation to the eval set, and both RawNet2 checkpoints remain above the 8.09% LFCC+GMM baseline.
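
For readers unfamiliar with the metric, the sketch below approximates the equal error rate (EER) by sweeping every observed score as a threshold. It assumes higher scores mean "more likely bonafide" (for example, the bonafide probability from the Usage snippet). It is a simplified illustration with toy scores, not the official ASVspoof evaluation script, and `compute_eer` is a hypothetical helper.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate by sweeping observed thresholds.

    Higher scores are assumed to mean "more likely bonafide".
    Returns (eer, threshold) at the point where |FAR - FRR| is smallest.
    """
    best = None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: bonafide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores: one bonafide trial and one spoof trial are on the wrong side.
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.6]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
# EER = 25.00% at threshold 0.6
```

The quadratic sweep is fine for a few thousand trials; a sorted cumulative count would be the usual choice for full evaluation sets.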

See learning_curves/rawnet2_baseline_vs_improved.png for the training trajectory.

Usage

Install dependencies from the source repository, then:

import torch
from src.models.rawnet2.model import RawNet2Model

config = {
    "target_samples": 64000,
    "sinc_filters": 128,
    "sinc_filter_length": 129,
    "sinc_scale": "linear",
    "learnable_sinc": False,
    "first_block_channels": 128,
    "second_block_channels": 512,
    "num_first_blocks": 2,
    "num_second_blocks": 4,
    "gru_hidden": 1024,
    "embedding_dim": 1024,
    "class_weights": [8.837, 1.0],
}

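model = RawNet2Model(config)
# Assumes best.pt stores a bare state_dict rather than a full training checkpoint.
state = torch.load("best.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()  # inference mode: BatchNorm uses its running statistics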

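# waveform: (B, T) float32 at 16 kHz, T = 64000.
# Zero-filled placeholder so the snippet runs end to end; replace with real audio.
waveform = torch.zeros(1, 64000)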
with torch.no_grad():
    logits = model({"waveform": waveform})["logits"]
    probs = torch.softmax(logits, dim=-1)  # [:, 0] = spoof, [:, 1] = bonafide

Source: github.com/sebastiaoteixeira/caa-ai-generated-speech-detector

Limitations

Trained on ASVspoof 2019 LA only. Not validated on physical access or other corpora.
In-the-Wild EER of 38.57% indicates that features learned from A01–A06 do not transfer well to real-world conditions.
Plateau scheduling improves convergence but does not solve the generalisation gap to unseen attacks.

Citation

@article{wang2020asvspoof,
  title     = {{ASVspoof} 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech},
  author    = {Wang, Xin and others},
  journal   = {Computer Speech \& Language},
  volume    = {64},
  year      = {2020}
}

@inproceedings{tak2021rawnet2,
  title     = {End-to-End Anti-Spoofing with {RawNet2}},
  author    = {Tak, Hemlata and Patino, Jose and Todisco, Massimiliano and Nautsch, Andreas and Evans, Nicholas and Larcher, Anthony},
  booktitle = {ICASSP},
  year      = {2021}
}

@inproceedings{jung2020rawnet,
  title     = {Improved {RawNet} with Feature Map Scaling for Text-Independent Speaker Verification using Raw Waveforms},
  author    = {Jung, Jee-weon and Kim, Seung-bin and Shim, Hye-jin and Kim, Ju-ho and Yu, Ha-Jin},
  booktitle = {Interspeech},
  year      = {2020}
}
