rawnet2-plateau: Improved RawNet2 for ASVspoof 2019 LA
Best improved RawNet2 checkpoint from a comparative study of three neural anti-spoofing architectures. Supersedes caa-speech-detection-asvspoof2019/rawnet2.
Version: rawnet2_plateau (plateau LR scheduler)
Architecture
RawNet2 operating directly on raw 16 kHz audio waveforms.
| Component | Config |
|---|---|
| Input | Raw waveform, 16 kHz, padded/truncated to 64 000 samples (4 s) |
| Sinc filterbank | 128 filters, kernel length 129, linear scale, fixed (non-learnable) |
| Residual stack (stage 1) | 2 × ResBlock, 128 channels |
| Residual stack (stage 2) | 4 × ResBlock, 512 channels |
| GRU | hidden size 1024 |
| Embedding dim | 1024 |
| Classifier | Linear(1024 → 2) |
| Parameters | ~1.1 M |
Each residual block uses pre-activation BN → LeakyReLU(0.3) → Conv → BN → LeakyReLU → Conv, followed by MaxPool(3) and Filter-wise Feature Map Scaling (FMS).
Reference: Tak et al., "End-to-End Anti-Spoofing with RawNet2", ICASSP 2021.
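A minimal sketch of FMS may help make the last step of each residual block concrete. It derives one scale per channel from the time-averaged feature map and applies it multiplicatively, then additively. The class name `FMS`, the layer name `fc`, and the multiply-then-add variant are illustrative assumptions; the repository's implementation may differ.

```python
import torch
from torch import nn


class FMS(nn.Module):
    """Filter-wise feature map scaling: one learned scale per channel (sketch)."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        pooled = x.mean(dim=-1)                 # average over time -> (batch, channels)
        scale = torch.sigmoid(self.fc(pooled))  # per-channel scale in (0, 1)
        scale = scale.unsqueeze(-1)             # (batch, channels, 1), broadcasts over time
        # Assumed variant: multiply, then add the same scale vector.
        return x * scale + scale


fms = FMS(channels=128)
features = torch.randn(4, 128, 200)
scaled = fms(features)
print(scaled.shape)
```

The output keeps the input shape, so the module can sit after any block without changing the surrounding layer sizes.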
Training
| Hyperparameter | Value |
|---|---|
| Epochs | 60 (stopped early at 43) |
| Batch size | 32 |
| Learning rate | 1e-4, plateau scheduler (patience=3) |
| Weight decay | 2e-4 |
| Gradient clip norm | 1.0 |
| Early stopping patience | 15 |
| Class weights | [8.837, 1.0] (spoof : bonafide) |
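The plateau scheduler (patience 3) and early stopping (patience 15) both count epochs without improvement. The pure-Python sketch below replays a list of validation losses to show how the two counters interact. It follows the convention of `torch.optim.lr_scheduler.ReduceLROnPlateau` of reducing once the bad-epoch count exceeds the patience, but omits its threshold and cooldown options. The function name `simulate`, the reduction factor 0.1 (PyTorch's default), and the choice of validation loss as the monitored quantity are assumptions, not details taken from the training code.

```python
def simulate(val_losses, lr=1e-4, factor=0.1, lr_patience=3, stop_patience=15):
    """Replay validation losses; return (lr after each epoch, stop epoch or None)."""
    best = float("inf")
    bad_lr = 0    # epochs without improvement since the last LR reduction
    bad_stop = 0  # epochs without improvement since the best epoch
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_lr = 0
            bad_stop = 0
        else:
            bad_lr += 1
            bad_stop += 1
        if bad_lr > lr_patience:
            lr *= factor  # cut the learning rate and restart the plateau counter
            bad_lr = 0
        history.append(lr)
        if bad_stop >= stop_patience:
            return history, epoch
    return history, None


# Toy trajectory: three improving epochs, then a long plateau.
losses = [1.0, 0.8, 0.6] + [0.6] * 30
lr_history, stop_epoch = simulate(losses)
print(lr_history[5], lr_history[6], stop_epoch)
```

In this toy run the learning rate is first cut at epoch 7 and training stops at epoch 18, which is 15 epochs after the last improvement. Under the same counting convention, stopping at epoch 43 would place the last improvement at epoch 28, although the actual training code may count differently.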
Dataset: ASVspoof 2019 LA train split (~25k utterances). No data augmentation.
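The class weight 8.837 matches the ratio of spoofed to bonafide utterances in the LA train partition. The short check below reproduces that arithmetic. The utterance counts come from the published ASVspoof 2019 protocol rather than from this card, and that the weight was derived this way is an assumption; `imbalance_ratio` is a hypothetical helper.

```python
# Utterance counts for the ASVspoof 2019 LA train partition
# (from the published protocol, not from this model card).
NUM_SPOOF = 22_800
NUM_BONAFIDE = 2_580


def imbalance_ratio(num_majority: int, num_minority: int) -> float:
    """Return how many majority-class utterances exist per minority-class one."""
    return num_majority / num_minority


ratio = imbalance_ratio(NUM_SPOOF, NUM_BONAFIDE)
total = NUM_SPOOF + NUM_BONAFIDE
print(f"total utterances: {total}")
print(f"spoof/bonafide ratio: {ratio:.3f}")
```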
Results
Baseline to beat: the ASVspoof 2019 LFCC+GMM baseline, with an eval EER of 8.09%.
| Split | EER | tandem min t-DCF | In-the-Wild EER |
|---|---|---|---|
| Dev | 0.0090% | - | - |
| Eval | 16.79% | 0.5065 | 38.57% |
Dev EER improved from 0.0398% (baseline RawNet2 checkpoint) to 0.0090%. The eval EER of 16.79%, however, is 1.70 percentage points higher than the 15.09% of the baseline RawNet2 checkpoint: the plateau scheduler substantially reduces dev EER but does not improve generalisation to the eval set. Both eval figures remain above the 8.09% LFCC+GMM baseline.
See learning_curves/rawnet2_baseline_vs_improved.png for the training trajectory.
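For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoof accepted as bonafide) equals the false rejection rate (bonafide rejected). The sketch below approximates it by sweeping every observed score as a threshold, without interpolation. It is not the official ASVspoof scoring code; the function name `compute_eer` and the convention that a higher score means "more likely bonafide" are assumptions.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER; higher score means 'more likely bonafide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # extra threshold that rejects everything
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Accept a trial when its score is >= t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer = compute_eer(bonafide, spoof)
print(f"EER: {eer:.2%}")
```

The loop is quadratic in the number of trials, so it is only suitable for illustration; scoring the full eval set calls for a sorted, vectorised sweep.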
Usage
Install dependencies from the source repository, then:
```python
import torch

from src.models.rawnet2.model import RawNet2Model

# Values mirror the architecture and training tables above.
config = {
    "target_samples": 64000,
    "sinc_filters": 128,
    "sinc_filter_length": 129,
    "sinc_scale": "linear",
    "learnable_sinc": False,
    "first_block_channels": 128,
    "second_block_channels": 512,
    "num_first_blocks": 2,
    "num_second_blocks": 4,
    "gru_hidden": 1024,
    "embedding_dim": 1024,
    "class_weights": [8.837, 1.0],
}

model = RawNet2Model(config)
state = torch.load("best.pt", map_location="cpu")  # load checkpoint weights on CPU
model.load_state_dict(state)
model.eval()  # inference mode: BatchNorm uses its running statistics

# waveform: (B, T) float32 at 16 kHz, T = 64000
with torch.no_grad():
    logits = model({"waveform": waveform})["logits"]
    probs = torch.softmax(logits, dim=-1)  # [:, 0] = spoof, [:, 1] = bonafide
```
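The model expects exactly 64 000 samples per utterance, so shorter or longer clips need to be brought to that length first. A minimal sketch, assuming zero-padding on the right (tiling the signal is another common choice, and the repository may use either); `fix_length` and `TARGET_SAMPLES` are hypothetical names.

```python
import torch
import torch.nn.functional as F

TARGET_SAMPLES = 64_000  # 4 s at 16 kHz


def fix_length(waveform: torch.Tensor, target: int = TARGET_SAMPLES) -> torch.Tensor:
    """Truncate or right-pad a (B, T) waveform to exactly `target` samples."""
    num_samples = waveform.shape[-1]
    if num_samples >= target:
        return waveform[..., :target]  # keep the first `target` samples
    # Assumption: pad with zeros at the end of the clip.
    return F.pad(waveform, (0, target - num_samples))


short_clip = fix_length(torch.randn(2, 48_000))  # 3 s, gets padded
long_clip = fix_length(torch.randn(2, 80_000))   # 5 s, gets truncated
print(short_clip.shape, long_clip.shape)
```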
Source: github.com/sebastiaoteixeira/caa-ai-generated-speech-detector
Limitations
- Trained on ASVspoof 2019 LA only. Not validated on physical access or other corpora.
- In-the-Wild EER of 38.57% indicates that features learned from A01–A06 do not transfer well to real-world conditions.
- Plateau scheduling improves convergence but does not solve the generalisation gap to unseen attacks.
Citation
```bibtex
@article{wang2020asvspoof,
  title   = {{ASVspoof} 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech},
  author  = {Wang, Xin and others},
  journal = {Computer Speech \& Language},
  volume  = {64},
  year    = {2020}
}
@inproceedings{tak2021rawnet2,
  title     = {End-to-End Anti-Spoofing with {RawNet2}},
  author    = {Tak, Hemlata and Patino, Jose and Todisco, Massimiliano and Nautsch, Andreas and Evans, Nicholas and Larcher, Anthony},
  booktitle = {ICASSP},
  year      = {2021}
}
@inproceedings{jung2020rawnet,
  title     = {Improved {RawNet} with Feature Map Scaling for Text-Independent Speaker Verification using Raw Waveforms},
  author    = {Jung, Jee-weon and Kim, Seung-bin and Shim, Hye-jin and Kim, Ju-ho and Yu, Ha-Jin},
  booktitle = {Interspeech},
  year      = {2020}
}
```