Non-intrusive Distortion Suppression for Noise-Robust ASR (DS / DS4BSRNN)
Reference implementation and self-contained open reproduction of:
Wei Wang, Siyi Zhao, Yanmin Qian. Advancing Non-intrusive Suppression on Enhancement Distortion for Noise Robust ASR. ICASSP 2025.
This repository provides the method's source code, pretrained checkpoints, and the audio data, so the full training and evaluation pipeline runs out of the box. The paper's headline tables are measured on a private in-house Mandarin corpus (Table I) and licensed CHiME-4 + DNS data (Table III), which cannot be redistributed. This release therefore reproduces the method and its trends on fully public data (LibriTTS for speech, ESC-50 for noise, with a frozen Whisper ASR), so every number here is directly runnable and verifiable. The qualitative conclusions match the paper: SE enhancement degrades ASR, OA recovers it, and the adaptive DS / DS4BSRNN beat the fixed-coefficient OA baseline (the more variable the conditions, the larger the margin).
The only runtime download is the frozen Whisper ASR from the HF Hub (see Environment).
What is implemented
- Decoupled DS module (ds4se/ds_module.py, Fig. 1b + Algorithm 1): band-splits the original X and enhanced X̂ complex spectrograms, runs L Time/Band-RNN groups, and a linear+sigmoid head emits per-(band, frame) coefficients S ∈ [0,1]. Algorithm 1 interpolates X̃ = S·X + (1−S)·X̂.
- Coupled DS4BSRNN (ds4se/ds4bsrnn.py, Fig. 2): reuses the BSRNN SE's internal hidden representation H instead of a separate band-split, which means fewer parameters.
- BSRNN SE backbone (ds4se/bsrnn.py): self-contained, mirrors ESPnet's BSRNNSeparator, exposes the hidden H.
- Non-intrusive training (scripts/train_ds_real.py): SE and ASR are frozen; only the DS module is trained on the ASR (cross-entropy) loss; SE gradients are detached. Scheduled DS coefficients (warmup bias freeze) avoid the trivial S→0 collapse.
- Frozen Whisper ASR (ds4se/asr_whisper.py): a differentiable log-mel front-end wrapping openai/whisper-base.en for the ASR training loss and WER.
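The interpolation step of the decoupled DS module can be sketched in a few lines. The snippet below is an illustration, not the code in ds4se/ds_module.py: it assumes NumPy, a hypothetical partition of the frequency bins into bands (the repository's actual band edges are not reproduced here), and illustrative function names `expand_band_coefs` and `ds_interpolate`. Each per-(band, frame) coefficient is repeated over the bins of its band and then used to blend the original and enhanced spectrograms.

```python
import numpy as np

def expand_band_coefs(S_band, band_sizes):
    """Repeat per-(band, frame) coefficients over the bins of each band.

    S_band: (num_bands, T) array with values in [0, 1].
    band_sizes: number of frequency bins per band (hypothetical split).
    Returns an (F, T) array with F = sum(band_sizes).
    """
    return np.repeat(S_band, band_sizes, axis=0)

def ds_interpolate(X, X_enh, S_band, band_sizes):
    """Blend original X and enhanced X_enh as S*X + (1 - S)*X_enh."""
    S = expand_band_coefs(S_band, band_sizes)
    return S * X + (1.0 - S) * X_enh

# Toy example: 6 frequency bins grouped into 3 bands, 4 frames.
rng = np.random.default_rng(0)
F, T = 6, 4
band_sizes = [1, 2, 3]  # assumption: not the repository's band layout
X = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
X_enh = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
S_band = rng.uniform(0.0, 1.0, size=(len(band_sizes), T))

Y = ds_interpolate(X, X_enh, S_band, band_sizes)
```

With all coefficients at 1 the output equals the original spectrogram, and with all coefficients at 0 it equals the enhanced one, which is why a coefficient stuck at either extreme makes the module trivial.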
Results
End-to-end pipeline: noisy → frozen BSRNN SE → {OA | DS | DS4BSRNN} → frozen Whisper → WER. Data: LibriTTS (clean) + ESC-50 (noise), simulated at varying SNR. OA (observation-adding) is the fixed-coefficient baseline X̃ = s·X + (1−s)·X̂, with s tuned on dev and applied to test.
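Simulating a mixture at a target SNR amounts to rescaling the noise so that the speech-to-noise power ratio hits the requested value. A minimal NumPy sketch follows, assuming equal-length mono float signals; `mix_at_snr` is an illustrative name, and drawing the SNR uniformly from the range is an assumption rather than a description of the repository's data code.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale `noise` so the speech/noise power ratio equals snr_db, then add."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + eps
    # Target noise power is p_speech / 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz speech
noise = rng.standard_normal(16000)   # stand-in for an ESC-50 clip
snr_db = rng.uniform(-15.0, 25.0)    # assumed: uniform draw from the wide range
noisy, gain = mix_at_snr(speech, noise, snr_db)

# The SNR measured on the scaled components should match the target.
measured = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((gain * noise) ** 2))
```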
Wide-SNR test (−15 to 25 dB), DS modules trained on 10k utterances:
| Method | Test WER (%) | Params | vs OA |
|---|---|---|---|
| Noisy | 25.21 | - | - |
| SE-enhanced (BSRNN) | 26.11 | - | - |
| + OA (s=0.4, tuned on dev) | 20.43 | - | baseline |
| + DS (decoupled, sub-band) | 19.83 | 28.7K | −0.60 |
| + DS4BSRNN (coupled, sub-band) | 19.39 | 10.9K | −1.04 |
Narrow-SNR test (−5 to 20 dB): OA 11.19 / DS 11.09 (margin 0.10). The DS-over-OA margin grows ~6× under wide SNR (from 0.10 to 0.60), because a fixed OA coefficient cannot track the per-utterance optimum when conditions vary, while the adaptive DS can; this is the paper's central point. Consistent with the paper, DS4BSRNN is both the best and the most lightweight variant: it beats OA by 1.04 WER with about 38% of the decoupled DS's parameters (10.9K vs 28.7K).
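The margins quoted above follow directly from the reported WER values. The short calculation below only re-derives them from the numbers in this section; the variable names are illustrative.

```python
# WER values (percent) and parameter counts copied from the results above.
wide = {"OA": 20.43, "DS": 19.83, "DS4BSRNN": 19.39}
narrow = {"OA": 11.19, "DS": 11.09}
params = {"DS": 28.7e3, "DS4BSRNN": 10.9e3}

# Absolute WER margins over the fixed-coefficient OA baseline.
wide_margin_ds = round(wide["OA"] - wide["DS"], 2)
wide_margin_ds4 = round(wide["OA"] - wide["DS4BSRNN"], 2)
narrow_margin_ds = round(narrow["OA"] - narrow["DS"], 2)

# How much the DS margin grows from the narrow to the wide SNR range.
margin_growth = round(wide_margin_ds / narrow_margin_ds, 1)

# Size of the coupled model relative to the decoupled one.
param_ratio = round(params["DS4BSRNN"] / params["DS"], 2)

print(wide_margin_ds, wide_margin_ds4, narrow_margin_ds, margin_growth, param_ratio)
```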
Repository layout
ds4se/ # the method (BSRNN, DS, DS4BSRNN, STFT, Whisper wrapper, losses, data)
scripts/ # prepare_manifests, train_se, train_ds_real, eval_wer, inspect_ds_coef, smoke_test
ckpt/ # pretrained: se_bsrnn.pt, ds_subband_wide.pt, ds4bsrnn_wide.pt, ds_subband_v5.pt
data/
  clean/ # LibriTTS clips (train-clean-100 subset / dev-clean / test-clean) + transcripts
  noise/ # ESC-50 (audio/ + meta/esc50.csv)
  manifests/ # train.json (10k) / dev.json (150) / test.json (300), RELATIVE paths
run_all.sh # one-command end-to-end recipe
Environment
pip install -r requirements.txt
Tested with PyTorch 2.9 (CUDA 12.x) on a single 80 GB GPU. The frozen ASR
(openai/whisper-base.en, ~290 MB) is downloaded from the HF Hub on first use;
for an offline machine, pre-fetch it (e.g. huggingface-cli download openai/whisper-base.en) and set HF_HUB_OFFLINE=1.
Reproduce the reported numbers (pretrained checkpoints, no training)
# Wide-SNR evaluation: OA tuned on dev -> applied to test; DS + DS4BSRNN
python scripts/eval_wer.py --se ckpt/se_bsrnn.pt \
--ds ckpt/ds_subband_wide.pt --ds4 ckpt/ds4bsrnn_wide.pt \
--snr_lo -15 --snr_hi 25
This uses only in-repo data + checkpoints and prints the table above.
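The OA row follows a tune-on-dev, apply-to-test protocol. The sketch below shows the idea with a stand-in scoring function. The name `tune_oa`, the 0.0 to 1.0 grid with step 0.1, and `toy_dev_wer` are assumptions made for illustration; they do not describe the internals of scripts/eval_wer.py, and the toy curve is synthetic rather than measured.

```python
def tune_oa(dev_wer_fn, grid=None):
    """Return the fixed coefficient s with the lowest dev WER.

    dev_wer_fn: callable mapping s -> WER on the dev set, where the signal
    sent to the ASR is s * original + (1 - s) * enhanced.
    """
    if grid is None:
        grid = [i / 10 for i in range(11)]  # assumed search grid 0.0 ... 1.0
    scores = {s: dev_wer_fn(s) for s in grid}
    best_s = min(scores, key=scores.get)
    return best_s, scores

# Stand-in for "run SE + OA + Whisper on dev and compute WER":
# a synthetic convex curve whose minimum sits at s = 0.4.
def toy_dev_wer(s):
    return 20.0 + 30.0 * (s - 0.4) ** 2

best_s, scores = tune_oa(toy_dev_wer)
# best_s is then frozen and reused unchanged on the test set.
```

Because the selected coefficient is a single scalar shared by every utterance, this baseline has no way to react to the SNR of an individual test item, which is the limitation the DS modules address.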
Full training recipe (from scratch)
bash run_all.sh # SE -> DS (decoupled) -> DS4BSRNN -> eval, on the bundled data
or step by step:
# (0) manifests are shipped; regenerate only if starting from a full LibriTTS:
# python scripts/prepare_manifests.py --libritts_root <LibriTTS> \
# --n_train 10000 --n_dev 150 --n_test 300 --relative_to .
# (1) train the BSRNN speech-enhancement front-end (SI-SDR loss)
python scripts/train_se.py --epochs 40 --num_channel 64 --num_layer 6 \
--out ckpt/se_bsrnn.pt
# (2) non-intrusive DS training against frozen Whisper (wide SNR, unlocked adaptivity)
python scripts/train_ds_real.py --se ckpt/se_bsrnn.pt --updates 8000 \
--warmup 200 --mode sub-band --lr 1e-2 --cosine --weight_decay 1e-5 \
--s_tv 0.0 --bias_init 1.0 --dev_items 150 --snr_lo -15 --snr_hi 25 \
--out ckpt/ds_subband_wide.pt
python scripts/train_ds_real.py --se ckpt/se_bsrnn.pt --coupled --updates 8000 \
--warmup 200 --mode sub-band --lr 1e-2 --cosine --weight_decay 1e-5 \
--s_tv 0.0 --bias_init 1.0 --dev_items 150 --snr_lo -15 --snr_hi 25 \
--out ckpt/ds4bsrnn_wide.pt
# (3) evaluate
python scripts/eval_wer.py --se ckpt/se_bsrnn.pt \
--ds ckpt/ds_subband_wide.pt --ds4 ckpt/ds4bsrnn_wide.pt \
--snr_lo -15 --snr_hi 25
# (optional) inspect the learned DS coefficients (adaptivity diagnostic)
python scripts/inspect_ds_coef.py --ds ckpt/ds_subband_wide.pt
A quick self-check that exercises the core mechanisms without training:
python scripts/smoke_test.py
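For a feel of what such a check can cover, here is a dependency-free sketch of two invariants of the interpolation: S = 1 returns the original spectrogram and S = 0 returns the enhanced one. It was written for illustration and is not the contents of scripts/smoke_test.py; `sigmoid`, `interpolate`, and the toy values are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def interpolate(x, x_enh, s):
    """Element-wise s*x + (1 - s)*x_enh over flat lists of complex bins."""
    return [si * a + (1.0 - si) * b for a, b, si in zip(x, x_enh, s)]

x = [1 + 2j, -0.5 + 0.25j, 3 - 1j]      # toy "original" bins
x_enh = [0.5 + 1j, 0.0 + 0.0j, 2 - 2j]  # toy "enhanced" bins

# A sigmoid head yields coefficients inside (0, 1) for moderate logits.
logits = [-4.0, 0.0, 4.0]
s = [sigmoid(z) for z in logits]
assert all(0.0 < si < 1.0 for si in s)

# Boundary cases recover the two inputs exactly.
assert interpolate(x, x_enh, [1.0] * 3) == x
assert interpolate(x, x_enh, [0.0] * 3) == x_enh

# Any valid coefficient keeps each output bin on the segment between inputs.
y = interpolate(x, x_enh, s)
for a, b, yi in zip(x, x_enh, y):
    assert abs(yi - a) <= abs(b - a) + 1e-12
```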
Open-reproduction setup vs the paper's headline tables
| Aspect | Paper (Tables I / III) | This open reproduction |
|---|---|---|
| Data | private in-house Mandarin / licensed CHiME-4 + DNS | public LibriTTS + ESC-50 (simulated) |
| ASR back-end | Paraformer / large Whisper | frozen openai/whisper-base.en |
| Reported quantity | absolute CER/WER | runnable WER reproducing the same trends |
A few implementation details not fully specified in the paper text were set
explicitly here: a 23-band split at 16 kHz, a DS band-embedding dimension of
N=16, and the initial value of the scheduled bias (bias_init).
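For reference, the WER reported in this reproduction is the word-level edit distance between hypothesis and reference divided by the number of reference words. A minimal pure-Python sketch follows. scripts/eval_wer.py may apply text normalization or rely on a library, so `word_error_rate` should be read as an illustrative helper rather than the repository's scorer.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A corpus-level figure is usually obtained by summing the edit counts and reference lengths over all utterances before dividing, not by averaging per-utterance rates.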
Data & licenses
The bundled audio is redistributed under its original licenses; see
DATA_LICENSES.md. ESC-50 is CC BY-NC, so the bundled
artifact as a whole is for non-commercial / academic use. The repository
code is MIT-licensed (see LICENSE).
Citation
@inproceedings{wang2025advancing,
title = {Advancing Non-intrusive Suppression on Enhancement Distortion for Noise Robust ASR},
author = {Wang, Wei and Zhao, Siyi and Qian, Yanmin},
booktitle = {ICASSP},
year = {2025}
}