Non-intrusive Distortion Suppression for Noise-Robust ASR (DS / DS4BSRNN)

Reference implementation and self-contained open reproduction of:

Wei Wang, Siyi Zhao, Yanmin Qian. Advancing Non-intrusive Suppression on Enhancement Distortion for Noise Robust ASR. ICASSP 2025.

This repository provides the method's source code, pretrained checkpoints, and the audio data, so the full training and evaluation pipeline runs out of the box. The paper's headline tables are measured on a private in-house Mandarin corpus (Table I) and on licensed CHiME-4 + DNS data (Table III), neither of which can be redistributed. This release therefore reproduces the method and its trends on fully public data: LibriTTS (speech) and ESC-50 (noise) with a frozen Whisper ASR, so every number here can be rerun and verified. The qualitative conclusions match the paper: speech enhancement (SE) alone degrades ASR, observation adding (OA) recovers it, and the adaptive DS and DS4BSRNN beat the fixed-coefficient OA baseline, with a larger margin the more variable the conditions are.

The only runtime download is the frozen Whisper ASR from the HF Hub (see Environment).

What is implemented

Decoupled DS module (ds4se/ds_module.py, Fig. 1b + Algorithm 1): band-splits the original X and enhanced X̃ complex spectrograms and runs L Time/Band-RNN groups; a linear+sigmoid head then emits per-(band, frame) coefficients S ∈ [0,1]. Algorithm 1 interpolates X̂ = S·X + (1−S)·X̃.
Coupled DS4BSRNN (ds4se/ds4bsrnn.py, Fig. 2): reuses the BSRNN SE's internal hidden representation H instead of a separate band-split, which reduces the parameter count (10.9K vs 28.7K in the results below).
BSRNN SE backbone (ds4se/bsrnn.py): self-contained, mirrors ESPnet's BSRNNSeparator, exposes the hidden H.
Non-intrusive training (scripts/train_ds_real.py): SE and ASR are frozen; only the DS module is trained on the ASR (cross-entropy) loss; SE gradients are detached. Scheduled DS coefficients (warmup bias freeze) avoid the trivial S→0 collapse.
Frozen Whisper ASR (ds4se/asr_whisper.py): a differentiable log-mel front-end wrapping openai/whisper-base.en for the ASR training loss and WER.
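
The interpolation step of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in ds4se/ds_module.py: the function name `apply_ds_coefficients`, the `band_edges` layout, and the toy shapes are all hypothetical. It expands per-(band, frame) coefficients to per-bin weights and blends the original and enhanced spectrograms.

```python
import numpy as np

def apply_ds_coefficients(X, X_enh, S, band_edges):
    """Blend original and enhanced spectrograms: X_hat = S*X + (1-S)*X_enh.

    X, X_enh   : complex arrays of shape (F, T), frequency bins x frames
    S          : real array of shape (K, T), one coefficient per band and frame
    band_edges : K+1 increasing bin indices; band k covers [edges[k], edges[k+1])
    """
    widths = np.diff(band_edges)              # number of bins in each band
    assert widths.sum() == X.shape[0], "bands must tile all F bins"
    S_bins = np.repeat(S, widths, axis=0)     # (K, T) -> (F, T)
    return S_bins * X + (1.0 - S_bins) * X_enh

rng = np.random.default_rng(0)
F, T = 8, 5
X = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
X_enh = 0.5 * X                               # stand-in for the SE output
band_edges = np.array([0, 2, 5, 8])           # hypothetical 3-band split

S = np.zeros((3, T))                          # band 2 stays 0: keep X_enh
S[0] = 1.0                                    # band 0: keep the original X
S[1] = 0.5                                    # band 1: equal blend
X_hat = apply_ds_coefficients(X, X_enh, S, band_edges)
```

Setting every entry of S to the same scalar s recovers the fixed-coefficient OA blend, so OA can be read as a special case of this interpolation.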

Results

End-to-end pipeline: noisy → frozen BSRNN SE → {OA | DS | DS4BSRNN} → frozen Whisper → WER. Data: LibriTTS (clean) + ESC-50 (noise), simulated at varying SNR. OA (observation-adding) is the fixed-coefficient baseline X̂ = s·X + (1−s)·X̃, with s tuned on dev and applied to test.
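
Tuning the OA coefficient amounts to a one-dimensional search over s on the dev set. The sketch below assumes a plain grid search, which the text above does not specify. The names `oa_mix`, `tune_oa_coefficient`, and `toy_dev_wer` are hypothetical; in practice the scoring function would run the full SE → OA → Whisper pipeline on the dev manifest.

```python
def oa_mix(x, x_enh, s):
    """Observation adding with one fixed coefficient: s*x + (1-s)*x_enh."""
    return [s * a + (1.0 - s) * b for a, b in zip(x, x_enh)]

def tune_oa_coefficient(dev_wer_fn, grid):
    """Return the grid value of s with the lowest dev-set score."""
    return min(grid, key=dev_wer_fn)

def toy_dev_wer(s):
    # Stand-in for a real dev-set WER evaluation; the minimum is placed
    # at s = 0.4 by construction, purely for illustration.
    return (s - 0.4) ** 2

grid = [i / 10 for i in range(11)]            # s in {0.0, 0.1, ..., 1.0}
best_s = tune_oa_coefficient(toy_dev_wer, grid)

# The selected coefficient is then applied unchanged to every test utterance.
mixed = oa_mix([1.0, 2.0], [3.0, 4.0], 0.5)
```

Because a single s is shared by all utterances, bands, and frames, it can only match the average condition of the dev set; that is the limitation the adaptive DS coefficients address.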

Wide-SNR test (−15…25 dB), DS modules trained on 10k utterances:

| Method | Test WER (%) | Params | ΔWER vs OA (abs. points) |
| --- | --- | --- | --- |
| Noisy | 25.21 | - | - |
| SE-enhanced (BSRNN) | 26.11 | - | - |
| + OA (s=0.4, tuned on dev) | 20.43 | - | baseline |
| + DS (decoupled, sub-band) | 19.83 | 28.7K | −0.60 |
| + DS4BSRNN (coupled, sub-band) | 19.39 | 10.9K | −1.04 |

Narrow-SNR test (−5…20 dB): OA 11.19 / DS 11.09 (margin 0.10). The DS-over-OA margin grows about 6× under wide SNR (0.60 vs 0.10 absolute WER points), because a fixed OA coefficient cannot track the per-utterance optimum when conditions vary, while the adaptive DS can; this is the paper's central point. Consistent with the paper, DS4BSRNN is both the most accurate and the most lightweight variant: it beats OA by 1.04 absolute WER points with about 38% of the decoupled DS's parameters (10.9K vs 28.7K).
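
The comparisons quoted above can be re-derived from the reported numbers. The snippet below is only an arithmetic check on values copied from this section; the variable names are illustrative.

```python
# WER values (%) and parameter counts copied from the results above
wide = {"OA": 20.43, "DS": 19.83, "DS4BSRNN": 19.39}
narrow = {"OA": 11.19, "DS": 11.09}
params = {"DS": 28.7e3, "DS4BSRNN": 10.9e3}

wide_margin = wide["OA"] - wide["DS"]            # absolute WER points, wide SNR
narrow_margin = narrow["OA"] - narrow["DS"]      # absolute WER points, narrow SNR
margin_ratio = wide_margin / narrow_margin       # growth of the DS-over-OA margin
ds4_margin = wide["OA"] - wide["DS4BSRNN"]       # DS4BSRNN gain over OA
param_ratio = params["DS4BSRNN"] / params["DS"]  # coupled vs decoupled size
```

Rounded to two decimals, the margins are 0.60 (wide) and 0.10 (narrow), a ratio of about 6, and the parameter ratio is roughly 0.38, a little above one third.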

Repository layout

ds4se/            # the method (BSRNN, DS, DS4BSRNN, STFT, Whisper wrapper, losses, data)
scripts/          # prepare_manifests, train_se, train_ds_real, eval_wer, inspect_ds_coef, smoke_test
ckpt/             # pretrained: se_bsrnn.pt, ds_subband_wide.pt, ds4bsrnn_wide.pt, ds_subband_v5.pt
data/
  clean/          # LibriTTS clips (train-clean-100 subset / dev-clean / test-clean) + transcripts
  noise/          # ESC-50 (audio/ + meta/esc50.csv)
  manifests/      # train.json (10k) / dev.json (150) / test.json (300), RELATIVE paths
run_all.sh        # one-command end-to-end recipe

Environment

pip install -r requirements.txt

Tested with PyTorch 2.9 (CUDA 12.x) on a single 80 GB GPU. The frozen ASR (openai/whisper-base.en, ~290 MB) is downloaded from the HF Hub on first use; for an offline machine, pre-fetch it (e.g. huggingface-cli download openai/whisper-base.en) and set HF_HUB_OFFLINE=1.
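
The same offline setup can be done from Python. This is a hedged sketch: `enable_hf_offline` and `prefetch_whisper` are hypothetical helper names, while `snapshot_download` is the huggingface_hub function for fetching a full model repository into the local cache. To be safe, set the environment variable before importing transformers or huggingface_hub, since those libraries read it when they are first loaded.

```python
import os

def enable_hf_offline():
    """Ask Hugging Face libraries to use only the local cache (no network)."""
    os.environ["HF_HUB_OFFLINE"] = "1"

def prefetch_whisper(repo_id="openai/whisper-base.en"):
    """Download the model snapshot into the local HF cache (needs network)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose
    return snapshot_download(repo_id=repo_id)

# On a connected machine, run prefetch_whisper() once in a separate session.
# On the offline machine, switch to cache-only mode before any HF imports:
enable_hf_offline()
```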

Reproduce the reported numbers (pretrained checkpoints, no training)

# Wide-SNR evaluation: OA tuned on dev -> applied to test; DS + DS4BSRNN
python scripts/eval_wer.py --se ckpt/se_bsrnn.pt \
    --ds ckpt/ds_subband_wide.pt --ds4 ckpt/ds4bsrnn_wide.pt \
    --snr_lo -15 --snr_hi 25

This uses only in-repo data + checkpoints and prints the table above.
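
For readers unfamiliar with SNR-controlled simulation, the sketch below shows one common way to mix a clean utterance with a noise clip at a target SNR. It is not taken from the repository: `mix_at_snr` is a hypothetical name, the uniform draw between the two SNR bounds is an assumption the text above does not state, and random arrays stand in for real audio.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, eps=1e-12):
    """Scale `noise` so the clean-to-noise power ratio is `snr_db`, then add it."""
    noise = np.resize(noise, clean.shape)          # repeat or truncate to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + eps            # eps guards against silent noise
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return clean + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                 # 1 s stand-in for speech at 16 kHz
noise = rng.standard_normal(5000)                  # shorter stand-in for a noise clip
snr_db = rng.uniform(-15.0, 25.0)                  # assumed uniform draw over the wide range
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db)

# Sanity check: the realised SNR should equal the requested one.
measured = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(scaled_noise ** 2))
```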

Full training recipe (from scratch)

bash run_all.sh        # SE -> DS (decoupled) -> DS4BSRNN -> eval, on the bundled data

or step by step:

# (0) manifests are shipped; regenerate only if starting from a full LibriTTS:
#     python scripts/prepare_manifests.py --libritts_root <LibriTTS> \
#         --n_train 10000 --n_dev 150 --n_test 300 --relative_to .

# (1) train the BSRNN speech-enhancement front-end (SI-SDR loss)
python scripts/train_se.py --epochs 40 --num_channel 64 --num_layer 6 \
    --out ckpt/se_bsrnn.pt

# (2) non-intrusive DS training against frozen Whisper (wide SNR, unlocked adaptivity)
python scripts/train_ds_real.py --se ckpt/se_bsrnn.pt --updates 8000 \
    --warmup 200 --mode sub-band --lr 1e-2 --cosine --weight_decay 1e-5 \
    --s_tv 0.0 --bias_init 1.0 --dev_items 150 --snr_lo -15 --snr_hi 25 \
    --out ckpt/ds_subband_wide.pt
python scripts/train_ds_real.py --se ckpt/se_bsrnn.pt --coupled --updates 8000 \
    --warmup 200 --mode sub-band --lr 1e-2 --cosine --weight_decay 1e-5 \
    --s_tv 0.0 --bias_init 1.0 --dev_items 150 --snr_lo -15 --snr_hi 25 \
    --out ckpt/ds4bsrnn_wide.pt

# (3) evaluate
python scripts/eval_wer.py --se ckpt/se_bsrnn.pt \
    --ds ckpt/ds_subband_wide.pt --ds4 ckpt/ds4bsrnn_wide.pt \
    --snr_lo -15 --snr_hi 25

# (optional) inspect the learned DS coefficients (adaptivity diagnostic)
python scripts/inspect_ds_coef.py --ds ckpt/ds_subband_wide.pt

A quick self-check that exercises the core mechanisms without training:

python scripts/smoke_test.py

Open-reproduction setup vs the paper's headline tables

| Aspect | Paper (Tables I / III) | This open reproduction |
| --- | --- | --- |
| Data | private in-house Mandarin / licensed CHiME-4 + DNS | public LibriTTS + ESC-50 (simulated) |
| ASR back-end | Paraformer / large Whisper | frozen `openai/whisper-base.en` |
| Reported quantity | absolute CER/WER | runnable WER reproducing the same trends |
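
Both columns report an edit-distance error rate. For reference, a minimal sketch of the textbook word-level WER follows. The function name `word_error_rate` is illustrative, and any text normalisation applied by scripts/eval_wer.py is not described here, so this may differ in detail from the repository's scoring.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 ref words and first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion of a reference word
                          curr[j - 1] + 1,         # insertion of a hypothesis word
                          prev[j - 1] + (r != h))  # substitution, or free match
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.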

A few implementation details not fully specified in the paper text were fixed explicitly here: a 23-band 16 kHz split, DS band-embedding dim N=16, and the scheduled-bias bias_init.

Data & licenses

The bundled audio is redistributed under its original licenses — see DATA_LICENSES.md. ESC-50 is CC BY-NC, so the bundled artifact as a whole is for non-commercial / academic use. The repository code is MIT-licensed (see LICENSE).

Citation

@inproceedings{wang2025advancing,
  title     = {Advancing Non-intrusive Suppression on Enhancement Distortion for Noise Robust ASR},
  author    = {Wang, Wei and Zhao, Siyi and Qian, Yanmin},
  booktitle = {ICASSP},
  year      = {2025}
}
