Commit 4f3cde90
Parent(s): (none)
Duplicate from cisco-ai/pase
Co-authored-by: Mansur Yesilbursa <myesilbursa@users.noreply.huggingface.co>
- .gitattributes +36 -0
- DeWavLM.tar +3 -0
- README.md +158 -0
- Vocoder_Dual.tar +3 -0
- Vocoder_L24.tar +3 -0
- framework_all.png +3 -0
.gitattributes ADDED
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
framework_all.png filter=lfs diff=lfs merge=lfs -text
DeWavLM.tar ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e347a95ad01d15f0c3ec86804f1177e2b197e16012718f03fde152bb4e78b74f
size 1261989218
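The `.tar` entries in this commit are Git LFS pointer files rather than the archives themselves; each pointer records the spec version, the object's SHA-256 digest, and its byte size. A minimal sketch of reading such a pointer (the `parse_lfs_pointer` helper is illustrative, not part of this repository):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:e347a95ad01d15f0c3ec86804f1177e2b197e16012718f03fde152bb4e78b74f
size 1261989218
"""
info = parse_lfs_pointer(pointer)
assert info["oid"].startswith("sha256:")   # digest of the real archive
assert int(info["size"]) == 1261989218     # ~1.2 GB checkpoint archive
```

After `git lfs pull`, the `oid` digest can be checked against the downloaded file to verify integrity.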
README.md ADDED
@@ -0,0 +1,158 @@
---
license: apache-2.0
pipeline_tag: audio-to-audio
---

# PASE: Phonologically Anchored Speech Enhancer

PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.

---

## Model Details

### Model Description

<img src="framework_all.png" alt="High-level system design" width="80%">

PASE contains two main components:

- **Denoising WavLM (DeWavLM)**
  Fine-tuned from WavLM-Large using denoising representation distillation (DRD).
  Performs robust noise suppression while effectively mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.

- **Dual-Stream Vocoder**
  Reconstructs audio from DeWavLM's dual-stream representations:
  - **Phonetic representation**: high-level linguistic structure
  - **Acoustic representation**: speaker identity and prosody

**Developed by:** Copyright © 2026 by Cisco Systems, Inc. All rights reserved.
**Cisco product group:** Collaboration AI: Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki
**Model type:** Generative Speech Enhancement
**License:** Apache 2.0
**Finetuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)

---

### Model Sources

- **Repository:** https://github.com/cisco-open/pase
- **Paper:** https://arxiv.org/abs/2511.13300
- **Demo:** https://xiaobin-rong.github.io/pase_demo/

---
## Uses

### Direct Use

- Enhance noisy or reverberant speech recordings
- Improve perceptual quality and intelligibility
- Preserve speaker identity and linguistic content
- Supports **16 kHz mono audio**

### Out-of-Scope Use

- Medical, legal, or safety-critical decisions
- Voice conversion or identity manipulation
- Non-speech audio enhancement

---

## How to Get Started

Refer to the repository for quick-start code and examples:
https://github.com/cisco-open/pase

---

## Training Details

### Training Data

We release a PASE checkpoint trained on an updated list of datasets. For this release, training used:

- Clean speech:
  - DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
  - [LibriTTS](https://www.openslr.org/60/)
  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
- Noise:
  - DNS5 Challenge noise resources
- Room impulse responses:
  - [OpenSLR26](https://www.openslr.org/26/)
  - [OpenSLR28](https://www.openslr.org/28/)

These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.

### Dataset Attribution

- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
- LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
- OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.

All audio was resampled to 16 kHz.

### Training Procedure

#### Preprocessing

- Mixtures generated dynamically
- SNR sampled from -5 to 15 dB
- Reverberation applied with 50% probability
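As a rough illustration of the dynamic mixing described above (a minimal sketch, not the authors' actual pipeline; the function name and the uniform SNR sampling are assumptions), noise is scaled so the clean-to-noise power ratio hits the sampled SNR before mixing:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the clean/noise power ratio equals `snr_db`, then mix."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    # Target noise power for the requested SNR: P_n = P_c / 10^(SNR/10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of 16 kHz "speech" stand-in
noise = rng.standard_normal(16000)
snr_db = rng.uniform(-5.0, 15.0)     # SNR sampled from -5 to 15 dB, as above
mixture = mix_at_snr(clean, noise, snr_db)
assert mixture.shape == clean.shape
```

Reverberation would additionally convolve the clean signal with a room impulse response (applied with 50% probability) before mixing.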
#### Training Hyperparameters

- **DeWavLM:** 100k steps, LR 1e-4, batch size 4
- **Vocoder:** 200k steps, LR 2e-4, batch size 12
- Optimizer: AdamW with warmup + cosine decay
- Hardware: 4 × NVIDIA RTX 4090 GPUs
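The warmup-plus-cosine-decay schedule can be sketched as follows (the warmup length and LR floor are illustrative assumptions; the card does not specify them):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float,
          warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup to `base_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# DeWavLM settings from the card: 100k steps, peak LR 1e-4
assert lr_at(0, 100_000, 1e-4) == 0.0                      # starts at zero
assert abs(lr_at(1000, 100_000, 1e-4) - 1e-4) < 1e-12      # peak after warmup
assert lr_at(100_000, 100_000, 1e-4) < 1e-8                # decayed to ~0
```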

#### Speeds, Sizes, Times

- Total parameters: ~382M
- Inference compute: ~21.4 GMAC/s

---
## Evaluation

### Testing Data

- Simulated [LibriTTS](https://www.openslr.org/60/) test set (using the test split)
- [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic), with and without reverberation

### Metrics

- DNSMOS, UTMOS
- LPS, SpeechBERTScore (SBS)
- Speaker similarity (RawNet3)
- WER (OWSM v3.1)
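WER here is computed on OWSM v3.1 transcripts; the metric itself is the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal self-contained sketch (not the evaluation code used for this card):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

assert wer("the cat sat", "the cat sat") == 0.0
assert abs(wer("the cat sat", "the bat sat") - 1 / 3) < 1e-12
```

A low WER on enhanced speech indicates the enhancer did not hallucinate or drop linguistic content.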

### Results

The performance of the released models compared to the paper's results:

| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 | 0.90 | 0.93 | 0.80 | 7.49 |
| **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |

The released versions achieve performance very close to the paper's results on our simulated test set.

Overall, PASE achieves:

- Lowest WER among evaluated generative and discriminative baselines
- Highest speaker similarity (SpkSim)
- Strong perceptual quality with low hallucination rates
- Consistent performance across noisy and reverberant conditions

---
## Bias, Risks, and Limitations

- The model was trained primarily on English speech; performance may degrade for other languages.
- Very strong noise or mismatched reverberation conditions can introduce artifacts.
- Speaker characteristics are preserved, but not guaranteed to be preserved perfectly.

---

### Recommendations

Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.

---
## Citation

If you use PASE in your research, please cite:

```bibtex
@article{PASE,
  title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
  volume={40},
  DOI={10.1609/aaai.v40i39.40562},
  number={39},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
  year={2026},
  month={Mar.},
  pages={32826--32834}
}
```

Copyright © 2026 by Cisco Systems, Inc. All rights reserved.

## Model Card Authorship & Contact

- Mansur Yesilbursa: myesilbu@cisco.com
Vocoder_Dual.tar ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:400ea906dae8d1fccfc108e316925092b60836ad6320e87e5be6a7eeb7d65875
size 266819375
Vocoder_L24.tar ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:910c14256faa1eda574fb11f770eb7efc175e08c7fa67a2f597938fd44f00d87
size 262620164
framework_all.png ADDED
(binary image stored via Git LFS)