Commit 4f3cde90
Parent(s): (none)
Duplicate from cisco-ai/pase
Co-authored-by: Mansur Yesilbursa <myesilbursa@users.noreply.huggingface.co>
- .gitattributes +36 -0
- DeWavLM.tar +3 -0
- README.md +158 -0
- Vocoder_Dual.tar +3 -0
- Vocoder_L24.tar +3 -0
- framework_all.png +3 -0
.gitattributes ADDED
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
framework_all.png filter=lfs diff=lfs merge=lfs -text
DeWavLM.tar ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e347a95ad01d15f0c3ec86804f1177e2b197e16012718f03fde152bb4e78b74f
size 1261989218
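The `.tar` entries in this commit are Git LFS pointer files rather than the archives themselves; each pointer records the spec version, the object's SHA-256 digest, and its byte size. A minimal sketch of reading such a pointer (the `parse_lfs_pointer` helper is illustrative, not part of this repository):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:e347a95ad01d15f0c3ec86804f1177e2b197e16012718f03fde152bb4e78b74f
size 1261989218
"""
info = parse_lfs_pointer(pointer)
assert info["oid"].startswith("sha256:")   # digest of the real archive
assert int(info["size"]) == 1261989218     # ~1.2 GB checkpoint archive
```

After `git lfs pull`, the `oid` digest can be checked against the downloaded file to verify integrity.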
README.md ADDED
@@ -0,0 +1,158 @@
---
license: apache-2.0
pipeline_tag: audio-to-audio
---

# PASE: Phonologically Anchored Speech Enhancer

PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.

---

## Model Details

### Model Description

<img src="framework_all.png" alt="High-level system design" width="80%">

PASE contains two main components:

- **Denoising WavLM (DeWavLM)**
  Fine-tuned from WavLM-Large using denoising representation distillation (DRD).
  Performs robust noise suppression while effectively mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.

- **Dual-Stream Vocoder**
  Reconstructs audio from DeWavLM's dual-stream representations:
  - **Phonetic representation**: high-level linguistic structure
  - **Acoustic representation**: speaker identity and prosody

**Developed by:** Copyright © 2026 by Cisco Systems, Inc. All rights reserved.
**Cisco product group:** Collaboration AI: Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki
**Model type:** Generative Speech Enhancement
**License:** Apache 2.0
**Finetuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)

---

### Model Sources

- **Repository:** https://github.com/cisco-open/pase
- **Paper:** https://arxiv.org/abs/2511.13300
- **Demo:** https://xiaobin-rong.github.io/pase_demo/

---
## Uses

### Direct Use

- Enhance noisy or reverberant speech recordings
- Improve perceptual quality and intelligibility
- Preserve speaker identity and linguistic content
- Supports **16 kHz mono audio**

### Out-of-Scope Use

- Medical, legal, or safety-critical decisions
- Voice conversion or identity manipulation
- Non-speech audio enhancement

---

## How to Get Started

Refer to the repository for quick-start code and examples:
https://github.com/cisco-open/pase

---

## Training Details

### Training Data

We release a PASE checkpoint trained on an updated list of datasets. For this release, training used:

- Clean speech:
  - DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
  - [LibriTTS](https://www.openslr.org/60/)
  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
- Noise:
  - DNS5 Challenge noise resources
- Room impulse responses:
  - [OpenSLR26](https://www.openslr.org/26/)
  - [OpenSLR28](https://www.openslr.org/28/)

These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.

### Dataset Attribution

- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
- LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
- OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.

All audio was resampled to 16 kHz.

### Training Procedure

#### Preprocessing

- Mixtures generated dynamically
- SNR sampled from -5 to 15 dB
- Reverberation applied with 50% probability
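As a rough illustration of the dynamic mixing described above (a minimal sketch, not the authors' actual pipeline; the function name and the uniform SNR sampling are assumptions), noise is scaled so the clean-to-noise power ratio hits the sampled SNR before mixing:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the clean/noise power ratio equals `snr_db`, then mix."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    # Target noise power for the requested SNR: P_n = P_c / 10^(SNR/10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of 16 kHz "speech" stand-in
noise = rng.standard_normal(16000)
snr_db = rng.uniform(-5.0, 15.0)     # SNR sampled from -5 to 15 dB, as above
mixture = mix_at_snr(clean, noise, snr_db)
assert mixture.shape == clean.shape
```

Reverberation would additionally convolve the clean signal with a room impulse response (applied with 50% probability) before mixing.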
#### Training Hyperparameters

- **DeWavLM:** 100k steps, LR 1e-4, batch size 4
- **Vocoder:** 200k steps, LR 2e-4, batch size 12
- Optimizer: AdamW with warmup + cosine decay
- Hardware: 4 × NVIDIA RTX 4090 GPUs
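The warmup-plus-cosine-decay schedule can be sketched as follows (the warmup length and LR floor are illustrative assumptions; the card does not specify them):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float,
          warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warmup to `base_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# DeWavLM settings from the card: 100k steps, peak LR 1e-4
assert lr_at(0, 100_000, 1e-4) == 0.0                      # starts at zero
assert abs(lr_at(1000, 100_000, 1e-4) - 1e-4) < 1e-12      # peak after warmup
assert lr_at(100_000, 100_000, 1e-4) < 1e-8                # decayed to ~0
```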

#### Speeds, Sizes, Times

- Total parameters: ~382M
- Inference compute: ~21.4 GMAC/s

---
## Evaluation

### Testing Data

- Simulated [LibriTTS](https://www.openslr.org/60/) test set (using the test split)
- [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic), with and without reverberation

### Metrics

- DNSMOS, UTMOS
- LPS, SpeechBERTScore (SBS)
- Speaker similarity (RawNet3)
- WER (OWSM v3.1)
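WER here is computed on OWSM v3.1 transcripts; the metric itself is the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal self-contained sketch (not the evaluation code used for this card):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

assert wer("the cat sat", "the cat sat") == 0.0
assert abs(wer("the cat sat", "the bat sat") - 1 / 3) < 1e-12
```

A low WER on enhanced speech indicates the enhancer did not hallucinate or drop linguistic content.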

### Results

The performance of the released models compared to the paper's results:

| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 | 0.90 | 0.93 | 0.80 | 7.49 |
| **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |

The released versions achieve performance very close to the paper's results on our simulated test set.

Overall, PASE achieves:

- Lowest WER among evaluated generative and discriminative baselines
- Highest speaker similarity (SpkSim)
- Strong perceptual quality with low hallucination rates
- Consistent performance across noisy and reverberant conditions

---
## Bias, Risks, and Limitations

- The model was trained primarily on English speech; performance may degrade for other languages.
- Very strong noise or mismatched reverberation conditions can introduce artifacts.
- Speaker characteristics are preserved, but not guaranteed to be preserved perfectly.

---

### Recommendations

Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.

---
## Citation

If you use PASE in your research, please cite:

```bibtex
@article{PASE,
  title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
  volume={40},
  DOI={10.1609/aaai.v40i39.40562},
  number={39},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
  year={2026},
  month={Mar.},
  pages={32826--32834}
}
```

Copyright © 2026 by Cisco Systems, Inc. All rights reserved.

## Model Card Authorship & Contact

- Mansur Yesilbursa: myesilbu@cisco.com
Vocoder_Dual.tar ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:400ea906dae8d1fccfc108e316925092b60836ad6320e87e5be6a7eeb7d65875
size 266819375
Vocoder_L24.tar ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:910c14256faa1eda574fb11f770eb7efc175e08c7fa67a2f597938fd44f00d87
size 262620164
framework_all.png ADDED
(binary image stored via Git LFS)