Xiaobin-Rong and myesilbursa committed
Commit 4f3cde9 · 0 Parent(s):

Duplicate from cisco-ai/pase

Co-authored-by: Mansur Yesilbursa <myesilbursa@users.noreply.huggingface.co>

Files changed (6):
  1. .gitattributes +36 -0
  2. DeWavLM.tar +3 -0
  3. README.md +158 -0
  4. Vocoder_Dual.tar +3 -0
  5. Vocoder_L24.tar +3 -0
  6. framework_all.png +3 -0
.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ framework_all.png filter=lfs diff=lfs merge=lfs -text
DeWavLM.tar ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e347a95ad01d15f0c3ec86804f1177e2b197e16012718f03fde152bb4e78b74f
+ size 1261989218
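The checkpoint archives in this commit are stored as Git LFS pointer files, whose text format (shown above) is simple key/value lines. As an illustration, a minimal parser for that format might look like this; `parse_lfs_pointer` is a hypothetical helper, not part of any Git tooling:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file's text into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, value = line.split(" ", 1)  # each line is "<key> <value>"
        fields[key] = value
    return fields

# The DeWavLM.tar pointer from this commit:
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:e347a95ad01d15f0c3ec86804f1177e2b197e16012718f03fde152bb4e78b74f
size 1261989218"""

info = parse_lfs_pointer(pointer)
# info["size"] is the remote file size in bytes (~1.26 GB here)
```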
README.md ADDED
@@ -0,0 +1,158 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: audio-to-audio
+ ---
+ # PASE: Phonologically Anchored Speech Enhancer
+
+ PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.
+
+ ---
+
+ ## Model Details
+
+ ### Model Description
+
+ <img src="framework_all.png" alt="High-level system design" width="80%">
+
+ PASE contains two main components:
+
+ - **Denoising WavLM (DeWavLM)**
+   Fine‑tuned from WavLM‑Large using denoising representation distillation (DRD).
+   Performs robust noise suppression while effectively mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.
+
+ - **Dual‑Stream Vocoder**
+   Reconstructs audio using DeWavLM's dual-stream representations:
+   - **Phonetic representation**: high-level linguistic structure
+   - **Acoustic representation**: speaker identity and prosody
+
+ **Developed by:** Cisco Systems, Inc. Copyright © 2026 by Cisco Systems, Inc. All rights reserved.
+ **Cisco product group:** Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)
+ **Model type:** Generative speech enhancement
+ **License:** Apache 2.0
+ **Finetuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)
+
+ ---
+
+ ### Model Sources
+
+ - **Repository:** https://github.com/cisco-open/pase
+ - **Paper:** https://arxiv.org/abs/2511.13300
+ - **Demo:** https://xiaobin-rong.github.io/pase_demo/
+
+ ---
+ ## Uses
+ ### Direct Use
+ - Enhance noisy or reverberant speech recordings
+ - Improve perceptual quality and intelligibility
+ - Preserve speaker identity and linguistic content
+ - Supports **16 kHz mono audio**
+ ### Out-of-Scope Use
+ - Medical, legal, or safety‑critical decisions
+ - Voice conversion or identity manipulation
+ - Non‑speech audio enhancement
+ ---
+ ## How to Get Started
+ Refer to the repository for quick-start code and examples:
+ https://github.com/cisco-open/pase
+
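As a purely structural illustration of the two-stage design described above (DeWavLM producing dual-stream features, the vocoder reconstructing audio from them), the data flow can be sketched as follows. The function names, feature dimensions, and the 20 ms frame hop are illustrative placeholders, not the repository's actual API:

```python
import numpy as np

SR = 16_000   # PASE operates on 16 kHz mono audio
HOP = 320     # illustrative 20 ms frame hop at 16 kHz

def dewavlm_stub(wav):
    """Placeholder for DeWavLM: waveform -> (phonetic, acoustic) streams."""
    n_frames = len(wav) // HOP
    phonetic = np.zeros((n_frames, 1024))  # high-level linguistic structure
    acoustic = np.zeros((n_frames, 1024))  # speaker identity and prosody
    return phonetic, acoustic

def vocoder_stub(phonetic, acoustic):
    """Placeholder for the dual-stream vocoder: streams -> waveform."""
    return np.zeros(phonetic.shape[0] * HOP)

noisy = np.random.default_rng(0).standard_normal(SR)  # 1 s of "noisy" audio
enhanced = vocoder_stub(*dewavlm_stub(noisy))
```

For the real model interface, use the repository's quick-start code linked above.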
+ ---
+ ## Training Details
+ ### Training Data
+ We release a PASE checkpoint trained on an updated list of datasets. For this release, training used:
+
+ - Clean speech:
+   - DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
+   - [LibriTTS](https://www.openslr.org/60/)
+   - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
+ - Noise:
+   - DNS5 Challenge noise resources
+ - Room impulse responses:
+   - [OpenSLR26](https://www.openslr.org/26/)
+   - [OpenSLR28](https://www.openslr.org/28/)
+
+ These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.
+
+ ### Dataset Attribution
+ - DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
+ - LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
+ - VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
+ - DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
+ - OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.
+
+ All audio was resampled to 16 kHz.
+
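As an example of this 16 kHz normalization step, here is a minimal sketch using simple linear interpolation. This is for illustration only; the actual preprocessing pipeline is not specified here, and production code should use a proper polyphase or windowed-sinc resampler:

```python
import numpy as np

def to_16k(x, orig_sr, target_sr=16_000):
    """Crude linear-interpolation resampler (illustration only)."""
    n_out = int(len(x) * target_sr / orig_sr)
    # Positions of the output samples on the input's sample grid
    t_out = np.arange(n_out) * (orig_sr / target_sr)
    return np.interp(t_out, np.arange(len(x)), x)

y = to_16k(np.zeros(48_000), 48_000)  # 1 s at 48 kHz -> 16 000 samples
```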
+ ### Training Procedure
+ #### Preprocessing
+ - Mixtures generated dynamically
+ - SNR sampled from –5 to 15 dB
+ - Reverberation applied with 50% probability
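The dynamic mixing recipe above can be sketched in NumPy as follows; the exact mixing code used for training may differ, so treat this as a sketch of the stated recipe (random SNR in [-5, 15] dB, reverberation with probability 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mixture(clean, noise, rir=None):
    """Create one training mixture: optional reverberation (p = 0.5),
    then add noise at an SNR drawn uniformly from [-5, 15] dB."""
    if rir is not None and rng.random() < 0.5:
        clean = np.convolve(clean, rir)[: len(clean)]  # apply room impulse response
    snr_db = rng.uniform(-5.0, 15.0)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise[: len(clean)] ** 2) + 1e-12
    # Scale noise so that 10*log10(p_clean / p_scaled_noise) == snr_db
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise[: len(clean)]

clean = np.sin(2 * np.pi * 440 * np.arange(16_000) / 16_000)  # 1 s tone at 16 kHz
noise = rng.standard_normal(16_000)
mix = make_mixture(clean, noise)
```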
+ #### Training Hyperparameters
+ - **DeWavLM:** 100k steps, LR 1e‑4, batch size 4
+ - **Vocoder:** 200k steps, LR 2e‑4, batch size 12
+ - Optimizer: AdamW with warmup + cosine decay
+ - Hardware: 4 × NVIDIA RTX 4090 GPUs
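The warmup-plus-cosine schedule above can be written down explicitly. The warmup length below is an illustrative assumption (the card does not state it), and the decay-to-zero endpoint is likewise a common convention rather than a documented detail:

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps=1_000):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

# e.g. the DeWavLM stage: 100k steps at peak LR 1e-4
lr_mid = lr_at(50_000, 100_000, 1e-4)  # partway through the cosine decay
```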
+ #### Speeds, Sizes, Times
+ - Total parameters: ~382M
+ - Inference compute: ~21.4 GMAC/s
+ ---
+ ## Evaluation
+ ### Testing Data
+ - Simulated [LibriTTS](https://www.openslr.org/60/) test set (using the test split)
+ - [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic) with/without reverberation
+ ### Metrics
+ - DNSMOS, UTMOS
+ - LPS, SpeechBERTScore (SBS)
+ - Speaker Similarity (RawNet3)
+ - WER (OWSM v3.1)
+
+ ### Results
110
+
111
+ The performance of the released version compared to the paper's results:
112
+ | Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
113
+ |:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
114
+ | Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
115
+ | **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
116
+ | DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
117
+ | **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25
118
+ | PASE (paper) | 3.12 | 3.09 |0.90 |0.93 |0.80 | 7.49 |
119
+ | **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |
120
+
121
+ It can be seen that the released version achieves performance very close to that of the paper's results on our simulated test set.
+
+ Overall, PASE achieves:
+ - Lowest WER among evaluated generative and discriminative baselines
+ - Highest speaker similarity (SpkSim)
+ - Strong perceptual quality with low hallucination rates
+ - Consistent performance across noisy and reverberant conditions
+
+ ---
+ ## Bias, Risks, and Limitations
+ - The model was trained primarily on English speech; performance may degrade for other languages.
+ - Very strong noise or mismatched reverberation conditions can introduce artifacts.
+ - Speaker characteristics are largely preserved, but preservation is not guaranteed.
+
+ ---
+ ### Recommendations
+ Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.
+
+ ---
+ ## Citation
+ If you use PASE in your research, please cite:
+ ```bibtex
+ @article{PASE,
+   title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
+   volume={40},
+   DOI={10.1609/aaai.v40i39.40562},
+   number={39},
+   journal={Proceedings of the AAAI Conference on Artificial Intelligence},
+   author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
+   year={2026},
+   month={Mar.},
+   pages={32826-32834}
+ }
+ ```
+ Copyright © 2026 by Cisco Systems, Inc. All rights reserved.
+ ## Model Card Authorship & Contact
+ - Mansur Yesilbursa: myesilbu@cisco.com
+
Vocoder_Dual.tar ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:400ea906dae8d1fccfc108e316925092b60836ad6320e87e5be6a7eeb7d65875
+ size 266819375
Vocoder_L24.tar ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:910c14256faa1eda574fb11f770eb7efc175e08c7fa67a2f597938fd44f00d87
+ size 262620164
framework_all.png ADDED

Git LFS Details

  • SHA256: 1d39d41e869029b111509fa5300a0d8525c4fdc3929962809c6db9a7cabc4e49
  • Pointer size: 131 Bytes
  • Size of remote file: 203 kB