richiejp committed · commit 79b6332 (verified) · 1 parent: 74fd5a7

Sync model card with upstream GitHub inference README

Files changed (1): README.md (+48 -79)

README.md CHANGED
@@ -20,16 +20,17 @@ acoustic echo cancellation (AEC), noise suppression, and dereverberation of
  - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
  - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
  PyTorch reference included for verification and research
- - Quantization-friendly by design (power-of-2 channel widths, kernel area 16)
- to support future Q4_K / Q8_0 native inference
  - Apache 2.0

  This page is the Hugging Face model card — it hosts the published weights.
  Source code, build system, tests, and training pipeline live in the GitHub
  repository: <https://github.com/LocalAI-io/LocalVQE>.

  The technical report describing the architecture, streaming-state contract,
- and BatchNorm folding rules used for deployment is included in this repo as
  [`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
  like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
  from an existing author in those categories — if you can endorse, please
@@ -42,13 +43,9 @@ reach out via the GitHub repo.
  LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 —
  *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
  Cancellation, Noise Suppression and Dereverberation*,
- [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). It keeps DeepVQE's
- overall topology (mic/far-end encoders, soft-delay cross attention, decoder
- with sub-pixel upsampling, complex convolving mask) but replaces the STFT
- with an in-graph DCT-II filterbank, swaps the GRU bottleneck for a diagonal
- state-space model (S4D), and is ~9× smaller than the reference DeepVQE.
- Everything specific to LocalVQE is original to this repository — there is
- no LocalVQE paper.

  ## A concrete example

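The paragraph removed above replaces the STFT with an in-graph DCT-II filterbank, elsewhere described as a Conv1d filterbank with 512 filters and stride 256. Below is a minimal NumPy sketch of that kind of analysis stage; the frame length, orthonormal scaling, and lack of windowing here are illustrative assumptions, not details of the shipped model.

```python
import numpy as np

def dct_ii_filterbank(n_filters=512, frame_len=512):
    """Orthonormal DCT-II basis, viewable as a bank of FIR analysis filters.

    Row k is the kernel of one Conv1d filter; applying every row to a
    frame of `frame_len` samples yields `n_filters` real coefficients.
    """
    n = np.arange(frame_len)
    k = np.arange(n_filters)[:, None]
    basis = np.cos(np.pi / frame_len * (n + 0.5) * k)
    basis *= np.sqrt(2.0 / frame_len)   # orthonormal scaling
    basis[0] /= np.sqrt(2.0)            # DC row gets an extra 1/sqrt(2)
    return basis                        # shape (n_filters, frame_len)

def analyze(x, basis, hop=256):
    """Strided framing + matrix product, i.e. Conv1d with stride `hop`."""
    frame_len = basis.shape[1]
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames @ basis.T             # shape (n_frames, n_filters)

basis = dct_ii_filterbank()
x = np.random.default_rng(0).standard_normal(16000)
coeffs = analyze(x, basis)
# A square orthonormal basis makes each frame exactly invertible;
# unlike a magnitude spectrogram, there is no phase to reconstruct.
assert np.allclose(coeffs[0] @ basis, x[:512])
```

Everything stays real-valued end to end, which is the property that makes the transform straightforward to express as ordinary real tensor ops in a GGML graph.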
@@ -89,49 +86,38 @@ small fraction of a real-time budget.

  ## Why this, and not DeepVQE?

- Microsoft never released DeepVQE — no weights, no reference implementation,
- no streaming runtime. We re-implemented it from the paper as a GGML graph
- at [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml) (the
- full-width ~7.5 M-parameter version) before starting LocalVQE. Comparing
- that implementation to this one:
-
- | | DeepVQE (our re-implementation) | LocalVQE |
- |---|---|---|
- | Parameters | ~7.5 M | 1.3 M |
- | Weights (F32) | ~30 MB | ~5 MB |
- | Analysis | STFT (complex FFT) | DCT-II (real, in-graph) |
- | Bottleneck | GRU | S4D (diagonal state space) |
- | CCM arithmetic | Complex | Real-valued (GGML-friendly) |
- | Streaming inference | Yes, separate repo | Yes, in this repo |
-
- The smaller parameter count comes from iterative channel pruning of the
- full-width reference, not from distillation; S4D halves the bottleneck
- parameter count vs GRU at similar quality.

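The comparison above swaps the GRU bottleneck for S4D. The step function below is a generic discretised diagonal state-space recurrence in the S4D family; the state size, pole initialisation, and scalar input stream are toy assumptions for illustration, not the trained bottleneck's values.

```python
import numpy as np

def s4d_step(state, u, A_bar, B_bar, C, D):
    """One streaming step of a diagonal state-space model (S4D-style).

    With diagonal A the recurrence is an elementwise multiply, so each
    step costs O(N) instead of the O(N^2) matrix-vector product a dense
    SSM (or a GRU's gating machinery) would need.
    """
    state = A_bar * state + B_bar * u       # x_k = Abar * x_{k-1} + Bbar * u_k
    y = (C * state).sum().real + D * u      # y_k = Re(C x_k) + D u_k
    return state, y

# Toy parameters: zero-order-hold discretisation of stable diagonal poles.
N, dt = 16, 1e-2
A = -0.5 + 1j * np.pi * np.arange(N)        # continuous-time diagonal A
A_bar = np.exp(dt * A)                      # Abar = exp(dt * A)
B_bar = (A_bar - 1.0) / A                   # Bbar = A^-1 (Abar - I) B, with B = 1
C = np.random.default_rng(2).standard_normal(N) + 0j
D = 0.1

state = np.zeros(N, dtype=complex)
ys = []
for u in np.sin(np.linspace(0.0, 4.0 * np.pi, 64)):  # stream 64 scalar inputs
    state, y = s4d_step(state, u, A_bar, B_bar, C, D)
    ys.append(y)
```

The per-step state is exactly the kind of carried tensor a streaming-state contract has to specify: one complex vector per bottleneck channel, updated every hop.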
  ## Files in this repository

  | File | Size | Description |
  |---|---|---|
- | `localvqe-v1-1.3M.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
- | `localvqe-v1-1.3M-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). This is what the C++ inference engine loads. |

- Only F32 GGUF is published today. A `quantize` tool is included in the C++
- build (see below) and the architecture is designed to be Q4_K / Q8_0
- friendly, but quantized weights have not yet been calibrated and released.

  ## Validation Results

- Stratified 150-sample eval (30 per scenario) on the
  [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
  — real recordings, not synthetic mixes.

- | Scenario | AECMOS echo | AECMOS deg | blind ERLE |
- |---|---:|---:|---:|
- | doubletalk | 4.71 | 2.35 | 8.5 dB |
- | doubletalk-with-movement | 4.67 | 2.33 | 8.1 dB |
- | farend-singletalk | 4.12 | 4.94 | 40.6 dB |
- | farend-singletalk-with-movement | 4.31 | 4.98 | 39.0 dB |
- | nearend-singletalk | 5.00 | 4.15 | 1.9 dB |

  - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
  quality predictor. "Echo" rates how well echo was removed; "degradation"
@@ -141,21 +127,6 @@ Stratified 150-sample eval (30 per scenario) on the
  near-end speech it understates echo removal because both numerator and
  denominator are dominated by speech.

- ## Architecture
-
- | Component | Value |
- |---|---|
- | Sample rate | 16 kHz |
- | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
- | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
- | Far-end encoder | 2 blocks: 2 → 32 → 40 |
- | AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
- | Bottleneck | S4D diagonal state-space, hidden 162 |
- | Decoder | 5 sub-pixel conv + BN blocks, mirroring encoder |
- | CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
- | Kernel | (4, 4) time × freq, causal padding |
- | Parameters | 1.3 M |
-
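The removed architecture table notes that the complex convolving mask (CCM) is evaluated with real-valued arithmetic. The core trick is expanding the complex product into four real multiplies, which maps onto frameworks without a complex dtype such as GGML. The sketch below shows the elementwise case only and ignores the 3×3 convolving neighbourhood; the bin count is an arbitrary stand-in.

```python
import numpy as np

def apply_complex_mask_real(spec_re, spec_im, mask_re, mask_im):
    """Apply a complex mask to a complex spectrum using only real tensors.

    (a + ib)(c + id) = (ac - bd) + i(ad + bc): four real multiplies and
    two adds, expressible in any framework that lacks a complex dtype.
    """
    out_re = mask_re * spec_re - mask_im * spec_im
    out_im = mask_re * spec_im + mask_im * spec_re
    return out_re, out_im

rng = np.random.default_rng(1)
sr, si, mr, mi = (rng.standard_normal(257) for _ in range(4))
re_part, im_part = apply_complex_mask_real(sr, si, mr, mi)

# Agrees with NumPy's native complex product.
ref = (mr + 1j * mi) * (sr + 1j * si)
assert np.allclose(re_part, ref.real) and np.allclose(im_part, ref.imag)
```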
  ## Building the C++ Inference Engine

  Source, build system, and tests live at
@@ -199,33 +170,33 @@ glslc`/`shaderc`).

  ### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

- Measured with `bench` on Zen4 desktop (Ryzen 9 7900), 30 iters × 187 hops
- = 5,610 streaming hops per backend. Each hop is a full
- `ggml_backend_graph_compute`.

- | Backend | p50 | p99 | max (quiet) | max (with load) |
- |-----------------------------|--------:|--------:|------------:|----------------:|
- | CPU — 1 thread | 3.46 ms | 3.59 ms | 4.93 ms | — |
- | CPU — 2 threads | 2.05 ms | 2.17 ms | 3.34 ms | — |
- | CPU — 4 threads | 1.26 ms | 1.48 ms | 3.07 ms | — |
- | Vulkan — AMD iGPU (RADV) | 1.68 ms | 1.77 ms | 3.40 ms | 37.50 ms |
- | Vulkan — NVIDIA RTX 5070 Ti | 1.68 ms | 1.79 ms | 3.40 ms | 31.72 ms |

  Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
- shared desktop is sensitive to external GPU clients (display compositor,
- browser). On a dedicated embedded device with no compositor contending
- for the queue, the "quiet" column is what you'll see.

  ## Running Inference

- Download `localvqe-v1-1.3M-f32.gguf` from this repository (the file list above)
  either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
  `huggingface_hub`. Then:

  ### CLI

  ```bash
- ./ggml/build/bin/localvqe localvqe-v1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav \
  --out-wav enhanced.wav
  ```
@@ -235,7 +206,7 @@ Expects 16 kHz mono PCM for both mic and far-end reference.

  ### Benchmark

  ```bash
- ./ggml/build/bin/bench localvqe-v1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav --iters 10 --profile
  ```

@@ -252,14 +223,12 @@ integration.

  ### Quantizing (experimental)

- The model was designed with quantization in mind — power-of-two channel
- widths, kernel area 16, GGML-friendly real-valued arithmetic — but
- calibrated Q4_K / Q8_0 weights are not yet published. The `quantize` tool
- in the C++ build can produce GGUF variants from the F32 reference for
- experimentation:

  ```bash
- ./ggml/build/bin/quantize localvqe-v1-1.3M-f32.gguf localvqe-v1-1.3M-q8.gguf Q8_0
  ```

  Expect end-to-end quality loss until proper per-tensor selection and
@@ -267,7 +236,7 @@ calibration have been worked through.

  ## PyTorch Reference

- `localvqe-v1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
  It is provided for verification, ablation, and downstream research — not
  for end-user inference, which should go through the GGML build above. The
  model definition lives under `pytorch/` in the
 
  - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
  - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
  PyTorch reference included for verification and research
  - Apache 2.0

  This page is the Hugging Face model card — it hosts the published weights.
  Source code, build system, tests, and training pipeline live in the GitHub
  repository: <https://github.com/LocalAI-io/LocalVQE>.

+ The current release is **v1.1**, which fixes the intermittent crackling that
+ the previous release produced under heavy background noise.
+
  The technical report describing the architecture, streaming-state contract,
+ and streaming-causal normalisation operator is included in this repo as
  [`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
  like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
  from an existing author in those categories — if you can endorse, please
 
  LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 —
  *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
  Cancellation, Noise Suppression and Dereverberation*,
+ [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)) — smaller, GGML-native,
+ and tuned for streaming CPU inference. The architecture is documented in
+ the technical report linked above.

  ## A concrete example

 
  ## Why this, and not DeepVQE?

+ Microsoft never released DeepVQE — no weights, no reference
+ implementation, no streaming runtime. We re-implemented it from the
+ paper as a GGML graph at
+ [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
+ (the full-width ~7.5 M-parameter version) before starting LocalVQE.
+ LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
+ (~5 MB F32), small enough to run on commodity CPUs in real time.

  ## Files in this repository

  | File | Size | Description |
  |---|---|---|
+ | `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
+ | `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export — what the C++ inference engine loads. |

+ Only F32 GGUF is published today. A `quantize` tool is included in the
+ C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
+ released.

  ## Validation Results

+ Full 800-clip eval on the
  [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
  — real recordings, not synthetic mixes.

+ | Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
+ |-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
+ | doubletalk | 115 | 4.70 | 2.35 | 8.4 dB | 2.85 |
+ | doubletalk-with-movement | 185 | 4.63 | 2.35 | 8.3 dB | 2.80 |
+ | farend-singletalk | 107 | 2.98 | 4.91 | 44.7 dB | 1.93 |
+ | farend-singletalk-with-movement | 193 | 3.40 | 4.95 | 45.0 dB | 1.91 |
+ | nearend-singletalk | 200 | 4.99 | 4.05 | 2.5 dB | 3.13 |

  - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
  quality predictor. "Echo" rates how well echo was removed; "degradation"

  near-end speech it understates echo removal because both numerator and
  denominator are dominated by speech.

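The caveat above (blind ERLE understating echo removal during near-end speech) falls straight out of the definition. The sketch below uses one common blind formulation, ERLE = 10*log10(E[mic^2] / E[out^2]); the signals are synthetic stand-ins, not AEC Challenge clips.

```python
import numpy as np

def blind_erle_db(mic, out, eps=1e-12):
    """Echo return loss enhancement, estimated blind from mic vs output.

    With no echo-only reference, during near-end speech both numerator
    and denominator are dominated by the talker, so the value collapses
    toward 0 dB even when the canceller is doing its job.
    """
    return 10.0 * np.log10((np.mean(mic**2) + eps) / (np.mean(out**2) + eps))

rng = np.random.default_rng(3)
echo = rng.standard_normal(16000) * 0.5
speech = rng.standard_normal(16000)

# Far-end single-talk: mic is mostly echo, removing ~99% of its amplitude
# shows up as a large ERLE.
print(round(blind_erle_db(echo, echo * 0.01), 1))   # 40.0

# Near-end single-talk: speech passes through untouched, ERLE ~ 0 dB
# regardless of how much echo was (or wasn't) there.
print(round(blind_erle_db(speech, speech), 1))      # 0.0
```

This matches the shape of the table above: tens of dB on farend-singletalk, a couple of dB on nearend-singletalk.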
  ## Building the C++ Inference Engine

  Source, build system, and tests live at
  ### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

+ Measured with `bench` on Zen4 desktop (Ryzen 9 7900). Each hop is a
+ full `ggml_backend_graph_compute`.

+ | Backend | Threads | p50 | p99 | max |
+ |-----------------------------|--------:|--------:|--------:|--------:|
+ | CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
+ | CPU | 2 | 2.07 ms | 2.25 ms | 3.65 ms |
+ | CPU | 4 | 1.32 ms | 1.57 ms | 6.91 ms |
+ | Vulkan — AMD iGPU (RADV) | — | 4.43 ms | 4.62 ms | 5.07 ms |
+ | Vulkan — NVIDIA RTX 5070 Ti | — | 1.79 ms | 3.41 ms | 4.14 ms |

  Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
+ shared desktop is sensitive to external GPU clients (display
+ compositor, browser). On a dedicated embedded device with no
+ compositor contending for the queue, expect the quieter end of the
+ range.

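A quick sanity check of the heading's hop-budget arithmetic and the headroom implied by the table's p50 figures (values copied from the table; the real-time-factor framing is an editorial addition):

```python
# Time one 256-sample hop represents at 16 kHz, and headroom at measured p50s.
SAMPLE_RATE = 16_000
HOP = 256

budget_ms = 1000 * HOP / SAMPLE_RATE   # ms of audio produced per hop
assert budget_ms == 16.0               # the 16 ms budget in the heading

p50_ms = {
    "CPU, 1 thread": 3.40,
    "CPU, 4 threads": 1.32,
    "Vulkan, RTX 5070 Ti": 1.79,
}
for name, ms in p50_ms.items():
    rtf = ms / budget_ms               # fraction of real time consumed
    print(f"{name}: {rtf:.1%} of budget, {budget_ms - ms:.2f} ms headroom")
```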
  ## Running Inference

+ Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (the file list above)
  either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
  `huggingface_hub`. Then:

  ### CLI

  ```bash
+ ./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav \
  --out-wav enhanced.wav
  ```
 
  ### Benchmark

  ```bash
+ ./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav --iters 10 --profile
  ```

 
 
  ### Quantizing (experimental)

+ Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
+ tool in the C++ build can produce GGUF variants from the F32 reference
+ for experimentation:

  ```bash
+ ./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
  ```

  Expect end-to-end quality loss until proper per-tensor selection and
 
  ## PyTorch Reference

+ `localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
  It is provided for verification, ablation, and downstream research — not
  for end-user inference, which should go through the GGML build above. The
  model definition lives under `pytorch/` in the