Sync model card with upstream GitHub inference README

README.md CHANGED

@@ -20,16 +20,17 @@ acoustic echo cancellation (AEC), noise suppression, and dereverberation of
 - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
 - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
   PyTorch reference included for verification and research
-- Quantization-friendly by design (power-of-2 channel widths, kernel area 16)
-  to support future Q4_K / Q8_0 native inference
 - Apache 2.0

 This page is the Hugging Face model card – it hosts the published weights.
 Source code, build system, tests, and training pipeline live in the GitHub
 repository: <https://github.com/LocalAI-io/LocalVQE>.

 The technical report describing the architecture, streaming-state contract,
-and
 [`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
 like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
 from an existing author in those categories – if you can endorse, please

@@ -42,13 +43,9 @@ reach out via the GitHub repo.
 LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 –
 *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
 Cancellation, Noise Suppression and Dereverberation*,
-[arXiv:2306.03177](https://arxiv.org/abs/2306.03177))
-
-
-with an in-graph DCT-II filterbank, swaps the GRU bottleneck for a diagonal
-state-space model (S4D), and is ~9× smaller than the reference DeepVQE.
-Everything specific to LocalVQE is original to this repository – there is
-no LocalVQE paper.

 ## A concrete example

@@ -89,49 +86,38 @@ small fraction of a real-time budget.

 ## Why this, and not DeepVQE?

-Microsoft never released DeepVQE – no weights, no reference
-implementation, no streaming runtime. We re-implemented it from the
-
-
-
-
-
-|---|---|---|
-| Parameters | ~7.5 M | 1.3 M |
-| Weights (F32) | ~30 MB | ~5 MB |
-| Analysis | STFT (complex FFT) | DCT-II (real, in-graph) |
-| Bottleneck | GRU | S4D (diagonal state space) |
-| CCM arithmetic | Complex | Real-valued (GGML-friendly) |
-| Streaming inference | Yes, separate repo | Yes, in this repo |
-
-The smaller parameter count comes from iterative channel pruning of the
-full-width reference, not from distillation; S4D halves the bottleneck
-parameter count vs GRU at similar quality.

 ## Files in this repository

 | File | Size | Description |
 |---|---|---|
-| `localvqe-v1-1.3M.pt` | 11 MB | PyTorch checkpoint – DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
-| `localvqe-v1-1.3M-f32.gguf` | 5 MB | GGML F32 export

-Only F32 GGUF is published today. A `quantize` tool is included in the
-build (see below)
-

 ## Validation Results

-Stratified 150-sample eval (30 per scenario) on the
 [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
 – real recordings, not synthetic mixes.

-| Scenario | AECMOS echo | AECMOS deg | blind ERLE |
-|---|---:|---:|---:|
-| doubletalk
-| doubletalk-with-movement
-| farend-singletalk
-| farend-singletalk-with-movement
-| nearend-singletalk

 - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
 quality predictor. "Echo" rates how well echo was removed; "degradation"

@@ -141,21 +127,6 @@ Stratified 150-sample eval (30 per scenario) on the
 near-end speech it understates echo removal because both numerator and
 denominator are dominated by speech.

-## Architecture
-
-| Component | Value |
-|---|---|
-| Sample rate | 16 kHz |
-| Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
-| Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
-| Far-end encoder | 2 blocks: 2 → 32 → 40 |
-| AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
-| Bottleneck | S4D diagonal state-space, hidden 162 |
-| Decoder | 5 sub-pixel conv + BN blocks, mirroring encoder |
-| CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
-| Kernel | (4, 4) time × freq, causal padding |
-| Parameters | 1.3 M |
-

 ## Building the C++ Inference Engine

 Source, build system, and tests live at

@@ -199,33 +170,33 @@ `glslc`/`shaderc`).

 ### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

-Measured with `bench` on Zen4 desktop (Ryzen 9 7900)
-
-`ggml_backend_graph_compute`.

-| Backend | p50 | p99 | max
-|-----------------------------|--------:|--------:|--------
-| CPU
-| CPU
-| CPU
-| Vulkan – AMD iGPU (RADV) |
-| Vulkan – NVIDIA RTX 5070 Ti |

 Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
-shared desktop is sensitive to external GPU clients (display
-browser). On a dedicated embedded device with no
-for the queue, the

 ## Running Inference

-Download `localvqe-v1-1.3M-f32.gguf` from this repository (the file list above)
 either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
 `huggingface_hub`. Then:

 ### CLI

 ```bash
-./ggml/build/bin/localvqe localvqe-v1-1.3M-f32.gguf \
 --in-wav mic.wav ref.wav \
 --out-wav enhanced.wav
 ```

@@ -235,7 +206,7 @@ Expects 16 kHz mono PCM for both mic and far-end reference.
 ### Benchmark

 ```bash
-./ggml/build/bin/bench localvqe-v1-1.3M-f32.gguf \
 --in-wav mic.wav ref.wav --iters 10 --profile
 ```

@@ -252,14 +223,12 @@ integration.

 ### Quantizing (experimental)

-
-
-
-in the C++ build can produce GGUF variants from the F32 reference for
-experimentation:

 ```bash
-./ggml/build/bin/quantize localvqe-v1-1.3M-f32.gguf localvqe-v1-1.3M-q8.gguf Q8_0
 ```

 Expect end-to-end quality loss until proper per-tensor selection and

@@ -267,7 +236,7 @@ calibration have been worked through.

 ## PyTorch Reference

-`localvqe-v1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
 It is provided for verification, ablation, and downstream research – not
 for end-user inference, which should go through the GGML build above. The
 model definition lives under `pytorch/` in the

- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
  PyTorch reference included for verification and research
- Apache 2.0

This page is the Hugging Face model card – it hosts the published weights.
Source code, build system, tests, and training pipeline live in the GitHub
repository: <https://github.com/LocalAI-io/LocalVQE>.

The current release is **v1.1**, which fixes intermittent crackling that the
previous release produced under heavy background noise.

The technical report describing the architecture, streaming-state contract,
and streaming-causal normalisation operator is included in this repo as
[`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
from an existing author in those categories – if you can endorse, please
reach out via the GitHub repo.

LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 –
*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
Cancellation, Noise Suppression and Dereverberation*,
[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)) – smaller, GGML-native,
and tuned for streaming CPU inference. The architecture is documented in
the technical report linked above.
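For intuition on the in-graph DCT-II analysis front-end (a frozen Conv1d filterbank with 512 filters and stride 256 in the real model, per the architecture notes on this page), here is a minimal orthonormal DCT-II basis in pure Python – an illustrative sketch of the transform, not code from the repository:

```python
import math

def dct2_basis(n):
    # Orthonormal DCT-II basis, one row per analysis filter. The model
    # freezes such a basis into a Conv1d filterbank (512 filters, stride
    # 256 at 16 kHz); n is kept small here purely for illustration.
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                     for t in range(n)])
    return rows

# Orthonormality check: the Gram matrix (B times B-transpose) should be
# the identity, i.e. the analysis basis loses no information.
B = dct2_basis(8)
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in B] for r1 in B]
print(all(abs(gram[i][j] - (1.0 if i == j else 0.0)) < 1e-9
          for i in range(8) for j in range(8)))  # True
```

Because the basis is real-valued, the whole analysis stays in real arithmetic, which is what makes it GGML-friendly compared to a complex STFT.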

## A concrete example

## Why this, and not DeepVQE?

Microsoft never released DeepVQE – no weights, no reference
implementation, no streaming runtime. We re-implemented it from the
paper as a GGML graph at
[richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
(the full-width ~7.5 M-parameter version) before starting LocalVQE.
LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
(~5 MB F32), small enough to run on commodity CPUs in real time.
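The footprint figure follows from back-of-envelope arithmetic (my own sanity check, not from the repository):

```python
# 1.3 M parameters stored as F32 (4 bytes each) -> roughly 5 MB on disk.
params = 1_300_000
f32_mb = params * 4 / 1e6
print(f"{f32_mb:.1f} MB")  # 5.2 MB
```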

## Files in this repository

| File | Size | Description |
|---|---|---|
| `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint – DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export – what the C++ inference engine loads. |

Only F32 GGUF is published today. A `quantize` tool is included in the
C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
released.

## Validation Results

Full 800-clip eval on the
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
– real recordings, not synthetic mixes.

| Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
| doubletalk | 115 | 4.70 | 2.35 | 8.4 dB | 2.85 |
| doubletalk-with-movement | 185 | 4.63 | 2.35 | 8.3 dB | 2.80 |
| farend-singletalk | 107 | 2.98 | 4.91 | 44.7 dB | 1.93 |
| farend-singletalk-with-movement | 193 | 3.40 | 4.95 | 45.0 dB | 1.91 |
| nearend-singletalk | 200 | 4.99 | 4.05 | 2.5 dB | 3.13 |

- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
quality predictor. "Echo" rates how well echo was removed; "degradation"

near-end speech it understates echo removal because both numerator and
denominator are dominated by speech.
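The page does not spell out its blind-ERLE recipe; the conventional definition is the energy ratio between the mic signal and the processed output, which also shows why near-end speech deflates the number. A sketch under that assumption (not the repository's eval code):

```python
import math

def blind_erle_db(mic, out):
    # Blind ERLE: energy of the mic signal over energy of the processed
    # output, in dB. During far-end single talk this approximates true
    # echo attenuation; when near-end speech is present, both energies
    # are dominated by speech and the ratio understates echo removal.
    e_mic = sum(x * x for x in mic)
    e_out = sum(x * x for x in out)
    return 10.0 * math.log10(e_mic / e_out)

# Toy example: output attenuated 10x in amplitude -> 100x in energy -> 20 dB.
mic = [0.5, -0.4, 0.3, -0.2]
out = [x / 10 for x in mic]
print(round(blind_erle_db(mic, out), 1))  # 20.0
```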

## Building the C++ Inference Engine

Source, build system, and tests live at

### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

Measured with `bench` on a Zen4 desktop (Ryzen 9 7900). Each hop is a
full `ggml_backend_graph_compute`.

| Backend | Threads | p50 | p99 | max |
|-----------------------------|--------:|--------:|--------:|--------:|
| CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
| CPU | 2 | 2.07 ms | 2.25 ms | 3.65 ms |
| CPU | 4 | 1.32 ms | 1.57 ms | 6.91 ms |
| Vulkan – AMD iGPU (RADV) | – | 4.43 ms | 4.62 ms | 5.07 ms |
| Vulkan – NVIDIA RTX 5070 Ti | – | 1.79 ms | 3.41 ms | 4.14 ms |

Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
shared desktop is sensitive to external GPU clients (display
compositor, browser). On a dedicated embedded device with no
compositor contending for the queue, expect the quieter end of the
range.
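One way to read the table: divide per-hop compute time by the 16 ms hop duration to get a real-time factor (quick arithmetic of mine, not output from `bench`):

```python
# Headroom against the per-hop budget: 256 samples at 16 kHz = 16 ms.
# Real-time factor = compute time per hop / hop duration (< 1 is real time).
hop_samples, sample_rate = 256, 16_000
budget_ms = 1000 * hop_samples / sample_rate   # 16.0 ms
p50_cpu1_ms = 3.40                             # single-thread CPU p50 above
rtf = p50_cpu1_ms / budget_ms
print(f"budget {budget_ms:.1f} ms, RTF {rtf:.2f}")  # budget 16.0 ms, RTF 0.21
```

Even the slowest p50 in the table stays well under a quarter of the budget, which is what the "small fraction of a real-time budget" claim earlier on this page refers to.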

## Running Inference

Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (see the file list above),
either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
`huggingface_hub`. Then:

### CLI

```bash
./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav \
  --out-wav enhanced.wav
```

Expects 16 kHz mono PCM for both mic and far-end reference.
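Both WAVs must be 16 kHz mono PCM. A stdlib sketch for generating a compliant test input (the helper name and filenames are mine, not from the repository):

```python
import math
import struct
import wave

def write_tone_wav(path, seconds=1.0, freq=440.0, rate=16_000):
    """Write a 16 kHz mono 16-bit PCM WAV -- the input format the CLI expects."""
    n = int(seconds * rate)
    samples = (int(32767 * 0.2 * math.sin(2 * math.pi * freq * t / rate))
               for t in range(n))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)     # mono
        w.setsampwidth(2)     # 16-bit PCM
        w.setframerate(rate)  # 16 kHz
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

write_tone_wav("mic.wav")
with wave.open("mic.wav", "rb") as w:
    print(w.getnchannels(), w.getframerate())  # 1 16000
```

For real recordings, resample to 16 kHz mono first (e.g. with `sox` or `ffmpeg`) before passing them to `localvqe`.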

### Benchmark

```bash
./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav --iters 10 --profile
```

### Quantizing (experimental)

Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
tool in the C++ build can produce GGUF variants from the F32 reference
for experimentation:

```bash
./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
```

Expect end-to-end quality loss until proper per-tensor selection and
calibration have been worked through.
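For a sense of what Q8_0 would buy: GGML's Q8_0 stores each block of 32 weights as 32 int8 values plus one f16 scale, i.e. 34 bytes per block. A rough estimate of mine, assuming every tensor is quantized and ignoring GGUF metadata:

```python
# GGML Q8_0 layout: blocks of 32 weights -> 32 int8 + one f16 scale = 34 bytes.
params = 1_300_000
q8_mb = (params / 32) * 34 / 1e6
print(f"~{q8_mb:.1f} MB vs ~{params * 4 / 1e6:.1f} MB for F32")
# ~1.4 MB vs ~5.2 MB for F32
```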

## PyTorch Reference

`localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
It is provided for verification, ablation, and downstream research – not
for end-user inference, which should go through the GGML build above. The
model definition lives under `pytorch/` in the
GitHub repository linked above.