Sync model card with upstream GitHub inference README

README.md CHANGED

@@ -20,16 +20,17 @@ acoustic echo cancellation (AEC), noise suppression, and dereverberation of
 - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
 - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
   PyTorch reference included for verification and research
-- Quantization-friendly by design (power-of-2 channel widths, kernel area 16)
-  to support future Q4_K / Q8_0 native inference
 - Apache 2.0

 This page is the Hugging Face model card – it hosts the published weights.
 Source code, build system, tests, and training pipeline live in the GitHub
 repository: <https://github.com/LocalAI-io/LocalVQE>.

 The technical report describing the architecture, streaming-state contract,
-and
 [`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
 like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
 from an existing author in those categories – if you can endorse, please

@@ -42,13 +43,9 @@ reach out via the GitHub repo.
 LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 –
 *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
 Cancellation, Noise Suppression and Dereverberation*,
-[arXiv:2306.03177](https://arxiv.org/abs/2306.03177))
-
-
-with an in-graph DCT-II filterbank, swaps the GRU bottleneck for a diagonal
-state-space model (S4D), and is ~9× smaller than the reference DeepVQE.
-Everything specific to LocalVQE is original to this repository – there is
-no LocalVQE paper.

 ## A concrete example

@@ -89,49 +86,38 @@ small fraction of a real-time budget.

 ## Why this, and not DeepVQE?

-Microsoft never released DeepVQE – no weights, no reference
-implementation, no streaming runtime. We re-implemented it from the
-
-
-
-
-
-|---|---|---|
-| Parameters | ~7.5 M | 1.3 M |
-| Weights (F32) | ~30 MB | ~5 MB |
-| Analysis | STFT (complex FFT) | DCT-II (real, in-graph) |
-| Bottleneck | GRU | S4D (diagonal state space) |
-| CCM arithmetic | Complex | Real-valued (GGML-friendly) |
-| Streaming inference | Yes, separate repo | Yes, in this repo |
-
-The smaller parameter count comes from iterative channel pruning of the
-full-width reference, not from distillation; S4D halves the bottleneck
-parameter count vs GRU at similar quality.

 ## Files in this repository

 | File | Size | Description |
 |---|---|---|
-| `localvqe-v1-1.3M.pt` | 11 MB | PyTorch checkpoint – DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
-| `localvqe-v1-1.3M-f32.gguf` | 5 MB | GGML F32 export

-Only F32 GGUF is published today. A `quantize` tool is included in the
-build (see below)
-

 ## Validation Results

-Stratified 150-sample eval (30 per scenario) on the
 [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
 – real recordings, not synthetic mixes.

-| Scenario | AECMOS echo | AECMOS deg | blind ERLE |
-|---|---:|---:|---:|
-| doubletalk
-| doubletalk-with-movement
-| farend-singletalk
-| farend-singletalk-with-movement
-| nearend-singletalk

 - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
 quality predictor. "Echo" rates how well echo was removed; "degradation"

@@ -141,21 +127,6 @@ Stratified 150-sample eval (30 per scenario) on the
 near-end speech it understates echo removal because both numerator and
 denominator are dominated by speech.

-## Architecture
-
-| Component | Value |
-|---|---|
-| Sample rate | 16 kHz |
-| Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
-| Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
-| Far-end encoder | 2 blocks: 2 → 32 → 40 |
-| AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
-| Bottleneck | S4D diagonal state-space, hidden 162 |
-| Decoder | 5 sub-pixel conv + BN blocks, mirroring encoder |
-| CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
-| Kernel | (4, 4) time × freq, causal padding |
-| Parameters | 1.3 M |
-

 ## Building the C++ Inference Engine

 Source, build system, and tests live at

@@ -199,33 +170,33 @@ `glslc`/`shaderc`).

 ### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

-Measured with `bench` on Zen4 desktop (Ryzen 9 7900)
-
-`ggml_backend_graph_compute`.

-| Backend | p50 | p99 | max
-|-----------------------------|--------:|--------:|--------
-| CPU
-| CPU
-| CPU
-| Vulkan – AMD iGPU (RADV) |
-| Vulkan – NVIDIA RTX 5070 Ti |

 Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
-shared desktop is sensitive to external GPU clients (display
-browser). On a dedicated embedded device with no
-for the queue, the

 ## Running Inference

-Download `localvqe-v1-1.3M-f32.gguf` from this repository (the file list above)
 either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
 `huggingface_hub`. Then:

 ### CLI

 ```bash
-./ggml/build/bin/localvqe localvqe-v1-1.3M-f32.gguf \
 --in-wav mic.wav ref.wav \
 --out-wav enhanced.wav
 ```

@@ -235,7 +206,7 @@ Expects 16 kHz mono PCM for both mic and far-end reference.
 ### Benchmark

 ```bash
-./ggml/build/bin/bench localvqe-v1-1.3M-f32.gguf \
 --in-wav mic.wav ref.wav --iters 10 --profile
 ```

@@ -252,14 +223,12 @@ integration.

 ### Quantizing (experimental)

-
-
-
-in the C++ build can produce GGUF variants from the F32 reference for
-experimentation:

 ```bash
-./ggml/build/bin/quantize localvqe-v1-1.3M-f32.gguf localvqe-v1-1.3M-q8.gguf Q8_0
 ```

 Expect end-to-end quality loss until proper per-tensor selection and

@@ -267,7 +236,7 @@ calibration have been worked through.

 ## PyTorch Reference

-`localvqe-v1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
 It is provided for verification, ablation, and downstream research – not
 for end-user inference, which should go through the GGML build above. The
 model definition lives under `pytorch/` in the

- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
  PyTorch reference included for verification and research
- Apache 2.0

This page is the Hugging Face model card – it hosts the published weights.
Source code, build system, tests, and training pipeline live in the GitHub
repository: <https://github.com/LocalAI-io/LocalVQE>.

The current release is **v1.1**, which fixes intermittent crackling that the
previous release produced under heavy background noise.

The technical report describing the architecture, streaming-state contract,
and streaming-causal normalisation operator is included in this repo as
[`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
from an existing author in those categories – if you can endorse, please
reach out via the GitHub repo.

LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 –
*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
Cancellation, Noise Suppression and Dereverberation*,
[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)) – smaller, GGML-native,
and tuned for streaming CPU inference. The architecture is documented in
the technical report linked above.
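For intuition on the in-graph DCT-II analysis front-end (a frozen Conv1d filterbank with 512 filters and stride 256 in the real model, per the architecture notes on this page), here is a minimal orthonormal DCT-II basis in pure Python – an illustrative sketch of the transform, not code from the repository:

```python
import math

def dct2_basis(n):
    # Orthonormal DCT-II basis, one row per analysis filter. The model
    # freezes such a basis into a Conv1d filterbank (512 filters, stride
    # 256 at 16 kHz); n is kept small here purely for illustration.
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                     for t in range(n)])
    return rows

# Orthonormality check: the Gram matrix (B times B-transpose) should be
# the identity, i.e. the analysis basis loses no information.
B = dct2_basis(8)
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in B] for r1 in B]
print(all(abs(gram[i][j] - (1.0 if i == j else 0.0)) < 1e-9
          for i in range(8) for j in range(8)))  # True
```

Because the basis is real-valued, the whole analysis stays in real arithmetic, which is what makes it GGML-friendly compared to a complex STFT.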

## A concrete example

## Why this, and not DeepVQE?

Microsoft never released DeepVQE – no weights, no reference
implementation, no streaming runtime. We re-implemented it from the
paper as a GGML graph at
[richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
(the full-width ~7.5 M-parameter version) before starting LocalVQE.
LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
(~5 MB F32), small enough to run on commodity CPUs in real time.
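The footprint figure follows from back-of-envelope arithmetic (my own sanity check, not from the repository):

```python
# 1.3 M parameters stored as F32 (4 bytes each) -> roughly 5 MB on disk.
params = 1_300_000
f32_mb = params * 4 / 1e6
print(f"{f32_mb:.1f} MB")  # 5.2 MB
```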

## Files in this repository

| File | Size | Description |
|---|---|---|
| `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint – DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export – what the C++ inference engine loads. |

Only F32 GGUF is published today. A `quantize` tool is included in the
C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
released.

## Validation Results

Full 800-clip eval on the
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
– real recordings, not synthetic mixes.

| Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
| doubletalk | 115 | 4.70 | 2.35 | 8.4 dB | 2.85 |
| doubletalk-with-movement | 185 | 4.63 | 2.35 | 8.3 dB | 2.80 |
| farend-singletalk | 107 | 2.98 | 4.91 | 44.7 dB | 1.93 |
| farend-singletalk-with-movement | 193 | 3.40 | 4.95 | 45.0 dB | 1.91 |
| nearend-singletalk | 200 | 4.99 | 4.05 | 2.5 dB | 3.13 |

- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
quality predictor. "Echo" rates how well echo was removed; "degradation"

near-end speech it understates echo removal because both numerator and
denominator are dominated by speech.
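The page does not spell out its blind-ERLE recipe; the conventional definition is the energy ratio between the mic signal and the processed output, which also shows why near-end speech deflates the number. A sketch under that assumption (not the repository's eval code):

```python
import math

def blind_erle_db(mic, out):
    # Blind ERLE: energy of the mic signal over energy of the processed
    # output, in dB. During far-end single talk this approximates true
    # echo attenuation; when near-end speech is present, both energies
    # are dominated by speech and the ratio understates echo removal.
    e_mic = sum(x * x for x in mic)
    e_out = sum(x * x for x in out)
    return 10.0 * math.log10(e_mic / e_out)

# Toy example: output attenuated 10x in amplitude -> 100x in energy -> 20 dB.
mic = [0.5, -0.4, 0.3, -0.2]
out = [x / 10 for x in mic]
print(round(blind_erle_db(mic, out), 1))  # 20.0
```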

## Building the C++ Inference Engine

Source, build system, and tests live at

### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

Measured with `bench` on a Zen4 desktop (Ryzen 9 7900). Each hop is a
full `ggml_backend_graph_compute`.

| Backend | Threads | p50 | p99 | max |
|-----------------------------|--------:|--------:|--------:|--------:|
| CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
| CPU | 2 | 2.07 ms | 2.25 ms | 3.65 ms |
| CPU | 4 | 1.32 ms | 1.57 ms | 6.91 ms |
| Vulkan – AMD iGPU (RADV) | – | 4.43 ms | 4.62 ms | 5.07 ms |
| Vulkan – NVIDIA RTX 5070 Ti | – | 1.79 ms | 3.41 ms | 4.14 ms |

Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
shared desktop is sensitive to external GPU clients (display
compositor, browser). On a dedicated embedded device with no
compositor contending for the queue, expect the quieter end of the
range.
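One way to read the table: divide per-hop compute time by the 16 ms hop duration to get a real-time factor (quick arithmetic of mine, not output from `bench`):

```python
# Headroom against the per-hop budget: 256 samples at 16 kHz = 16 ms.
# Real-time factor = compute time per hop / hop duration (< 1 is real time).
hop_samples, sample_rate = 256, 16_000
budget_ms = 1000 * hop_samples / sample_rate   # 16.0 ms
p50_cpu1_ms = 3.40                             # single-thread CPU p50 above
rtf = p50_cpu1_ms / budget_ms
print(f"budget {budget_ms:.1f} ms, RTF {rtf:.2f}")  # budget 16.0 ms, RTF 0.21
```

Even the slowest p50 in the table stays well under a quarter of the budget, which is what the "small fraction of a real-time budget" claim earlier on this page refers to.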

## Running Inference

Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (see the file list above),
either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
`huggingface_hub`. Then:

### CLI

```bash
./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav \
  --out-wav enhanced.wav
```

Expects 16 kHz mono PCM for both mic and far-end reference.
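Both WAVs must be 16 kHz mono PCM. A stdlib sketch for generating a compliant test input (the helper name and filenames are mine, not from the repository):

```python
import math
import struct
import wave

def write_tone_wav(path, seconds=1.0, freq=440.0, rate=16_000):
    """Write a 16 kHz mono 16-bit PCM WAV -- the input format the CLI expects."""
    n = int(seconds * rate)
    samples = (int(32767 * 0.2 * math.sin(2 * math.pi * freq * t / rate))
               for t in range(n))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)     # mono
        w.setsampwidth(2)     # 16-bit PCM
        w.setframerate(rate)  # 16 kHz
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

write_tone_wav("mic.wav")
with wave.open("mic.wav", "rb") as w:
    print(w.getnchannels(), w.getframerate())  # 1 16000
```

For real recordings, resample to 16 kHz mono first (e.g. with `sox` or `ffmpeg`) before passing them to `localvqe`.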

### Benchmark

```bash
./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav --iters 10 --profile
```

### Quantizing (experimental)

Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
tool in the C++ build can produce GGUF variants from the F32 reference
for experimentation:

```bash
./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
```

Expect end-to-end quality loss until proper per-tensor selection and
calibration have been worked through.
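For a sense of what Q8_0 would buy: GGML's Q8_0 stores each block of 32 weights as 32 int8 values plus one f16 scale, i.e. 34 bytes per block. A rough estimate of mine, assuming every tensor is quantized and ignoring GGUF metadata:

```python
# GGML Q8_0 layout: blocks of 32 weights -> 32 int8 + one f16 scale = 34 bytes.
params = 1_300_000
q8_mb = (params / 32) * 34 / 1e6
print(f"~{q8_mb:.1f} MB vs ~{params * 4 / 1e6:.1f} MB for F32")
# ~1.4 MB vs ~5.2 MB for F32
```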

## PyTorch Reference

`localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
It is provided for verification, ablation, and downstream research – not
for end-user inference, which should go through the GGML build above. The
model definition lives under `pytorch/` in the
GitHub repository linked above.