LocalAI-io/LocalVQE

@@ -11,29 +11,18 @@ license: apache-2.0
 # LocalVQE — Local Voice Quality Enhancement
-Real-time joint acoustic echo cancellation (AEC), noise suppression (NS),
-and dereverberation for 16 kHz speech.
-**Derivative of DeepVQE.** The overall architecture (mic encoder, far-end
-encoder, alignment cross-attention, bottleneck, decoder, complex convolving
-mask) is based on **DeepVQE** (Indenbom et al., Interspeech 2023 — *DeepVQE:
-Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation,
-Noise Suppression and Dereverberation*,
-[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). This fork replaces
-STFT with a DCT-II analysis/synthesis filterbank for quantisation-aware
-local CPU/GPU inference, swaps the GRU bottleneck for S4D, and ships a
-streaming GGML graph so the model can run on an F32 CPU pipeline in under
-2 ms/frame at 16 kHz.
-**There is no LocalVQE paper.** Everything specific to this project
-(DCT-II in-graph, GGML runtime, training pipeline, AEC fine-tune recipe)
-is original to this repo.
-**Authors:**
-- Richard Palethorpe ([richiejp](https://github.com/richiejp))
-- Claude (Anthropic)
-**Project source:** <https://github.com/richiejp/LocalVQE>
 ## Files
@@ -45,7 +34,10 @@ is original to this repo.
 ## Usage (GGML / C++ / Go)
 ```bash
 cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
 ./build/bin/localvqe localvqe-v1-f32.gguf \
     --in-wav mic.wav ref.wav --out-wav enhanced.wav
 ```
@@ -53,32 +45,10 @@ cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
 Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
 16 kHz / 256-sample hop).
-## Validation
-Best checkpoint of the published v1 F32 weights (`localvqe-v1-f32.gguf`), evaluated
-on a 1 000-clip validation split mixing DNS5-synthesised scenes and ICASSP
-AEC Challenge synthetic data; AECMOS over a 100-clip sub-sample.
-| Metric | Overall | Single-talk far-end | Double-talk |
-|---|---:|---:|---:|
-| ERLE | — | **+52.2 dB** | — |
-| AECMOS echo (↑) | 4.36 | 4.46 | 4.33 |
-| AECMOS degradation (↑) | 4.83 | 5.00 | 4.78 |
-ERLE is only meaningful on single-talk far-end (mic = pure echo); overall
-and double-talk ERLE are confounded by near-end speech and are omitted.
-DNSMOS OVRL is deliberately omitted: our P.808 scores (~2.0) contradict
-informal listening of the single-talk far-end output, which is audibly
-near-silent. We suspect our DNSMOS invocation is miscalibrated for AEC
-outputs — particularly near-silent clips, which are out of distribution
-for a speech-quality predictor — and would rather withhold the numbers
-than publish misleading ones.
 ## Architecture
 | Component | Value |
-|---|---|
 | Sample rate | 16 kHz |
 | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
 | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
@@ -90,6 +60,16 @@ than publish misleading ones.
 | Kernel | (4, 4) time × freq, causal padding |
 | Parameters | ~0.9 M |
-## Licence
-Apache-2.0.

 # LocalVQE — Local Voice Quality Enhancement
+Real-time joint acoustic echo cancellation (AEC), noise suppression (NS), and
+dereverberation for 16 kHz speech. A from-scratch derivative of **DeepVQE**
+(Indenbom et al., Interspeech 2023 — *DeepVQE: Real Time Deep Voice Quality
+Enhancement*, [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)), redesigned
+for quantization-aware local CPU/GPU inference. The DCT-II analysis/synthesis
+filterbank (replacing STFT), S4D bottleneck, GGML streaming graph, and training
+pipeline are the work of this project; there is no LocalVQE paper yet.
+**Authors:** Richard Palethorpe ([richiejp](https://github.com/richiejp)) and
+Claude (Anthropic).
+Project source: <https://github.com/richiejp/LocalVQE>
 ## Files
 ## Usage (GGML / C++ / Go)
 ```bash
+# Build the ggml binary
 cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
+# Run inference on a 16 kHz WAV pair
 ./build/bin/localvqe localvqe-v1-f32.gguf \
     --in-wav mic.wav ref.wav --out-wav enhanced.wav
 ```
 Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
 16 kHz / 256-sample hop).
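The realtime factor quoted above follows from the hop duration: a 256-sample hop at 16 kHz covers 16 ms of audio, so a 1.66 ms frame time leaves a margin of roughly 9.6×. A minimal sketch of that arithmetic, where the function name `realtime_factor` is illustrative and not part of the project:

```python
def realtime_factor(sample_rate_hz: int, hop_samples: int, frame_time_ms: float) -> float:
    """Ratio of audio duration per hop to compute time per hop."""
    hop_ms = 1000.0 * hop_samples / sample_rate_hz  # 256 / 16000 s = 16 ms
    return hop_ms / frame_time_ms


if __name__ == "__main__":
    # Figures taken from the benchmark line above.
    print(f"{realtime_factor(16_000, 256, 1.66):.1f}x realtime")
```

Any value above 1.0 means a frame is processed faster than it is captured; the headroom is what remains for the rest of the audio pipeline.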
 ## Architecture
 | Component | Value |
+|-----------|-------|
 | Sample rate | 16 kHz |
 | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
 | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
 | Kernel | (4, 4) time × freq, causal padding |
 | Parameters | ~0.9 M |
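The "Analysis basis" row describes the DCT-II as a frozen Conv1d filterbank with 512 filters and stride 256. The NumPy sketch below illustrates that idea as strided framing followed by a fixed matrix multiply. It is not the project's implementation: the kernel length of 512 samples, the orthonormal scaling, and the absence of a window are assumptions, and the names `dct2_basis` and `analyse` are hypothetical.

```python
import numpy as np


def dct2_basis(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n); row k is the k-th filter."""
    k = np.arange(n)[:, None]   # filter (frequency) index
    t = np.arange(n)[None, :]   # sample index within the frame
    basis = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)    # DC row scaling makes the matrix orthonormal
    return basis


def analyse(signal: np.ndarray, n_filters: int = 512, stride: int = 256) -> np.ndarray:
    """Frame the signal with the given stride and project each frame onto the
    fixed DCT-II basis (assumed kernel length == n_filters, no window).

    Returns an array of shape (n_frames, n_filters).
    """
    n_frames = 1 + (len(signal) - n_filters) // stride
    frames = np.stack(
        [signal[i * stride : i * stride + n_filters] for i in range(n_frames)]
    )
    return frames @ dct2_basis(n_filters).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16_000)  # one second of audio at 16 kHz
    coeffs = analyse(x)
    print(coeffs.shape)
```

Because the coefficients are real, a basis like this avoids the complex arithmetic of an STFT. Under the orthonormal assumption, multiplying a coefficient frame by the same matrix (untransposed) recovers the original 512 samples of that frame.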
+## Upstream citation (DeepVQE)
+```bibtex
+@inproceedings{indenbom2023deepvqe,
+  title={{DeepVQE}: Real Time Deep Voice Quality Enhancement for Joint Acoustic
+         Echo Cancellation, Noise Suppression and Dereverberation},
+  author={Indenbom, Evgenii and Ristea, Nicolae-C{\u a}t{\u a}lin and
+          Saabas, Ando and P{\"a}rnamaa, Tanel and Gu{\v z}vin, Jegor and
+          Cutler, Ross},
+  booktitle={Interspeech},
+  year={2023},
+  doi={10.21437/Interspeech.2023-2176}
+}
+```