@@ -11,14 +11,29 @@ license: apache-2.0
 # LocalVQE — Local Voice Quality Enhancement
-Real-time joint acoustic echo cancellation (AEC), noise suppression (NS), and
-dereverberation for 16 kHz speech. Based on *LocalVQE: Real Time Deep Voice
-Quality Enhancement* (Indenbom et al., Interspeech 2023,
-[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)), redesigned for
-quantization-aware CPU inference with DCT-II analysis/synthesis in place of
-STFT.
-Project source: <https://github.com/richiejp/LocalVQE>
 ## Files
@@ -32,10 +47,7 @@ Project source: <https://github.com/richiejp/LocalVQE>
 ## Usage (GGML / C++ / Go)
 ```bash
-# Build the ggml binary
 cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
-# Run inference on a 16 kHz WAV pair
 ./build/bin/localvqe localvqe-baseline.gguf \
     --in-wav mic.wav ref.wav --out-wav enhanced.wav
 ```
@@ -46,24 +58,17 @@ Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
 ## Architecture
 | Component | Value |
-|-----------|-------|
 | Sample rate | 16 kHz |
 | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
 | Mic encoder | 5 blocks: 2 → 64 → 128 → 128 → 128 → 128 |
 | Far-end encoder | 2 blocks: 2 → 32 → 128 |
 | AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
-| Bottleneck | GRU(1152 → 576) + Linear(576 → 1152) |
 | Decoder | 5 sub-pixel conv + BN blocks: 128 → 128 → 128 → 64 → 64 |
-| CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
 | Parameters | ~8.0 M |
-## Citation
-```bibtex
-@inproceedings{indenbom2023localvqe,
-  title={{LocalVQE}: Real Time Deep Voice Quality Enhancement},
-  author={Indenbom, Evgenii and Beresford, Nicolae-C{\u a}t{\u a}lin and Adiban, Mohammad and Klej{\v c}ek, Ji{\v r}{\' i}},
-  booktitle={Interspeech},
-  year={2023}
-}
-```

 # LocalVQE — Local Voice Quality Enhancement
+Real-time joint acoustic echo cancellation (AEC), noise suppression (NS),
+and dereverberation for 16 kHz speech.
+**Derivative of DeepVQE.** The overall architecture (mic encoder, far-end
+encoder, alignment cross-attention, bottleneck, decoder, complex convolving
+mask) is based on **DeepVQE** (Indenbom et al., Interspeech 2023 — *DeepVQE:
+Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation,
+Noise Suppression and Dereverberation*,
+[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). This fork replaces
+STFT with a DCT-II analysis/synthesis filterbank for quantisation-aware
+local CPU/GPU inference, swaps the GRU bottleneck for S4D, and ships a
+streaming GGML graph so the model can run on an F32 CPU pipeline in under
+2 ms/frame at 16 kHz.
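As a rough check on the real-time claim, the sketch below works out the per-frame time budget. It assumes that one frame advances the audio by the 256-sample filterbank stride listed in the architecture table; `hop_ms` and `realtime_factor` are illustrative names, not part of the project.

```python
SAMPLE_RATE_HZ = 16_000  # sample rate stated in the README
STRIDE_SAMPLES = 256     # assumption: frame hop equals the filterbank stride


def hop_ms(stride=STRIDE_SAMPLES, sample_rate=SAMPLE_RATE_HZ):
    """Audio duration covered by one frame advance, in milliseconds."""
    return 1000.0 * stride / sample_rate


def realtime_factor(frame_wall_ms, stride=STRIDE_SAMPLES, sample_rate=SAMPLE_RATE_HZ):
    """How many times faster than real time a given per-frame wall time is."""
    return hop_ms(stride, sample_rate) / frame_wall_ms


print(hop_ms())              # 16.0
print(realtime_factor(2.0))  # 8.0
```

Under that assumption each frame has a 16 ms budget, so a 2 ms bound corresponds to at least 8× real time, which is consistent with the ~1.66 ms (9.6×) Zen4 figure quoted elsewhere in the README.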
+**There is no LocalVQE paper.** Everything specific to this project
+(DCT-II in-graph, GGML runtime, training pipeline, AEC fine-tune recipe)
+is original to this repo.
+**Authors:**
+- Richard Palethorpe ([richiejp](https://github.com/richiejp))
+- Claude (Anthropic)
+**Project source:** <https://github.com/richiejp/LocalVQE>
 ## Files
 ## Usage (GGML / C++ / Go)
 ```bash
 cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
 ./build/bin/localvqe localvqe-baseline.gguf \
     --in-wav mic.wav ref.wav --out-wav enhanced.wav
 ```
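Since the model targets 16 kHz speech, it may help to confirm the sample rate of both inputs before running the binary. The following is a minimal sketch using only the Python standard library. The README does not say whether the CLI resamples or rejects other rates, so treat this purely as a pre-check; `wav_sample_rate` and `check_pair` are hypothetical helper names.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # the README describes the model as operating on 16 kHz speech


def wav_sample_rate(path):
    """Read the sample rate from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def check_pair(mic_path, ref_path, expected=EXPECTED_RATE_HZ):
    """Return a list of problems; an empty list means both files match."""
    problems = []
    for label, path in (("mic", mic_path), ("ref", ref_path)):
        rate = wav_sample_rate(path)
        if rate != expected:
            problems.append(f"{label} ({path}): {rate} Hz, expected {expected} Hz")
    return problems
```

Calling `check_pair("mic.wav", "ref.wav")` returns an empty list when both files are at the expected rate, and one message per mismatching file otherwise.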
 ## Architecture
 | Component | Value |
+|---|---|
 | Sample rate | 16 kHz |
 | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
 | Mic encoder | 5 blocks: 2 → 64 → 128 → 128 → 128 → 128 |
 | Far-end encoder | 2 blocks: 2 → 32 → 128 |
 | AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
+| Bottleneck | S4D diagonal state-space |
 | Decoder | 5 sub-pixel conv + BN blocks: 128 → 128 → 128 → 64 → 64 |
+| CCM | 27-ch → 3x3 complex convolving mask (real-valued arithmetic) |
 | Parameters | ~8.0 M |
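To illustrate the "Analysis basis" row, here is a minimal pure-Python sketch of a DCT-II analysis filterbank applied as a strided, unpadded convolution. It is not the project's implementation. It assumes the kernel length equals the number of filters, orthonormal scaling, and no analysis window, none of which the README confirms. All function names are illustrative.

```python
import math


def dct2_basis(n):
    """Orthonormal DCT-II basis: n rows (filters), each n samples long."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n)])
    return rows


def analyse(signal, basis, stride):
    """Slide every filter over the signal, like a strided Conv1d without padding."""
    n = len(basis)
    frames = []
    for start in range(0, len(signal) - n + 1, stride):
        window = signal[start:start + n]
        frames.append([sum(w * x for w, x in zip(row, window)) for row in basis])
    return frames


def synthesise_frame(coeffs, basis):
    """Invert one frame with the transposed basis (valid because it is orthonormal)."""
    n = len(basis)
    return [sum(basis[k][i] * coeffs[k] for k in range(n)) for i in range(n)]
```

With the values from the table, `analyse(signal, dct2_basis(512), 256)` would produce one 512-coefficient frame for every 256 input samples. Because the coefficients are real, no complex arithmetic is needed at this stage. Reconstructing a continuous signal from the overlapping frames would additionally need a window and overlap-add step, which this sketch does not cover.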
+## Licence
+Apache-2.0.