richiejp committed 88876af · verified · 1 Parent(s): e43b08c

Fix fabricated citation; credit actual authors

Files changed (1)
  1. README.md +27 -22
README.md CHANGED
@@ -11,14 +11,29 @@ license: apache-2.0

  # LocalVQE — Local Voice Quality Enhancement

- Real-time joint acoustic echo cancellation (AEC), noise suppression (NS), and
- dereverberation for 16 kHz speech. Based on *LocalVQE: Real Time Deep Voice
- Quality Enhancement* (Indenbom et al., Interspeech 2023,
- [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)), redesigned for
- quantization-aware CPU inference with DCT-II analysis/synthesis in place of
- STFT.
+ Real-time joint acoustic echo cancellation (AEC), noise suppression (NS),
+ and dereverberation for 16 kHz speech.

- Project source: <https://github.com/richiejp/LocalVQE>
+ **Derivative of DeepVQE.** The overall architecture (mic encoder, far-end
+ encoder, alignment cross-attention, bottleneck, decoder, complex convolving
+ mask) is based on **DeepVQE** (Indenbom et al., Interspeech 2023 — *DeepVQE:
+ Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation,
+ Noise Suppression and Dereverberation*,
+ [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). This fork replaces
+ STFT with a DCT-II analysis/synthesis filterbank for quantisation-aware
+ local CPU/GPU inference, swaps the GRU bottleneck for S4D, and ships a
+ streaming GGML graph so the model can run on an F32 CPU pipeline in under
+ 2 ms/frame at 16 kHz.
+
+ **There is no LocalVQE paper.** Everything specific to this project
+ (DCT-II in-graph, GGML runtime, training pipeline, AEC fine-tune recipe)
+ is original to this repo.
+
+ **Authors:**
+ - Richard Palethorpe ([richiejp](https://github.com/richiejp))
+ - Claude (Anthropic)
+
+ **Project source:** <https://github.com/richiejp/LocalVQE>

  ## Files

@@ -32,10 +47,7 @@ Project source: <https://github.com/richiejp/LocalVQE>
  ## Usage (GGML / C++ / Go)

  ```bash
- # Build the ggml binary
  cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
-
- # Run inference on a 16 kHz WAV pair
  ./build/bin/localvqe localvqe-baseline.gguf \
  --in-wav mic.wav ref.wav --out-wav enhanced.wav
  ```
@@ -46,24 +58,17 @@ Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
  ## Architecture

  | Component | Value |
- |-----------|-------|
+ |---|---|
  | Sample rate | 16 kHz |
  | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
  | Mic encoder | 5 blocks: 2 → 64 → 128 → 128 → 128 → 128 |
  | Far-end encoder | 2 blocks: 2 → 32 → 128 |
  | AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
- | Bottleneck | GRU(1152 → 576) + Linear(576 → 1152) |
+ | Bottleneck | S4D diagonal state-space |
  | Decoder | 5 sub-pixel conv + BN blocks: 128 → 128 → 128 → 64 → 64 |
- | CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
+ | CCM | 27-ch → 3x3 complex convolving mask (real-valued arithmetic) |
  | Parameters | ~8.0 M |

- ## Citation
+ ## Licence

- ```bibtex
- @inproceedings{indenbom2023localvqe,
- title={{LocalVQE}: Real Time Deep Voice Quality Enhancement},
- author={Indenbom, Evgenii and Beresford, Nicolae-C{\u a}t{\u a}lin and Adiban, Mohammad and Klej{\v c}ek, Ji{\v r}{\' i}},
- booktitle={Interspeech},
- year={2023}
- }
- ```
+ Apache-2.0.
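
For readers checking the figures in this README, here is a minimal Python sketch; it is not code from the repository. It assumes the DCT-II analysis step is a strided correlation of the 16 kHz signal with orthonormal DCT-II rows, and that the kernel length equals the 512 filters listed in the Architecture table (the diff does not state the kernel length). All function names are hypothetical. The last two helpers reproduce the frame arithmetic behind the quoted per-frame wall time: a stride of 256 samples at 16 kHz is 16 ms of audio per frame.

```python
import math

SAMPLE_RATE_HZ = 16_000  # from the Architecture table
N_FILTERS = 512          # from the Architecture table
STRIDE = 256             # from the Architecture table
FRAME_LEN = 512          # ASSUMPTION: kernel length equals the filter count


def dct2_basis(n_filters, frame_len):
    """Return DCT-II rows: basis[k][n] = s_k * cos(pi / N * (n + 0.5) * k).

    With n_filters == frame_len the rows are orthonormal.
    """
    rows = []
    for k in range(n_filters):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / frame_len)
        rows.append([
            scale * math.cos(math.pi / frame_len * (n + 0.5) * k)
            for n in range(frame_len)
        ])
    return rows


def analyse(signal, basis, frame_len, stride):
    """Correlate each strided frame with every basis row.

    This is what a frozen, bias-free Conv1d with the basis as weights computes
    (ASSUMPTION: no extra window or padding is applied).
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, stride):
        chunk = signal[start:start + frame_len]
        frames.append([sum(w * x for w, x in zip(row, chunk)) for row in basis])
    return frames


def frame_ms(stride, sample_rate_hz):
    """Milliseconds of audio consumed per frame."""
    return 1000.0 * stride / sample_rate_hz


def realtime_factor(stride, sample_rate_hz, wall_ms_per_frame):
    """Audio duration per frame divided by compute time per frame."""
    return frame_ms(stride, sample_rate_hz) / wall_ms_per_frame
```

With the README's values, `frame_ms(STRIDE, SAMPLE_RATE_HZ)` gives 16.0 ms per frame, and `realtime_factor(STRIDE, SAMPLE_RATE_HZ, 1.66)` comes out at roughly 9.6, which is consistent with the figure quoted in the third hunk header.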