richiejp committed on
Commit 97bc2da · verified · 1 Parent(s): cf2ce52

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +27 -47
README.md CHANGED
@@ -11,29 +11,18 @@ license: apache-2.0
 
 # LocalVQE - Local Voice Quality Enhancement
 
-Real-time joint acoustic echo cancellation (AEC), noise suppression (NS),
-and dereverberation for 16 kHz speech.
+Real-time joint acoustic echo cancellation (AEC), noise suppression (NS), and
+dereverberation for 16 kHz speech. A from-scratch derivative of **DeepVQE**
+(Indenbom et al., Interspeech 2023 - *DeepVQE: Real Time Deep Voice Quality
+Enhancement*, [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)), redesigned
+for quantization-aware local CPU/GPU inference. The DCT-II analysis/synthesis
+(replacing STFT), S4D bottleneck, GGML streaming graph, and training pipeline
+are work of this project - no paper yet.
 
-**Derivative of DeepVQE.** The overall architecture (mic encoder, far-end
-encoder, alignment cross-attention, bottleneck, decoder, complex convolving
-mask) is based on **DeepVQE** (Indenbom et al., Interspeech 2023 - *DeepVQE:
-Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation,
-Noise Suppression and Dereverberation*,
-[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). This fork replaces
-STFT with a DCT-II analysis/synthesis filterbank for quantisation-aware
-local CPU/GPU inference, swaps the GRU bottleneck for S4D, and ships a
-streaming GGML graph so the model can run on an F32 CPU pipeline in under
-2 ms/frame at 16 kHz.
+**Authors:** Richard Palethorpe ([richiejp](https://github.com/richiejp)) and
+Claude (Anthropic).
 
-**There is no LocalVQE paper.** Everything specific to this project
-(DCT-II in-graph, GGML runtime, training pipeline, AEC fine-tune recipe)
-is original to this repo.
-
-**Authors:**
-- Richard Palethorpe ([richiejp](https://github.com/richiejp))
-- Claude (Anthropic)
-
-**Project source:** <https://github.com/richiejp/LocalVQE>
+Project source: <https://github.com/richiejp/LocalVQE>
 
 ## Files
 
@@ -45,7 +34,10 @@ is original to this repo.
 ## Usage (GGML / C++ / Go)
 
 ```bash
+# Build the ggml binary
 cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
+
+# Run inference on a 16 kHz WAV pair
 ./build/bin/localvqe localvqe-v1-f32.gguf \
 --in-wav mic.wav ref.wav --out-wav enhanced.wav
 ```
@@ -53,32 +45,10 @@ cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
 Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
 16 kHz / 256-sample hop).
 
-## Validation
-
-Best checkpoint of the published v1 F32 weights (`localvqe-v1-f32.gguf`), evaluated
-on a 1 000-clip validation split mixing DNS5-synthesised scenes and ICASSP
-AEC Challenge synthetic data; AECMOS over a 100-clip sub-sample.
-
-| Metric | Overall | Single-talk far-end | Double-talk |
-|---|---:|---:|---:|
-| ERLE | - | **+52.2 dB** | - |
-| AECMOS echo (↑) | 4.36 | 4.46 | 4.33 |
-| AECMOS degradation (↑) | 4.83 | 5.00 | 4.78 |
-
-ERLE is only meaningful on single-talk far-end (mic = pure echo); overall
-and double-talk ERLE are confounded by near-end speech and are omitted.
-
-DNSMOS OVRL is deliberately omitted: our P.808 scores (~2.0) contradict
-informal listening of the single-talk far-end output, which is audibly
-near-silent. We suspect our DNSMOS invocation is miscalibrated for AEC
-outputs - particularly near-silent clips, which are out of distribution
-for a speech-quality predictor - and would rather withhold the numbers
-than publish misleading ones.
-
 ## Architecture
 
 | Component | Value |
-|---|---|
+|-----------|-------|
 | Sample rate | 16 kHz |
 | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
 | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
@@ -90,6 +60,16 @@ than publish misleading ones.
 | Kernel | (4, 4) time × freq, causal padding |
 | Parameters | ~0.9 M |
 
-## Licence
-
-Apache-2.0.
+## Upstream citation (DeepVQE)
+
+```bibtex
+@inproceedings{indenbom2023deepvqe,
+title={{DeepVQE}: Real Time Deep Voice Quality Enhancement for Joint Acoustic
+Echo Cancellation, Noise Suppression and Dereverberation},
+author={Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u a}t{\u a}lin and
+Chernov, Mykola and Aichner, Robert},
+booktitle={Interspeech},
+year={2023},
+doi={10.21437/Interspeech.2023-2176}
+}
+```