Fix fabricated citation; credit actual authors
README.md
CHANGED
|
@@ -39,35 +39,56 @@ is original to this repo.
|
|
| 39 |
|
| 40 |
| File | Size | Description |
|
| 41 |
|---|---|---|
|
| 42 |
-
| `
|
| 43 |
-
| `
|
| 44 |
-
| `localvqe-baseline.gguf` | 5 MB | GGML F32 export of `baseline.pt` (BN-folded, DCT weights embedded). |
|
| 45 |
-
| `localvqe-finetune.gguf` | 5 MB | GGML F32 export of `finetune.pt`. |
|
| 46 |
|
| 47 |
## Usage (GGML / C++ / Go)
|
| 48 |
|
| 49 |
```bash
|
| 50 |
cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
|
| 51 |
-
./build/bin/localvqe localvqe-
|
| 52 |
--in-wav mic.wav ref.wav --out-wav enhanced.wav
|
| 53 |
```
|
| 54 |
|
| 55 |
Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
|
| 56 |
16 kHz / 256-sample hop).
|
| 57 |
|
| 58 |
## Architecture
|
| 59 |
|
| 60 |
| Component | Value |
|
| 61 |
|---|---|
|
| 62 |
| Sample rate | 16 kHz |
|
| 63 |
| Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
|
| 64 |
-
| Mic encoder | 5 blocks: 2 →
|
| 65 |
-
| Far-end encoder | 2 blocks: 2 → 32 →
|
| 66 |
| AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
|
| 67 |
-
| Bottleneck | S4D diagonal state-space |
|
| 68 |
-
| Decoder | 5 sub-pixel conv + BN blocks
|
| 69 |
-
| CCM | 27-ch →
|
| 70 |
-
|
|
|
|
|
| 71 |
|
| 72 |
## Licence
|
| 73 |
|
|
|
|
| 39 |
|
| 40 |
| File | Size | Description |
|
| 41 |
|---|---|---|
|
| 42 |
+
| `localvqe-v1.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + AEC Challenge fine-tune. |
|
| 43 |
+
| `localvqe-v1-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). |
|
|
|
|
|
|
|
| 44 |
|
| 45 |
## Usage (GGML / C++ / Go)
|
| 46 |
|
| 47 |
```bash
|
| 48 |
cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
|
| 49 |
+
./build/bin/localvqe localvqe-v1-f32.gguf \
|
| 50 |
--in-wav mic.wav ref.wav --out-wav enhanced.wav
|
| 51 |
```
|
| 52 |
|
| 53 |
Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
|
| 54 |
16 kHz / 256-sample hop).
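
The realtime factor quoted above follows from the hop size: one 256-sample hop at 16 kHz covers 16 ms of audio, so a 1.66 ms frame time leaves roughly a 9.6× margin. A minimal sketch of that arithmetic (the function name is illustrative and not part of this repo):

```python
SAMPLE_RATE_HZ = 16_000      # sample rate stated in the README
HOP_SAMPLES = 256            # hop size stated in the README
FRAME_WALL_TIME_MS = 1.66    # per-frame wall time quoted above


def realtime_factor(wall_time_ms: float, hop_samples: int, sample_rate_hz: int) -> float:
    """Return how many times faster than realtime one frame is processed."""
    # Audio duration covered by one hop, in milliseconds.
    frame_duration_ms = 1000.0 * hop_samples / sample_rate_hz
    return frame_duration_ms / wall_time_ms


if __name__ == "__main__":
    factor = realtime_factor(FRAME_WALL_TIME_MS, HOP_SAMPLES, SAMPLE_RATE_HZ)
    print(f"{factor:.1f}x realtime")
```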
|
| 55 |
|
| 56 |
+
## Validation
|
| 57 |
+
|
| 58 |
+
The published v1 F32 weights (`localvqe-v1-f32.gguf`, the best checkpoint) were evaluated
|
| 59 |
+
on a 1 000-clip validation split mixing DNS5-synthesised scenes and ICASSP
|
| 60 |
+
AEC Challenge synthetic data; AECMOS was computed over a 100-clip sub-sample.
|
| 61 |
+
|
| 62 |
+
| Metric | Overall | Single-talk far-end | Double-talk |
|
| 63 |
+
|---|---:|---:|---:|
|
| 64 |
+
| ERLE | n/a | **+52.2 dB** | n/a |
|
| 65 |
+
| AECMOS echo (↑) | 4.36 | 4.46 | 4.33 |
|
| 66 |
+
| AECMOS degradation (↑) | 4.83 | 5.00 | 4.78 |
|
| 67 |
+
|
| 68 |
+
ERLE is only meaningful on single-talk far-end (mic = pure echo); overall
|
| 69 |
+
and double-talk ERLE are confounded by near-end speech and are omitted.
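
For reference, ERLE on a single-talk far-end clip can be sketched as the ratio of microphone power to enhanced-output power, in dB. The README does not spell out the exact formula or averaging used for the table, so the whole-clip definition below is an assumption, and `erle_db` is a hypothetical helper rather than a function from this repo:

```python
import math


def erle_db(mic, enhanced, eps=1e-12):
    """Echo return loss enhancement in dB over a whole clip.

    Assumes `mic` contains only echo (single-talk far-end), so any energy
    left in `enhanced` is residual echo.
    """
    mic_power = sum(x * x for x in mic) / len(mic)
    out_power = sum(y * y for y in enhanced) / len(enhanced)
    # eps guards against log(0) when the output is digitally silent.
    return 10.0 * math.log10((mic_power + eps) / (out_power + eps))


if __name__ == "__main__":
    # Toy signals: the "enhanced" output is the echo scaled by 0.01,
    # i.e. attenuated by 40 dB by construction.
    echo = [math.sin(0.05 * n) for n in range(16000)]
    residual = [0.01 * x for x in echo]
    print(f"{erle_db(echo, residual):.1f} dB")
```

With near-end speech present, the numerator and denominator both contain speech energy, which is why the figure stops measuring echo suppression in the overall and double-talk columns.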
|
| 70 |
+
|
| 71 |
+
DNSMOS OVRL is deliberately omitted: our P.808 scores (~2.0) contradict
|
| 72 |
+
informal listening of the single-talk far-end output, which is audibly
|
| 73 |
+
near-silent. We suspect our DNSMOS invocation is miscalibrated for AEC
|
| 74 |
+
outputs (particularly near-silent clips, which are out of distribution
|
| 75 |
+
for a speech-quality predictor) and would rather withhold the numbers
|
| 76 |
+
than publish misleading ones.
|
| 77 |
+
|
| 78 |
## Architecture
|
| 79 |
|
| 80 |
| Component | Value |
|
| 81 |
|---|---|
|
| 82 |
| Sample rate | 16 kHz |
|
| 83 |
| Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
|
| 84 |
+
| Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
|
| 85 |
+
| Far-end encoder | 2 blocks: 2 → 32 → 40 |
|
| 86 |
| AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
|
| 87 |
+
| Bottleneck | S4D diagonal state-space, hidden 162 |
|
| 88 |
+
| Decoder | 5 sub-pixel conv + BN blocks, mirroring encoder |
|
| 89 |
+
| CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
|
| 90 |
+
| Kernel | (4, 4) time × freq, causal padding |
|
| 91 |
+
| Parameters | ~0.9 M |
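
To illustrate the "Analysis basis" row, the sketch below builds a DCT-II basis and applies it as a strided filterbank, which is what a frozen Conv1d with fixed weights computes. Several details are assumptions not stated in the table: each filter is taken to be 512 samples long (equal to the filter count), the scaling is orthonormal, and no analysis window is applied. The function names are illustrative, and plain Python is used in place of PyTorch to keep the example dependency-free.

```python
import math


def dct2_basis(n: int):
    """Orthonormal DCT-II basis: n filters, each of length n (assumed scaling)."""
    basis = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append([scale * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)])
    return basis


def analyze(signal, basis, stride):
    """Strided correlation of the signal with every filter, one frame per hop."""
    n = len(basis[0])
    frames = []
    for start in range(0, len(signal) - n + 1, stride):
        window = signal[start:start + n]
        frames.append([sum(w * x for w, x in zip(filt, window)) for filt in basis])
    return frames


if __name__ == "__main__":
    basis = dct2_basis(512)                      # 512 filters, as in the table
    tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(2048)]
    frames = analyze(tone, basis, stride=256)    # stride 256, as in the table
    print(len(frames), "frames of", len(frames[0]), "coefficients")
```

Because the basis is orthonormal under these assumptions, each frame of coefficients carries the same energy as the 512-sample window it was computed from.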
|
| 92 |
|
| 93 |
## Licence
|
| 94 |
|