Commit cf2ce52 by richiejp · verified · 1 parent: 88876af

Fix fabricated citation; credit actual authors

Files changed (1)
  1. README.md +32 -11
README.md CHANGED
@@ -39,35 +39,56 @@ is original to this repo.
39
 
40
  | File | Size | Description |
41
  |---|---|---|
42
- | `baseline.pt` | 11 MB | DNS5-trained baseline (val_loss 0.0213, ERLE 14.9 dB). |
43
- | `finetune.pt` | 11 MB | AEC fine-tune (1 epoch on AEC Challenge synthetic set). |
44
- | `localvqe-baseline.gguf` | 5 MB | GGML F32 export of `baseline.pt` (BN-folded, DCT weights embedded). |
45
- | `localvqe-finetune.gguf` | 5 MB | GGML F32 export of `finetune.pt`. |
46
 
47
  ## Usage (GGML / C++ / Go)
48
 
49
  ```bash
50
  cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
51
- ./build/bin/localvqe localvqe-baseline.gguf \
52
  --in-wav mic.wav ref.wav --out-wav enhanced.wav
53
  ```
54
 
55
  Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
56
  16 kHz / 256-sample hop).
57
 
58
  ## Architecture
59
 
60
  | Component | Value |
61
  |---|---|
62
  | Sample rate | 16 kHz |
63
  | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
64
- | Mic encoder | 5 blocks: 2 → 64 → 128 → 128 → 128 → 128 |
65
- | Far-end encoder | 2 blocks: 2 → 32 → 128 |
66
  | AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
67
- | Bottleneck | S4D diagonal state-space |
68
- | Decoder | 5 sub-pixel conv + BN blocks: 128 → 128 → 128 → 64 → 64 |
69
- | CCM | 27-ch → 3x3 complex convolving mask (real-valued arithmetic) |
70
- | Parameters | ~8.0 M |
 
71
 
72
  ## Licence
73
 
 
39
 
40
  | File | Size | Description |
41
  |---|---|---|
42
+ | `localvqe-v1.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + AEC Challenge fine-tune. |
43
+ | `localvqe-v1-f32.gguf` | 5 MB | GGML F32 export (BN-folded, DCT weights embedded). |
 
 
44
 
45
  ## Usage (GGML / C++ / Go)
46
 
47
  ```bash
48
  cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
49
+ ./build/bin/localvqe localvqe-v1-f32.gguf \
50
  --in-wav mic.wav ref.wav --out-wav enhanced.wav
51
  ```
52
 
53
  Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
54
  16 kHz / 256-sample hop).
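
The realtime factor quoted above is the audio duration covered by one hop divided by the per-frame wall time. A minimal sketch of that arithmetic, using only the figures stated here (the function name `realtime_factor` is illustrative and not part of the repo):

```python
def realtime_factor(sample_rate_hz: int, hop_samples: int, frame_time_ms: float) -> float:
    """Return how many times faster than realtime one frame is processed."""
    # Audio covered by a single hop, in milliseconds (256 / 16000 s = 16 ms).
    hop_ms = 1000.0 * hop_samples / sample_rate_hz
    return hop_ms / frame_time_ms


print(f"{realtime_factor(16_000, 256, 1.66):.1f}x realtime")
```

With a 16 ms hop and ~1.66 ms per frame, this gives roughly 9.6, matching the stated figure.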
55
 
56
+ ## Validation
57
+
58
+ Best checkpoint of the published v1 F32 weights (`localvqe-v1-f32.gguf`), evaluated
59
+ on a 1 000-clip validation split mixing DNS5-synthesised scenes and ICASSP
60
+ AEC Challenge synthetic data; AECMOS over a 100-clip sub-sample.
61
+
62
+ | Metric | Overall | Single-talk far-end | Double-talk |
63
+ |---|---:|---:|---:|
64
+ | ERLE | n/a | **+52.2 dB** | n/a |
65
+ | AECMOS echo (↑) | 4.36 | 4.46 | 4.33 |
66
+ | AECMOS degradation (↑) | 4.83 | 5.00 | 4.78 |
67
+
68
+ ERLE is only meaningful on single-talk far-end (mic = pure echo); overall
69
+ and double-talk ERLE are confounded by near-end speech and are omitted.
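
For reference, ERLE is conventionally the ratio of microphone power to residual output power, in dB. The sketch below uses that textbook definition over a whole clip; it is an assumption that the evaluation behind the table works this way (it may, for example, average per segment), and `erle_db` is a hypothetical helper, not a function from this repo.

```python
import math


def erle_db(mic, enhanced, eps=1e-12):
    """Echo return loss enhancement in dB: mic power over residual power.

    Only meaningful when `mic` contains pure echo (single-talk far-end).
    `eps` guards against division by zero on fully silent output.
    """
    mic_power = sum(x * x for x in mic) / len(mic)
    out_power = sum(y * y for y in enhanced) / len(enhanced)
    return 10.0 * math.log10((mic_power + eps) / (out_power + eps))


# Toy check: attenuating the echo amplitude 100x yields about 40 dB of ERLE.
echo = [math.sin(0.1 * n) for n in range(16_000)]
residual = [0.01 * x for x in echo]
print(round(erle_db(echo, residual), 1))
```

Because the numerator includes everything at the microphone, any near-end speech inflates or deflates the ratio, which is why the double-talk and overall cells are left empty.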
70
+
71
+ DNSMOS OVRL is deliberately omitted: our P.808 scores (~2.0) contradict
72
+ informal listening of the single-talk far-end output, which is audibly
73
+ near-silent. We suspect our DNSMOS invocation is miscalibrated for AEC
74
+ outputs (particularly near-silent clips, which are out of distribution
75
+ for a speech-quality predictor) and would rather withhold the numbers
76
+ than publish misleading ones.
77
+
78
  ## Architecture
79
 
80
  | Component | Value |
81
  |---|---|
82
  | Sample rate | 16 kHz |
83
  | Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
84
+ | Mic encoder | 5 blocks: 2 → 32 → 40 → 40 → 40 → 40 |
85
+ | Far-end encoder | 2 blocks: 2 → 32 → 40 |
86
  | AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
87
+ | Bottleneck | S4D diagonal state-space, hidden 162 |
88
+ | Decoder | 5 sub-pixel conv + BN blocks, mirroring encoder |
89
+ | CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
90
+ | Kernel | (4, 4) time × freq, causal padding |
91
+ | Parameters | ~0.9 M |
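
To illustrate the "Analysis basis" row: a frozen, bias-free Conv1d with 512 filters of width 512 and stride 256 amounts to taking overlapping 512-sample frames and projecting each onto a fixed DCT-II basis. The sketch below is plain Python rather than the repo's code, and it assumes orthonormal scaling, no analysis window, and no padding, none of which the table specifies. `dct2_basis` and `analyse` are hypothetical names.

```python
import math


def dct2_basis(n_filters):
    """Orthonormal DCT-II basis: one row per filter, each of length n_filters."""
    basis = []
    for k in range(n_filters):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_filters)
        basis.append([scale * math.cos(math.pi * (n + 0.5) * k / n_filters)
                      for n in range(n_filters)])
    return basis


def analyse(signal, basis, stride):
    """Strided correlation of the signal with every basis row,
    which is what a bias-free Conv1d with these weights computes."""
    width = len(basis[0])
    frames = []
    for start in range(0, len(signal) - width + 1, stride):
        window = signal[start:start + width]
        frames.append([sum(w * x for w, x in zip(row, window)) for row in basis])
    return frames


basis = dct2_basis(512)
# 1024 samples of a 440 Hz tone at 16 kHz -> frames starting at 0, 256, 512.
signal = [math.sin(2 * math.pi * 440 * t / 16_000) for t in range(1024)]
coeffs = analyse(signal, basis, stride=256)
print(len(coeffs), len(coeffs[0]))
```

Since the basis is fixed, it can be stored alongside the weights rather than learned, which is consistent with the "DCT weights embedded" note on the GGUF export.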
92
 
93
  ## Licence
94