richiejp committed on
Commit d15c8fe · verified · 1 Parent(s): 29a5a0d

Sync model card with upstream GitHub inference README

Files changed (1)
  1. README.md +9 -35
README.md CHANGED
@@ -114,30 +114,9 @@ friendly, but quantized weights have not yet been calibrated and released.

  ## Validation Results

- ### Synthetic validation split
-
- 500 clips drawn from DNS5 + AEC Challenge synthetic data, stratified
- across 15 scenario cells. AECMOS over a 100-clip sub-sample per the
- standard AEC Challenge protocol.
-
- | Metric | Value |
- |---|---:|
- | ERLE (dB) | 11.4 |
- | AECMOS echo (↑, 1–5) | 3.83 |
- | AECMOS degradation (↑, 1–5) | 4.04 |
-
- - **ERLE** (Echo Return Loss Enhancement) — `10·log10(E[mic²] / E[enh²])`
- averaged across scenarios. On scenes with active near-end speech both
- numerator and denominator are dominated by speech, so the absolute
- value understates echo-only removal.
- - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
- quality predictor. "Echo" rates how well echo was removed; "degradation"
- rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
-
- ### AEC Challenge 2022 blind set (real recordings)
-
  Stratified 150-sample eval (30 per scenario) on the
- [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge).
+ [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
+ — real recordings, not synthetic mixes.

  | Scenario | AECMOS echo | AECMOS deg | blind ERLE |
  |---|---:|---:|---:|
@@ -147,18 +126,13 @@ Stratified 150-sample eval (30 per scenario) on the
  | farend-singletalk-with-movement | 4.26 | 4.82 | 48.2 dB |
  | nearend-singletalk | 4.95 | 3.98 | 4.2 dB |

- ### Why DNSMOS OVRL is not reported here
-
- We track DNSMOS P.808 (`sig_bak_ovr.onnx`) in TensorBoard but are deliberately
- *not* publishing OVRL numbers for this model. The scores we obtain (around 2.0
- overall, 2.1 on single-talk far-end) contradict informal listening
- single-talk far-end with ~48 dB of cancellation is audibly near-silent, not a
- "2-out-of-5" output. We suspect our DNSMOS invocation (input normalisation,
- silence handling, or ONNX model variant) is miscalibrated for AEC outputs
- and in particular for near-silent clips, which are out of distribution for a
- speech-quality predictor. Until we can reconcile the numbers with a
- DeepVQE-matching protocol we consider our OVRL numbers untrustworthy and
- omit them rather than publish misleading figures.
+ - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
+ quality predictor. "Echo" rates how well echo was removed; "degradation"
+ rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
+ - **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
+ far-end single-talk where the input is echo-only; on scenes with active
+ near-end speech it understates echo removal because both numerator and
+ denominator are dominated by speech.

  ## Architecture

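
For readers of this change, the blind ERLE definition added to the README, `10·log10(E[mic²] / E[enh²])`, can be illustrated with a minimal pure-Python sketch. This is not the repository's evaluation code: the function name `blind_erle_db`, the `eps` guard, and the toy sine-wave signals are all assumptions made for illustration. The second case shows why the README calls the metric meaningful only on far-end single-talk: with near-end speech present, the same 20 dB of echo attenuation yields a much smaller number.

```python
import math


def blind_erle_db(mic, enh, eps=1e-12):
    """Blind ERLE in dB: 10*log10(E[mic^2] / E[enh^2]).

    `mic` and `enh` are equal-length sequences of samples (hypothetical
    inputs; a real pipeline would load audio). `eps` is an assumed guard
    against division by zero on silent output.
    """
    if len(mic) != len(enh) or len(mic) == 0:
        raise ValueError("mic and enh must be non-empty and of equal length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(y * y for y in enh) / len(enh)
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


# Toy far-end single-talk case: the microphone signal is echo only,
# and the enhanced output is that echo scaled by 0.1 in amplitude.
echo = [0.3 * math.sin(0.05 * n) for n in range(1000)]
erle_echo_only = blind_erle_db(echo, [0.1 * e for e in echo])

# Toy scene with near-end speech: the same echo attenuation is applied,
# but speech dominates both the numerator and the denominator.
speech = [math.sin(0.31 * n + 1.0) for n in range(1000)]
mic_mixed = [s + e for s, e in zip(speech, echo)]
enh_mixed = [s + 0.1 * e for s, e in zip(speech, echo)]
erle_with_speech = blind_erle_db(mic_mixed, enh_mixed)

print(f"echo-only ERLE:   {erle_echo_only:.2f} dB")
print(f"with-speech ERLE: {erle_with_speech:.2f} dB")
```

An amplitude scaling of 0.1 is a power ratio of 100, so the echo-only case comes out at about 20 dB, while the mixed case stays within a few dB of zero even though the echo was attenuated identically.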