Sync model card with upstream GitHub inference README
README.md
CHANGED
|
@@ -114,30 +114,9 @@ friendly, but quantized weights have not yet been calibrated and released.
|
|
| 114 |
|
| 115 |
## Validation Results
|
| 116 |
|
| 117 |
-
### Synthetic validation split
|
| 118 |
-
|
| 119 |
-
500 clips drawn from DNS5 + AEC Challenge synthetic data, stratified
|
| 120 |
-
across 15 scenario cells. AECMOS over a 100-clip sub-sample per the
|
| 121 |
-
standard AEC Challenge protocol.
|
| 122 |
-
|
| 123 |
-
| Metric | Value |
|
| 124 |
-
|---|---:|
|
| 125 |
-
| ERLE (dB) | 11.4 |
|
| 126 |
-
| AECMOS echo (↑, 1–5) | 3.83 |
|
| 127 |
-
| AECMOS degradation (↑, 1–5) | 4.04 |
|
| 128 |
-
|
| 129 |
-
- **ERLE** (Echo Return Loss Enhancement) — `10·log10(E[mic²] / E[enh²])`
|
| 130 |
-
averaged across scenarios. On scenes with active near-end speech both
|
| 131 |
-
numerator and denominator are dominated by speech, so the absolute
|
| 132 |
-
value understates echo-only removal.
|
| 133 |
-
- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
|
| 134 |
-
quality predictor. "Echo" rates how well echo was removed; "degradation"
|
| 135 |
-
rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
|
| 136 |
-
|
| 137 |
-
### AEC Challenge 2022 blind set (real recordings)
|
| 138 |
-
|
| 139 |
Stratified 150-sample eval (30 per scenario) on the
|
| 140 |
-
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
|
|
|
|
| 141 |
|
| 142 |
| Scenario | AECMOS echo | AECMOS deg | blind ERLE |
|
| 143 |
|---|---:|---:|---:|
|
|
@@ -147,18 +126,13 @@ Stratified 150-sample eval (30 per scenario) on the
|
|
| 147 |
| farend-singletalk-with-movement | 4.26 | 4.82 | 48.2 dB |
|
| 148 |
| nearend-singletalk | 4.95 | 3.98 | 4.2 dB |
|
| 149 |
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
*
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
silence handling, or ONNX model variant) is miscalibrated for AEC outputs
|
| 158 |
-
and in particular for near-silent clips, which are out of distribution for a
|
| 159 |
-
speech-quality predictor. Until we can reconcile the numbers with a
|
| 160 |
-
DeepVQE-matching protocol we consider our OVRL numbers untrustworthy and
|
| 161 |
-
omit them rather than publish misleading figures.
|
| 162 |
|
| 163 |
## Architecture
|
| 164 |
|
|
|
|
| 114 |
|
| 115 |
## Validation Results
|
| 116 |
|
| 117 |
Stratified 150-sample eval (30 per scenario) on the
|
| 118 |
+
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
|
| 119 |
+
— real recordings, not synthetic mixes.
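One way to draw a stratified subset like the one described above (30 clips per scenario, 150 in total) is sketched below. This is an assumption-laden illustration, not the upstream selection script: the function name `stratified_sample`, the `(clip_id, scenario)` input shape, the fixed seed, and the placeholder scenario names `scenario-c` to `scenario-e` are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(clips, per_scenario=30, seed=0):
    """Pick `per_scenario` clip ids from each scenario.

    `clips` is an iterable of (clip_id, scenario) pairs (hypothetical layout).
    Returns a dict mapping scenario -> list of sampled clip ids.
    """
    by_scenario = defaultdict(list)
    for clip_id, scenario in clips:
        by_scenario[scenario].append(clip_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    picked = {}
    for scenario in sorted(by_scenario):
        pool = sorted(by_scenario[scenario])  # sort so input order does not matter
        if len(pool) < per_scenario:
            raise ValueError(f"{scenario}: only {len(pool)} clips available")
        picked[scenario] = rng.sample(pool, per_scenario)
    return picked


# Toy catalogue: 5 scenarios with 40 clips each. Only the first two names
# appear in the table; the others are placeholders.
scenarios = [
    "farend-singletalk-with-movement",
    "nearend-singletalk",
    "scenario-c",
    "scenario-d",
    "scenario-e",
]
catalogue = [(f"{s}/{i:03d}.wav", s) for s in scenarios for i in range(40)]
subset = stratified_sample(catalogue)
total = sum(len(ids) for ids in subset.values())
```

Sampling the same number of clips from every scenario keeps the per-scenario rows of the table comparable, since each row then averages over the same number of recordings.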
|
| 120 |
|
| 121 |
| Scenario | AECMOS echo | AECMOS deg | blind ERLE |
|
| 122 |
|---|---:|---:|---:|
|
|
|
|
| 126 |
| farend-singletalk-with-movement | 4.26 | 4.82 | 48.2 dB |
|
| 127 |
| nearend-singletalk | 4.95 | 3.98 | 4.2 dB |
|
| 128 |
|
| 129 |
+
- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
|
| 130 |
+
quality predictor. "Echo" rates how well echo was removed; "degradation"
|
| 131 |
+
rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
|
| 132 |
+
- **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
|
| 133 |
+
far-end single-talk where the input is echo-only; on scenes with active
|
| 134 |
+
near-end speech it understates echo removal because both numerator and
|
| 135 |
+
denominator are dominated by speech.
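As a rough illustration of the blind ERLE formula above, here is a minimal Python sketch. The function name `blind_erle_db`, the `eps` guard against division by zero, and the toy sine signals are assumptions made for illustration. They are not taken from the upstream evaluation code.

```python
import math


def blind_erle_db(mic, enh, eps=1e-12):
    """Blind ERLE in dB: 10*log10(mean(mic^2) / mean(enh^2)).

    `mic` and `enh` are equal-length sequences of samples. `eps` is an
    assumed guard so that a silent output does not divide by zero.
    """
    if len(mic) != len(enh) or len(mic) == 0:
        raise ValueError("mic and enh must be non-empty and the same length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(x * x for x in enh) / len(enh)
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


n_samples = 16000
echo = [math.sin(0.01 * n) for n in range(n_samples)]
speech = [math.sin(0.037 * n) for n in range(n_samples)]

# Far-end single-talk: the microphone holds echo only, and the enhanced
# output is that echo attenuated by a factor of 10 in amplitude.
erle_single_talk = blind_erle_db(echo, [0.1 * x for x in echo])

# Active near-end speech: the microphone holds echo plus speech, and the
# enhanced output is the speech alone (echo removed completely).
mic_double_talk = [e + s for e, s in zip(echo, speech)]
erle_double_talk = blind_erle_db(mic_double_talk, speech)
```

In the first case the ratio comes out at about 20 dB, matching the 10x amplitude reduction. In the second case the echo is removed entirely, yet the ratio is only about 3 dB because speech dominates both powers, which is the understatement the note describes.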
|
| 136 |
|
| 137 |
## Architecture
|
| 138 |
|