# Supertonic Quantized INT8 – Offline TTS (Shadow0482)

This repository contains **INT8-optimized ONNX models** for the Supertonic text-to-speech
pipeline. They are quantized versions of the official Supertonic models, designed for
**offline, low-latency, CPU-friendly inference**.

FP16 versions exist for experimentation, but the FP16 vocoder currently contains a type
mismatch (`float32` vs `float16`) at a `Div` node, so FP16 inference is **not stable**.
Therefore, **INT8 is the recommended format** for real-world offline use.
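To reproduce the FP16 failure locally, a minimal probe like the one below is enough; it just
tries to build an ONNX Runtime session for the experimental FP16 vocoder. The file name
`fp16/vocoder.fp16.onnx` is an assumption about how the files in `fp16/` are named, so adjust
it to match the directory contents.

```python
# Minimal probe for the FP16 vocoder issue.
# NOTE: the path below is an assumption; adjust it to the actual
# file name inside the fp16/ directory.
import onnxruntime as ort

try:
    ort.InferenceSession(
        "fp16/vocoder.fp16.onnx",
        providers=["CPUExecutionProvider"],
    )
    print("FP16 vocoder loaded (the issue may be fixed in your build).")
except Exception as err:
    # On affected builds this reports a float32/float16 mismatch at a Div node.
    print(f"FP16 vocoder failed to load: {err}")
```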
---

# Features

### 100% Offline Execution
No network connection is needed. Load the ONNX models directly with ONNX Runtime.

### Full Supertonic Inference Stack
- Text Encoder
- Duration Predictor
- Vector Estimator
- Vocoder

### INT8 Dynamic Quantization
- Dramatically reduced model sizes
- CPU-friendly inference
- Very low memory usage
- Compatible with the ONNX Runtime `CPUExecutionProvider`
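For context, dynamic INT8 quantization of this kind is usually produced with ONNX Runtime's
quantization tooling. The sketch below shows the general recipe; the file names are
illustrative, and this is not necessarily the exact invocation used to build the models in
this repository.

```python
# Sketch: dynamic INT8 quantization with ONNX Runtime's tooling.
# File names are illustrative, not the exact ones used for this repository.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="vocoder.onnx",        # original FP32 model
    model_output="vocoder.int8.onnx",  # quantized result
    weight_type=QuantType.QInt8,       # 8-bit signed integer weights
)
```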
### Intelligible Audio Output
Produces clearly understandable speech while being drastically faster on CPUs.

---

# Repository Structure

```
int8_dynamic/
    duration_predictor.int8.onnx
    text_encoder.int8.onnx
    vector_estimator.int8.onnx
    vocoder.int8.onnx

fp16/
    (experimental FP16 models; the vocoder is currently unstable)
```

Only the **INT8 directory** is guaranteed stable.
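A quick way to sanity-check a download is to list the INT8 files and their on-disk sizes;
this small snippet assumes the directory layout shown above.

```python
# List the INT8 model files and their on-disk sizes.
from pathlib import Path

for f in sorted(Path("int8_dynamic").glob("*.int8.onnx")):
    print(f"{f.name}: {f.stat().st_size / 1e6:.1f} MB")
```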
---

# Test Sentence Used in Benchmark

```
Greetings! You are listening to your newly quantized model.
I have been squished, squeezed, compressed, minimized, optimized,
digitized, and lightly traumatized to save disk space.
The testing framework automatically verifies my integrity,
measures how much weight I lost,
and checks if I can still talk without glitching into a robot dolphin.
If you can hear this clearly, the quantization ritual was a complete success.
```
---

# Benchmark Summary (CPU)

| Model | Precision | Time | Output | Status |
|-------|-----------|-----:|--------|--------|
| INT8 Dynamic | int8 | ~3.0–7.0 s (varies) | `*.wav` | OK |
| FP32 (baseline) | float32 | ~2–4× slower than INT8 | `*.wav` | OK |
| FP16 | mixed | n/a | none | FAILED: cannot load vocoder |

> Exact times vary with hardware, runtime load, and ONNX Runtime version; one Colab CPU run
> measured 6.644 s for INT8 while FP16 failed to load. The goal of the benchmark is simply to
> confirm that the quantized models load correctly and produce intelligible audio for the same
> input text.
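Timings like these can be reproduced with a plain wall-clock measurement around a session
run. The sketch below times just the vocoder stage as an illustration; the latent shape is
hypothetical, since in the real pipeline the latent comes from the vector estimator stage
(see the full script further below).

```python
# Sketch: wall-clock timing of a single vocoder pass.
# The latent shape here is hypothetical; in the real pipeline the latent
# comes from the vector estimator stage.
import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "int8_dynamic/vocoder.int8.onnx",
    providers=["CPUExecutionProvider"],
)
latent = np.zeros((1, 256, 200), dtype=np.float32)

start = time.perf_counter()
sess.run(None, {"latent": latent})
print(f"Vocoder pass took {time.perf_counter() - start:.2f}s")
```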
---

# Offline Inference Guide (Python)

Below is a clean Python script to run **fully offline INT8 inference**.

---

# Requirements

```
pip install onnxruntime numpy soundfile
```
---

# offline_tts_int8.py

```python
import json
from pathlib import Path

import numpy as np
import onnxruntime as ort
import soundfile as sf

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")  # folder containing *.int8.onnx
VOICE_STYLE = "assets/voice_styles/M1.json"
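# NOTE: VOICE_STYLE is kept for reference but is not used by this
# minimal script; the style inputs below are zero-filled placeholders.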

text_encoder_path = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
with open(unicode_path, "r", encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str):
    # Map each character to its token id, falling back to <unk>.
    ids = []
    for ch in text:
        if ch in tokenizer["token2idx"]:
            ids.append(tokenizer["token2idx"][ch])
        else:
            ids.append(tokenizer["token2idx"]["<unk>"])
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path):
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"],
    )

sess_text = load_session(text_encoder_path)
sess_dur = load_session(duration_pred_path)
sess_vec = load_session(vector_estimator_path)
sess_voc = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl,
    },
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp,
    },
)[0]

durations = np.maximum(dur_out.astype(int), 1)
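# NOTE: durations are computed for completeness but are not consumed by
# the remaining steps of this minimal script.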

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER -> WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```
---

# Output

After running:

```
python offline_tts_int8.py
```

You will get:

```
output_int8.wav
```

Playable offline on any system.
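If you also want to listen to the result from Python, one option is the `sounddevice`
package (an extra dependency, not listed in the requirements above):

```python
# Optional: play the generated file from Python.
# Requires an extra package: pip install sounddevice
import sounddevice as sd
import soundfile as sf

data, sr = sf.read("output_int8.wav")
sd.play(data, sr)
sd.wait()  # block until playback finishes
```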
---

# Notes

* Only the **INT8** models are stable and recommended.
* The FP16 vocoder currently fails due to a type mismatch at a `Div` node.
* No internet connection is required for INT8 inference.
* These models are well suited to embedded or low-spec machines.

---

# License

The models follow Supertone's licensing terms for the original Supertonic release
(MIT code, OpenRAIL-M model); the quantized versions follow the same licensing.