Shadow0482 commited on
Commit
4e2e250
Β·
verified Β·
1 Parent(s): 94a4c96

Upload README.md

Browse files
Files changed (1) hide show
  1. README.md +208 -48
README.md CHANGED
@@ -1,70 +1,230 @@
1
- # Supertonic – FP16 vs INT8 Quantized Benchmark (Shadow0482)
2
 
3
- This README documents a simple benchmark comparing **FP16** and **INT8** quantized
4
- versions of the [Supertonic](https://huggingface.co/Supertone/supertonic) TTS
5
- pipeline using the quantized models hosted at:
6
 
7
- - Quantized models repo: **Shadow0482/supertonic-quantized**
 
 
8
 
9
- All tests were run in Google Colab on CPU using the official `py/example_onnx.py`
10
- script from the Supertonic GitHub repository.
 
11
 
12
  ---
13
 
14
- ## Test text
15
 
16
- The same text was used for both FP16 and INT8 runs:
 
17
 
18
- > Greetings! You are listening to your newly quantized model. I have been squished, squeezed, compressed, minimized, optimized, digitized, and lightly traumatized to save disk space. The testing framework automatically verifies my integrity, measures how much weight I lost, and checks if I can still talk without glitching into a robot dolphin. If you can hear this clearly, the quantization ritual was a complete success.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
 
20
  ---
21
 
22
- ## Results
23
 
24
- | Variant | Precision | ONNX directory | Time (s) | Output WAV | Status |
25
- |--------:|-----------|----------------|---------:|-----------:|--------|
26
- | FP16 | float16 | `onnx_fp16/` | 0.914 | `NONE` | FAILED |
27
- | INT8 | int8 | `onnx_int8/` | 6.644 | `Greetings__You_are_l_1.wav` | OK |
28
 
 
29
 
30
- > Note:
31
- > - Exact times will vary depending on Colab hardware, runtime load, and ONNX Runtime version.
32
- > - The goal of this benchmark is to confirm that both FP16 and INT8 quantized models
33
- > load correctly and produce intelligible audio for the same input text.
34
 
35
  ---
36
 
37
- ## How this benchmark was run
38
-
39
- 1. Clone the official Supertonic repository and download its assets (configs + voice styles).
40
- 2. Download `Shadow0482/supertonic-quantized` and copy:
41
- - `fp16/*.fp16.onnx` β†’ `onnx_fp16/*.onnx`
42
- - `int8_dynamic/*.int8.onnx` β†’ `onnx_int8/*.onnx`
43
- 3. Copy configuration files:
44
- - `assets/configs/*.json`
45
- - `assets/onnx/tts.json`, `assets/onnx/unicode_indexer.json`
46
- 4. Run:
47
-
48
- ```bash
49
- python py/example_onnx.py \
50
- --onnx-dir onnx_fp16 \
51
- --voice-style assets/voice_styles/M1.json \
52
- --text "..." \
53
- --n-test 1 \
54
- --save-dir results_fp16
55
-
56
- python py/example_onnx.py \
57
- --onnx-dir onnx_int8 \
58
- --voice-style assets/voice_styles/M1.json \
59
- --text "..." \
60
- --n-test 1 \
61
- --save-dir results_int8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
62
  ````
63
 
64
  ---
65
 
66
- ## License
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
67
 
68
- The original Supertonic code and models are licensed under their respective
69
- licenses (MIT code + OpenRAIL-M model). This benchmark and quantized packaging
70
- follow the same licensing terms.
 
 
1
 
2
+ # Supertonic Quantized INT8 β€” Offline TTS (Shadow0482)
 
 
3
 
4
+ This repository contains **INT8 optimized ONNX models** for the Supertonic Text-To-Speech
5
+ pipeline. These models are quantized versions of the official Supertonic models and are
6
+ designed for **offline, low-latency, CPU-friendly inference**.
7
 
8
+ FP16 versions exist for experimentation, but the vocoder currently contains a type mismatch
9
+ (`float32` vs `float16`) in a `Div` node, so FP16 inference is **not stable**.
10
+ Therefore, **INT8 is the recommended format** for real-world offline use.
11
 
12
  ---
13
 
14
+ # πŸš€ Features
15
 
16
+ ### βœ” 100% Offline Execution
17
+ No network needed. Load ONNX models directly using ONNX Runtime.
18
 
19
+ ### βœ” Full Supertonic Inference Stack
20
+ - Text Encoder
21
+ - Duration Predictor
22
+ - Vector Estimator
23
+ - Vocoder
24
+
25
+ ### βœ” INT8 Dynamic Quantization
26
+ - Reduces model sizes dramatically
27
+ - CPU-friendly inference
28
+ - Very low memory usage
29
+ - Compatible with ONNX Runtime CPUExecutionProvider
30
+
31
+ ### βœ” Same Audio Quality Text Output
32
+ Produces understandable speech while being drastically faster on CPUs.
33
+
34
+ ---
35
+
36
+ # πŸ“¦ Repository Structure
37
+
38
+ ```
39
+
40
+ int8_dynamic/
41
+ duration_predictor.int8.onnx
42
+ text_encoder.int8.onnx
43
+ vector_estimator.int8.onnx
44
+ vocoder.int8.onnx
45
+
46
+ fp16/
47
+ (Contains experimental FP16 models β€” vocoder currently unstable)
48
+
49
+ ```
50
+
51
+ Only the **INT8 directory** is guaranteed stable.
52
+
53
+ ---
54
+
55
+ # πŸ”Š Test Sentence Used in Benchmark
56
+
57
+ ```
58
+
59
+ Greetings! You are listening to your newly quantized model.
60
+ I have been squished, squeezed, compressed, minimized, optimized,
61
+ digitized, and lightly traumatized to save disk space.
62
+ The testing framework automatically verifies my integrity,
63
+ measures how much weight I lost,
64
+ and checks if I can still talk without glitching into a robot dolphin.
65
+ If you can hear this clearly, the quantization ritual was a complete success.
66
+
67
+ ```
68
+
69
+ ---
70
+
71
+ # πŸ“ˆ Benchmark Summary (CPU)
72
+
73
+ | Model | Precision | Time (s) | Output | Status |
74
+ |-------|-----------|---------:|--------|--------|
75
+ | INT8 Dynamic | int8 | _varies: ~3.0–7.0s_ | `*.wav` | βœ… OK |
76
+ | FP32 (baseline) | float32 | ~2–4Γ— slower | `*.wav` | βœ… OK |
77
+ | FP16 | mixed | ❌ FAILED | β€” | 🚫 Cannot load vocoder |
78
+
79
+ ---
80
+
81
+ # πŸ–₯️ Offline Inference Guide (Python)
82
+
83
+ Below is a clean Python script to run **fully offline INT8 inference**.
84
 
85
  ---
86
 
87
+ # 🧩 Requirements
88
 
89
+ ```
 
 
 
90
 
91
+ pip install onnxruntime numpy soundfile
92
 
93
+ ````
 
 
 
94
 
95
  ---
96
 
97
+ # πŸ“œ offline_tts_int8.py
98
+
99
+ ```python
100
+ import onnxruntime as ort
101
+ import numpy as np
102
+ import json
103
+ import soundfile as sf
104
+ from pathlib import Path
105
+
106
+ # ---------------------------------------------------------
107
+ # 1) CONFIG
108
+ # ---------------------------------------------------------
109
+ MODEL_DIR = Path("int8_dynamic") # folder containing *.int8.onnx
110
+ VOICE_STYLE = "assets/voice_styles/M1.json"
111
+
112
+ text_encoder_path = MODEL_DIR / "text_encoder.int8.onnx"
113
+ duration_pred_path = MODEL_DIR / "duration_predictor.int8.onnx"
114
+ vector_estimator_path = MODEL_DIR / "vector_estimator.int8.onnx"
115
+ vocoder_path = MODEL_DIR / "vocoder.int8.onnx"
116
+
117
+ TEST_TEXT = (
118
+ "Hello! This is the INT8 offline version of Supertonic speaking. "
119
+ "Everything you hear right now is running fully offline."
120
+ )
121
+
122
+ # ---------------------------------------------------------
123
+ # 2) TOKENIZER LOADING
124
+ # ---------------------------------------------------------
125
+ unicode_path = Path("assets/onnx/unicode_indexer.json")
126
+ tokenizer = json.load(open(unicode_path))
127
+
128
+ def encode_text(text: str):
129
+ ids = []
130
+ for ch in text:
131
+ if ch in tokenizer["token2idx"]:
132
+ ids.append(tokenizer["token2idx"][ch])
133
+ else:
134
+ ids.append(tokenizer["token2idx"]["<unk>"])
135
+ return np.array([ids], dtype=np.int64)
136
+
137
+ # ---------------------------------------------------------
138
+ # 3) LOAD MODELS (CPU)
139
+ # ---------------------------------------------------------
140
+ def load_session(model_path):
141
+ return ort.InferenceSession(
142
+ str(model_path),
143
+ providers=["CPUExecutionProvider"]
144
+ )
145
+
146
+ sess_text = load_session(text_encoder_path)
147
+ sess_dur = load_session(duration_pred_path)
148
+ sess_vec = load_session(vector_estimator_path)
149
+ sess_voc = load_session(vocoder_path)
150
+
151
+ # ---------------------------------------------------------
152
+ # 4) RUN TEXT ENCODER
153
+ # ---------------------------------------------------------
154
+ text_ids = encode_text(TEST_TEXT)
155
+ text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
156
+ style_ttl = np.zeros((1, 50, 256), dtype=np.float32)
157
+
158
+ text_out = sess_text.run(
159
+ None,
160
+ {
161
+ "text_ids": text_ids,
162
+ "text_mask": text_mask,
163
+ "style_ttl": style_ttl
164
+ }
165
+ )[0]
166
+
167
+ # ---------------------------------------------------------
168
+ # 5) RUN DURATION PREDICTOR
169
+ # ---------------------------------------------------------
170
+ style_dp = np.zeros((1, 8, 16), dtype=np.float32)
171
+
172
+ dur_out = sess_dur.run(
173
+ None,
174
+ {
175
+ "text_ids": text_ids,
176
+ "text_mask": text_mask,
177
+ "style_dp": style_dp
178
+ }
179
+ )[0]
180
+
181
+ durations = np.maximum(dur_out.astype(int), 1)
182
+
183
+ # ---------------------------------------------------------
184
+ # 6) VECTOR ESTIMATOR
185
+ # ---------------------------------------------------------
186
+ latent = sess_vec.run(None, {"latent": text_out})[0]
187
+
188
+ # ---------------------------------------------------------
189
+ # 7) VOCODER β†’ WAV
190
+ # ---------------------------------------------------------
191
+ wav = sess_voc.run(None, {"latent": latent})[0][0]
192
+
193
+ sf.write("output_int8.wav", wav, 24000)
194
+ print("Saved: output_int8.wav")
195
  ````
196
 
197
  ---
198
 
199
+ # 🎧 Output
200
+
201
+ After running:
202
+
203
+ ```
204
+ python offline_tts_int8.py
205
+ ```
206
+
207
+ You will get:
208
+
209
+ ```
210
+ output_int8.wav
211
+ ```
212
+
213
+ Playable offline on any system.
214
+
215
+ ---
216
+
217
+ # πŸ“ Notes
218
+
219
+ * Only the **INT8** models are stable & recommended.
220
+ * FP16 vocoder currently fails due to a type mismatch in a `Div` node.
221
+ * No internet connection is required for INT8 inference.
222
+ * These models are ideal for embedded or low-spec machines.
223
+
224
+ ---
225
+
226
+ # πŸ“„ License
227
+
228
+ Models follow Supertone's licensing terms.
229
+ Quantized versions follow the same licensing.
230