# Supertonic Quantized INT8 – Offline TTS (Shadow0482)

This repository contains **INT8-optimized ONNX models** for the Supertonic text-to-speech
pipeline. They are quantized versions of the official Supertonic models, designed for
**offline, low-latency, CPU-friendly inference**.

FP16 versions exist for experimentation, but the FP16 vocoder currently contains a type
mismatch (`float32` vs `float16`) at a `Div` node, so FP16 inference is **not stable**.
Therefore, **INT8 is the recommended format** for real-world offline use.
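To reproduce the FP16 failure locally, a minimal probe like the one below is enough; it just
tries to build an ONNX Runtime session for the experimental FP16 vocoder. The file name
`fp16/vocoder.fp16.onnx` is an assumption about how the files in `fp16/` are named, so adjust
it to match the directory contents.

```python
# Minimal probe for the FP16 vocoder issue.
# NOTE: the path below is an assumption; adjust it to the actual
# file name inside the fp16/ directory.
import onnxruntime as ort

try:
    ort.InferenceSession(
        "fp16/vocoder.fp16.onnx",
        providers=["CPUExecutionProvider"],
    )
    print("FP16 vocoder loaded (the issue may be fixed in your build).")
except Exception as err:
    # On affected builds this reports a float32/float16 mismatch at a Div node.
    print(f"FP16 vocoder failed to load: {err}")
```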
---

# Features

### 100% Offline Execution
No network connection is needed. Load the ONNX models directly with ONNX Runtime.

### Full Supertonic Inference Stack
- Text Encoder
- Duration Predictor
- Vector Estimator
- Vocoder

### INT8 Dynamic Quantization
- Dramatically reduced model sizes
- CPU-friendly inference
- Very low memory usage
- Compatible with the ONNX Runtime `CPUExecutionProvider`
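For context, dynamic INT8 quantization of this kind is usually produced with ONNX Runtime's
quantization tooling. The sketch below shows the general recipe; the file names are
illustrative, and this is not necessarily the exact invocation used to build the models in
this repository.

```python
# Sketch: dynamic INT8 quantization with ONNX Runtime's tooling.
# File names are illustrative, not the exact ones used for this repository.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="vocoder.onnx",        # original FP32 model
    model_output="vocoder.int8.onnx",  # quantized result
    weight_type=QuantType.QInt8,       # 8-bit signed integer weights
)
```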
### Intelligible Audio Output
Produces clearly understandable speech while being drastically faster on CPUs.

---

# Repository Structure

```
int8_dynamic/
    duration_predictor.int8.onnx
    text_encoder.int8.onnx
    vector_estimator.int8.onnx
    vocoder.int8.onnx

fp16/
    (experimental FP16 models; the vocoder is currently unstable)
```

Only the **INT8 directory** is guaranteed stable.
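A quick way to sanity-check a download is to list the INT8 files and their on-disk sizes;
this small snippet assumes the directory layout shown above.

```python
# List the INT8 model files and their on-disk sizes.
from pathlib import Path

for f in sorted(Path("int8_dynamic").glob("*.int8.onnx")):
    print(f"{f.name}: {f.stat().st_size / 1e6:.1f} MB")
```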
---

# Test Sentence Used in Benchmark

```
Greetings! You are listening to your newly quantized model.
I have been squished, squeezed, compressed, minimized, optimized,
digitized, and lightly traumatized to save disk space.
The testing framework automatically verifies my integrity,
measures how much weight I lost,
and checks if I can still talk without glitching into a robot dolphin.
If you can hear this clearly, the quantization ritual was a complete success.
```
---

# Benchmark Summary (CPU)

| Model | Precision | Time | Output | Status |
|-------|-----------|-----:|--------|--------|
| INT8 Dynamic | int8 | ~3.0–7.0 s (varies) | `*.wav` | OK |
| FP32 (baseline) | float32 | ~2–4× slower than INT8 | `*.wav` | OK |
| FP16 | mixed | n/a | none | FAILED: cannot load vocoder |

> Exact times vary with hardware, runtime load, and ONNX Runtime version; one Colab CPU run
> measured 6.644 s for INT8 while FP16 failed to load. The goal of the benchmark is simply to
> confirm that the quantized models load correctly and produce intelligible audio for the same
> input text.
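Timings like these can be reproduced with a plain wall-clock measurement around a session
run. The sketch below times just the vocoder stage as an illustration; the latent shape is
hypothetical, since in the real pipeline the latent comes from the vector estimator stage
(see the full script further below).

```python
# Sketch: wall-clock timing of a single vocoder pass.
# The latent shape here is hypothetical; in the real pipeline the latent
# comes from the vector estimator stage.
import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "int8_dynamic/vocoder.int8.onnx",
    providers=["CPUExecutionProvider"],
)
latent = np.zeros((1, 256, 200), dtype=np.float32)

start = time.perf_counter()
sess.run(None, {"latent": latent})
print(f"Vocoder pass took {time.perf_counter() - start:.2f}s")
```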
---

# Offline Inference Guide (Python)

Below is a clean Python script to run **fully offline INT8 inference**.

---

# Requirements

```
pip install onnxruntime numpy soundfile
```
---

# offline_tts_int8.py

```python
import json
from pathlib import Path

import numpy as np
import onnxruntime as ort
import soundfile as sf

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")  # folder containing *.int8.onnx
VOICE_STYLE = "assets/voice_styles/M1.json"
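# NOTE: VOICE_STYLE is kept for reference but is not used by this
# minimal script; the style inputs below are zero-filled placeholders.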

text_encoder_path = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
with open(unicode_path, "r", encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str):
    # Map each character to its token id, falling back to <unk>.
    ids = []
    for ch in text:
        if ch in tokenizer["token2idx"]:
            ids.append(tokenizer["token2idx"][ch])
        else:
            ids.append(tokenizer["token2idx"]["<unk>"])
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path):
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"],
    )

sess_text = load_session(text_encoder_path)
sess_dur = load_session(duration_pred_path)
sess_vec = load_session(vector_estimator_path)
sess_voc = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl,
    },
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp,
    },
)[0]

durations = np.maximum(dur_out.astype(int), 1)
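# NOTE: durations are computed for completeness but are not consumed by
# the remaining steps of this minimal script.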

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER -> WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```
---

# Output

After running:

```
python offline_tts_int8.py
```

You will get:

```
output_int8.wav
```

Playable offline on any system.
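If you also want to listen to the result from Python, one option is the `sounddevice`
package (an extra dependency, not listed in the requirements above):

```python
# Optional: play the generated file from Python.
# Requires an extra package: pip install sounddevice
import sounddevice as sd
import soundfile as sf

data, sr = sf.read("output_int8.wav")
sd.play(data, sr)
sd.wait()  # block until playback finishes
```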
---

# Notes

* Only the **INT8** models are stable and recommended.
* The FP16 vocoder currently fails due to a type mismatch at a `Div` node.
* No internet connection is required for INT8 inference.
* These models are well suited to embedded or low-spec machines.

---

# License

The models follow Supertone's licensing terms for the original Supertonic release
(MIT code, OpenRAIL-M model); the quantized versions follow the same licensing.