Kokoro 1.0 [ONNX]

Notes

The speed input uses float32 in this export; the original export script declares it as int64. To reproduce the export, clone https://github.com/hexgrad/kokoro, apply onnx_exporter.patch to the clone, and run examples/export.py.
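One way to confirm the speed dtype on a given export is to inspect the model's declared inputs. The sketch below is illustrative only: the helper names input_types and speed_is_float32 are hypothetical, and it assumes the session object exposes ONNX Runtime's get_inputs(), whose entries carry name and type attributes.

```python
from typing import Any


def input_types(session: Any) -> dict[str, str]:
    """Map each model input name to its ONNX Runtime type string."""
    # Hypothetical helper: works with any object that offers get_inputs().
    return {node.name: node.type for node in session.get_inputs()}


def speed_is_float32(session: Any) -> bool:
    """Return True when the "speed" input is declared as a float32 tensor."""
    # ONNX Runtime reports float32 tensors as "tensor(float)" and
    # int64 tensors as "tensor(int64)".
    return input_types(session).get("speed") == "tensor(float)"
```

With onnxruntime installed and the model file present, speed_is_float32(ort.InferenceSession("Kokoro-1.0-FP32.onnx")) should return True for this export if the note above holds; an export made with the unpatched script would be expected to return False.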

Voice files were converted from PyTorch format to HDF5 with voice_pt_to_h5.py.

Usage

import soundfile as sf  # type: ignore
import h5py  # type: ignore
import onnxruntime as ort  # type: ignore
import numpy as np


if __name__ == "__main__":
    tokens: list[int] = [0, 52, 157, 135, 123, 16, 64, 156, 102, 147, 83, 56, 16, 65, 102, 54, 16, 44, 83, 53, 156, 138, 55, 16, 53, 54, 156, 102, 123, 16, 156, 31, 56, 54, 51, 16, 65, 157, 86, 56, 16, 52, 63, 16, 54, 156, 135, 53, 16, 156, 102, 56, 62, 63, 16, 52, 135, 123, 16, 50, 156, 69, 123, 62, 4, 16, 50, 157, 63, 16, 54, 156, 135, 53, 61, 16, 157, 39, 62, 61, 156, 25, 46, 3, 16, 46, 123, 156, 51, 55, 68, 4, 16, 50, 157, 63, 16, 54, 156, 135, 53, 61, 16, 102, 56, 61, 156, 25, 46, 3, 16, 83, 65, 156, 24, 53, 83, 56, 68, 4, 0]
    voice: str = "af_heart"
    speed: float = 1.0

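    # Load the ONNX model.
    model_session: ort.InferenceSession = ort.InferenceSession("Kokoro-1.0-FP32.onnx")

    # The voice file is indexed by token count, not counting the leading and
    # trailing 0 tokens.
    with h5py.File(f"voices/{voice}.h5", mode="r") as file:
        dataset: np.ndarray = np.array(file[str(len(tokens) - 2)])  # type: ignore

    waveform, duration = model_session.run(  # type: ignore
        None,
        {
            # Set int64 explicitly: NumPy's default integer type is platform-dependent.
            "input_ids": np.array(tokens, dtype=np.int64).reshape(1, -1),
            "style": dataset.reshape(1, -1),
            "speed": np.array([speed], dtype=np.float32),
        },
    )
    # Write the audio at a 24 kHz sample rate.
    sf.write("output.wav", waveform, 24000)  # type: ignore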
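The token list in the example already includes a 0 at each end, and the voice-file key is the token count without those two entries. A minimal sketch of that preparation step follows. The function name prepare_tokens, the treatment of 0 as a pad id, and the 510-token limit are assumptions: the limit comes from the upstream Kokoro model's 512-token context, not from this repository.

```python
MAX_PHONEME_TOKENS = 510  # assumption: 512-token context minus two pad tokens
PAD_TOKEN = 0  # assumption: 0 is the pad id, as in the token list above


def prepare_tokens(phoneme_ids: list[int]) -> tuple[list[int], str]:
    """Wrap phoneme ids in pad tokens and return them with the voice-file key."""
    if not phoneme_ids:
        raise ValueError("expected at least one phoneme token")
    if len(phoneme_ids) > MAX_PHONEME_TOKENS:
        raise ValueError(
            f"too many tokens: {len(phoneme_ids)} > {MAX_PHONEME_TOKENS}"
        )
    tokens = [PAD_TOKEN, *phoneme_ids, PAD_TOKEN]
    # Same key as the example above: the token count without the two pads.
    voice_key = str(len(tokens) - 2)
    return tokens, voice_key
```

For instance, prepare_tokens([52, 157, 135]) returns ([0, 52, 157, 135, 0], "3"), so the style vector would be read from the dataset named "3" in the voice file. Rejecting empty input is a choice made for this sketch; whether the voice files contain a "0" dataset is not stated here.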


Model tree for alexisStacksCode/Kokoro-1.0-ONNX

Base model: yl4579/StyleTTS2-LJSpeech
Finetuned: hexgrad/Kokoro-82M
Quantized (37): this model