Kokoro Overview

Description:

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost‑efficient. Kokoro can be deployed anywhere from production environments to personal projects.
Kokoro was developed by hexgrad.
This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see the Non-NVIDIA hexgrad Model Card.

License/Terms of Use:

Apache-2.0

Deployment Geography:

Global

Use Case:

Developers and enterprises building text‑to‑speech applications, voice assistants, and audio generation services. Suitable for any domain that requires high‑quality, low‑latency speech synthesis, from production APIs to personal projects.

Release Date:

HuggingFace: 05/29/2026 via [URL]

Reference(s):

StyleTTS 2
ISTFTNet

Model Architecture:

Architecture Type: Transformer
Network Architecture: StyleTTS 2, ISTFTNet, Decoder only
This model was developed based on yl4579/StyleTTS2-LJSpeech.
Number of model parameters: 82M (8.2*10^7)

Input:

Input Type(s): Text
Input Format(s): String
Input Parameters: One-Dimensional (1D)
Other properties related to input:
Input Length: maximum of approximately 500 tokens; splitting the input into chunks of 100-200 tokens is recommended.
Input Language: English (full support); Japanese, Mandarin Chinese, Spanish, French, Hindi, Italian, Brazilian Portuguese (partial support).
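
The chunking recommendation above can be sketched as follows. This is a minimal illustration, not part of the Kokoro tooling: the function name `split_into_chunks` is hypothetical, and tokens are approximated by whitespace-separated words, which is an assumption since the card does not define how tokens are counted.

```python
import re


def split_into_chunks(text: str, max_tokens: int = 150) -> list[str]:
    """Split text at sentence boundaries into chunks of at most max_tokens.

    Assumption: a "token" is approximated by a whitespace-separated word.
    The model's real tokenizer may count differently.
    """
    # Break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        words = sentence.split()
        # A single over-long sentence is cut hard into max_tokens-sized pieces.
        while len(words) > max_tokens:
            if current:
                chunks.append(" ".join(current))
                current, count = [], 0
            chunks.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        # Start a new chunk if this sentence would overflow the current one.
        if current and count + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The default of 150 sits inside the recommended 100-200 range; each returned chunk would then be synthesized separately and the audio concatenated.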

Output:

Output Type(s): Audio
Output Format: Audio (.wav, .mp3)
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Audio output duration is approximately one minute per 1,000 characters of input text.
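
As a rough illustration of the ratio above, the expected audio length can be estimated from the character count. The helper name `estimate_duration_seconds` is hypothetical, and the result is only an approximation because actual duration depends on the voice, language, and text content.

```python
def estimate_duration_seconds(text: str, seconds_per_1000_chars: float = 60.0) -> float:
    """Estimate output audio length from input length.

    Uses the approximate ratio stated in this card:
    about one minute of audio per 1,000 input characters.
    """
    return len(text) / 1000.0 * seconds_per_1000_chars
```

For example, a 250-character input would be expected to yield roughly 15 seconds of audio.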

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

ONNXRuntime win-x64-gpu_cuda13-1.24.3

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Lovelace
NVIDIA Turing

Preferred/Supported Operating System(s):

Windows 10/11

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

v1.0

Training, Testing, and Evaluation Datasets:

Training Dataset:

Link: Undisclosed
Data Modality: Audio
Audio Training Data Size: Less than 10,000 Hours
Data Collection Method by dataset: Hybrid: Automated, Synthetic
Labeling Method by dataset: Automated
Properties (Quantity, Dataset Descriptions, Sensor(s)): Kokoro was trained exclusively on permissive, non‑copyrighted audio data and IPA phoneme labels. The dataset comprises public‑domain recordings, audio released under permissive licenses, and synthetic audio generated by closed‑source TTS models. Overall, the training corpus amounts to a few hundred hours of audio.

Testing Dataset:

Link: Undisclosed
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties (Quantity, Dataset Descriptions, Sensor(s)): Undisclosed

Evaluation Dataset:

Link: Undisclosed
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties (Quantity, Dataset Descriptions, Sensor(s)): Undisclosed

Inference:

Acceleration Engine:

TensorRT
CUDA
CoreML
Xnnpack
Nnapi
DirectML
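
With ONNX Runtime, the engines listed above correspond to execution providers that are requested when a session is created. The sketch below shows one way to pick from whatever providers the installed build exposes. The preference order, the CPU fallback, and the helper names `pick_providers` and `create_session` are assumptions made for illustration; the card does not prescribe them, and the model path is left to the caller.

```python
# ONNX Runtime execution provider names for the engines listed in this card.
# The ordering is an assumed preference, not one stated by the card.
PREFERRED_PROVIDERS = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",  # DirectML
    "NnapiExecutionProvider",
    "XnnpackExecutionProvider",
]


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Return preferred providers that are available, in preference order.

    CPUExecutionProvider is appended as a fallback (an addition of this
    sketch; it is not among the engines listed in the card).
    """
    chosen = [name for name in preferred if name in available]
    chosen.append("CPUExecutionProvider")
    return chosen


def create_session(model_path):
    """Create an inference session using the best available providers."""
    # Imported lazily so pick_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

ONNX Runtime tries the providers in the order given, so placing the CPU provider last keeps it as a fallback for any operators the accelerated providers do not handle.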

Test Hardware:

NVIDIA GeForce RTX 4090
NVIDIA GeForce RTX 3070 Ti
NVIDIA GeForce RTX 2060

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or concerns here.

Model tree for nvidia/kokoro-82M-onnx-opt

Base model: yl4579/StyleTTS2-LJSpeech
Finetuned: hexgrad/Kokoro-82M
Quantized: this model

Papers for nvidia/kokoro-82M-onnx-opt

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models (arXiv 2306.07691, published Jun 13, 2023)
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform (arXiv 2203.02395, published Mar 4, 2022)