Text-to-Speech
English

🐈 GitHub: https://github.com/hexgrad/kokoro

🚀 Demo: https://hf.co/spaces/hexgrad/Kokoro-TTS

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Releases

| Model | Published | Training Data | Langs & Voices | SHA256 |
|-------|-----------|---------------|----------------|--------|
| v1.0  | 2025 Jan 27 | Few hundred hrs | 8 & 54 | 496dba11 |
| v0.19 | 2024 Dec 25 | <100 hrs | 1 & 10 | 3b0c392f |

| Training Costs | v0.19 | v1.0 | Total |
|----------------|-------|------|-------|
| A100 80GB GPU hours | 500 | 500 | 1000 |
| Average hourly rate | $0.80/h | $1.20/h | $1/h |
| Cost in USD | $400 | $600 | $1000 |
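
The blended $1/h rate and the $1000 total follow directly from the two runs above; a quick sanity check of the arithmetic:

# Cost figures from the table above (USD per A100 80GB GPU hour)
hours = {"v0.19": 500, "v1.0": 500}
rates = {"v0.19": 0.80, "v1.0": 1.20}
costs = {k: hours[k] * rates[k] for k in hours}  # {'v0.19': 400.0, 'v1.0': 600.0}
total_hours = sum(hours.values())                # 1000
total_cost = sum(costs.values())                 # 1000.0
blended_rate = total_cost / total_hours          # 1.0 USD per GPU hour
print(total_hours, total_cost, blended_rate)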

Usage

You can run this basic cell on Google Colab and listen to the generated samples. For more languages and details, see Advanced Usage.

!pip install -q "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch  # kokoro runs on PyTorch
# 'a' selects the American English pipeline; other lang_codes are covered in Advanced Usage
pipeline = KPipeline(lang_code='a')
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
# af_heart is an American English voice; the pipeline yields one audio chunk per text segment
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)  # segment index, graphemes (text), phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))  # Kokoro outputs 24 kHz audio
    sf.write(f'{i}.wav', audio, 24000)  # save each segment as its own WAV file
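
The loop above writes one WAV file per text segment. To get a single file for the whole input instead, collect the chunks and concatenate them before writing. A minimal sketch, assuming each yielded audio chunk behaves like a 1-D float array at 24 kHz, as in the loop above:

import numpy as np
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')  # American English, as above
chunks = []
for _, _, audio in pipeline("Hello! This is a single-file example.", voice='af_heart'):
    chunks.append(np.asarray(audio))  # works whether the chunk is a torch tensor or a numpy array
sf.write('full.wav', np.concatenate(chunks), 24000)  # one 24 kHz WAV for the entire input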

Under the hood, kokoro uses misaki, a G2P (grapheme-to-phoneme) library: https://github.com/hexgrad/misaki
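
If you only want the phoneme strings (for example, to inspect or tweak pronunciations like the [Kokoro](/kˈOkəɹO/) markup above), misaki can be called on its own. A minimal sketch, assuming the English frontend is exposed as misaki.en.G2P as shown in the misaki README; check that repository for the current API:

from misaki import en

g2p = en.G2P(trf=False, british=False, fallback=None)  # American English, no transformer, no fallback
phonemes, tokens = g2p('[Kokoro](/kˈOkəɹO/) is an open-weight TTS model.')
print(phonemes)  # the IPA-style phoneme string that the TTS model consumes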

Model Facts

Architecture: StyleTTS 2

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Trained by: @rzvzn on Discord

Languages: Multiple

Model SHA256 Hash: 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4
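
To confirm you have the v1.0 weights, hash the downloaded file and compare it against the value above. A minimal sketch; the weights filename kokoro-v1_0.pth is an assumption here, so verify the exact name in the hexgrad/Kokoro-82M repository:

import hashlib
from huggingface_hub import hf_hub_download

# Filename is assumed; check the model repo's file listing for the actual weights file.
path = hf_hub_download(repo_id='hexgrad/Kokoro-82M', filename='kokoro-v1_0.pth')
sha256 = hashlib.sha256()
with open(path, 'rb') as f:
    for block in iter(lambda: f.read(1 << 20), b''):  # hash in 1 MiB blocks
        sha256.update(block)
print(sha256.hexdigest())  # should print 496dba11...f18ad1e4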

Training Details

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:

  • Public domain audio
  • Audio licensed under Apache, MIT, etc.
  • Synthetic audio[1] generated by closed[2] TTS models from large providers
    [1] https://copyright.gov/ai/ai_policy_guidance.pdf
    [2] No synthetic audio from open TTS models or "custom voice clones"

Total Dataset Size: A few hundred hours of audio

Total Training Cost: About $1000 for 1000 A100 80GB GPU hours

Creative Commons Attribution

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration Used | License | Added to Training Set After |
|------------|---------------|---------|------------------------------|
| Koniwa tnc | <1h | CC BY 3.0 | v0.19 / 22 Nov 2024 |
| SIWIS | <11h | CC BY 4.0 | v0.19 / 22 Nov 2024 |

Acknowledgements

  • 🛠️ @yl4579 for architecting StyleTTS 2.
  • 🏆 @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
  • 📊 Thank you to everyone who contributed synthetic training data.
  • ❤️ Special thanks to all compute sponsors.
  • 👾 Discord server: https://discord.gg/QuGxSWBfQy
  • 🪽 Kokoro is a Japanese word that translates to "heart" or "spirit". Kokoro is also the name of an AI in the Terminator franchise.