GitHub: https://github.com/hexgrad/kokoro
Demo: https://hf.co/spaces/hexgrad/Kokoro-TTS
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
- Releases
- Usage
- EVAL.md
- SAMPLES.md
- VOICES.md
- Model Facts
- Training Details
- Creative Commons Attribution
- Acknowledgements
Releases
Model | Published | Training Data | Langs & Voices | SHA256 |
---|---|---|---|---|
v1.0 | 2025 Jan 27 | Few hundred hrs | 8 & 54 | 496dba11 |
v0.19 | 2024 Dec 25 | <100 hrs | 1 & 10 | 3b0c392f |
Training Costs | v0.19 | v1.0 | Total |
---|---|---|---|
in A100 80GB GPU hours | 500 | 500 | 1000 |
average hourly rate | $0.80/h | $1.20/h | $1/h |
in USD | $400 | $600 | $1000 |
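The arithmetic behind the Training Costs table can be checked with a short sketch. The figures are copied from the table; the dictionary layout and the `cost_usd` helper are illustrative names only, not part of any Kokoro tooling.

```python
# GPU hours and average hourly rates copied from the Training Costs table.
runs = {
    "v0.19": {"gpu_hours": 500, "usd_per_hour": 0.80},
    "v1.0": {"gpu_hours": 500, "usd_per_hour": 1.20},
}


def cost_usd(run):
    """Cost of one training run: GPU hours times the average hourly rate."""
    return run["gpu_hours"] * run["usd_per_hour"]


total_hours = sum(run["gpu_hours"] for run in runs.values())
total_cost = sum(cost_usd(run) for run in runs.values())
# The "Total" column's hourly rate is the cost-weighted average, not a list price.
blended_rate = total_cost / total_hours

for name, run in runs.items():
    print(f"{name}: ${cost_usd(run):.0f}")
print(f"total: ${total_cost:.0f} over {total_hours} GPU hours (${blended_rate:.2f}/h)")
```

Because both runs used the same number of GPU hours, the blended rate is simply the midpoint of the two hourly rates.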
Usage
You can run this basic cell on Google Colab. To hear samples, see SAMPLES.md. For more languages and details, see Advanced Usage.
!pip install -q "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
# lang_code='a' selects American English
pipeline = KPipeline(lang_code='a')
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')
# i: chunk index, gs: graphemes (text), ps: phonemes, audio: 24 kHz waveform
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)
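The loop above writes one WAV file per chunk. If a single file is preferred, the chunks can be joined before writing. The following is a minimal sketch, not part of the kokoro package: `write_joined_wav` is a hypothetical helper, it uses the standard-library `wave` module instead of `soundfile` to stay dependency-free, and it assumes each chunk is a sequence of floats in the range [-1.0, 1.0].

```python
import struct
import wave

SAMPLE_RATE = 24000  # same rate as the cell above


def write_joined_wav(path, chunks, rate=SAMPLE_RATE):
    """Concatenate float chunks into one 16-bit mono WAV file.

    Returns the number of samples written.
    """
    samples = [s for chunk in chunks for s in chunk]
    # Clip to [-1.0, 1.0], then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)
```

In the Colab cell, one option would be to collect `audio.tolist()` for each chunk into a list (assuming `audio` is a tensor or array that exposes `tolist()`) and pass that list to `write_joined_wav('joined.wav', chunks)`.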
Under the hood, kokoro uses misaki, a G2P library at https://github.com/hexgrad/misaki
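The sample text in the cell above uses inline markup of the form `[word](/phonemes/)` to pin the pronunciation of a word. As a rough illustration of that syntax only, the sketch below lists and strips such overrides with a regular expression. The pattern and both helper names are hypothetical; this is not misaki's parser and may not cover every case misaki accepts.

```python
import re

# Hypothetical pattern for the [word](/phonemes/) markup seen in the sample text.
OVERRIDE = re.compile(r"\[([^\]]+)\]\(/([^/]+)/\)")


def find_overrides(text):
    """Return (word, phonemes) pairs for each inline pronunciation override."""
    return OVERRIDE.findall(text)


def strip_overrides(text):
    """Replace each override with its plain word, leaving other text untouched."""
    return OVERRIDE.sub(r"\1", text)
```

This can be handy for previewing which words carry a custom pronunciation, or for producing a plain-text copy of the input for display.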
Model Facts
Architecture:
- StyleTTS 2: https://arxiv.org/abs/2306.07691
- ISTFTNet: https://arxiv.org/abs/2203.02395
- Decoder only: no diffusion, no encoder release
Architected by: Li et al @ https://github.com/yl4579/StyleTTS2
Trained by: @rzvzn on Discord
Languages: Multiple
Model SHA256 Hash: 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4
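A downloaded weights file can be checked against the hash above. This is a minimal sketch using Python's standard `hashlib`; the helper names are illustrative, and the path passed in is whatever location the weights were saved to (no specific file name is assumed here).

```python
import hashlib

# Full v1.0 hash from Model Facts; the Releases table shows its first 8 characters.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size blocks so large weights need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

A mismatch usually points to an incomplete download or a different model version, such as v0.19, whose hash begins with 3b0c392f.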
Training Details
Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc
- Synthetic audio[1] generated by closed[2] TTS models from large providers
[1] https://copyright.gov/ai/ai_policy_guidance.pdf
[2] No synthetic audio from open TTS models or "custom voice clones"
Total Dataset Size: A few hundred hours of audio
Total Training Cost: About $1000 for 1000 GPU hours on A100 80GB
Creative Commons Attribution
The following CC BY audio was part of the dataset used to train Kokoro v1.0.
Audio Data | Duration Used | License | Added to Training Set After |
---|---|---|---|
Koniwa tnc | <1h | CC BY 3.0 | v0.19 / 22 Nov 2024 |
SIWIS | <11h | CC BY 4.0 | v0.19 / 22 Nov 2024 |
Acknowledgements
- @yl4579 for architecting StyleTTS 2.
- @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
- Thank you to everyone who contributed synthetic training data.
- Special thanks to all compute sponsors.
- Discord server: https://discord.gg/QuGxSWBfQy
- Kokoro is a Japanese word that translates to "heart" or "spirit". Kokoro is also the name of an AI in the Terminator franchise.

Model tree for hexgrad/Kokoro-82M: base model yl4579/StyleTTS2-LJSpeech