On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2.0 license.

As of 28 Dec 2024, **8 unique Voicepacks have been released**: 2 female and 2 male voices each for American English and British English.

At the time of release, Kokoro v0.19 was the #1🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena). Kokoro achieved higher Elo in this single-voice Arena setting than other models, using fewer parameters and less data:

1. **Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio, for <20 epochs**
2. XTTS v2: 467M, CPML, >10k hours
3. Edge TTS: Microsoft, proprietary
4. MetaVoice: 1.2B, Apache, 100k hours
5. Parler Mini: 880M, Apache, 45k hours
6. Fish Speech: ~500M, CC-BY-NC-SA, 1M hours

Kokoro's ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.
You can find a hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
```python
from models import build_model
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of af_bella & af_sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
][0]
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')

# 3️⃣ Call generate, which returns a 24kHz audio waveform and a string of output phonemes
from kokoro import generate
text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb

# 4️⃣ Display the 24kHz audio and print the output phonemes
from IPython.display import display, Audio
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)
```
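The default `af` voice is described above as a 50-50 mix of `af_bella` and `af_sarah`. A hypothetical sketch of that kind of blend, an elementwise average of two voicepack tensors; the random tensors and their shape are stand-ins, since real voicepacks come from `torch.load('voices/af_bella.pt', weights_only=True)`:

```python
import torch

# Stand-in tensors for two voicepack embeddings (shape is illustrative only)
bella = torch.randn(511, 1, 256)
sarah = torch.randn(511, 1, 256)

# 50-50 mix: same shape as each source voicepack
af = 0.5 * bella + 0.5 * sarah
```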
The inference code was quickly hacked together on Christmas Day. It is not clean code and leaves a lot of room for improvement. If you'd like to contribute, feel free to open a PR.

### Model Facts

No affiliation can be assumed between parties on different lines.
**Trained by**: `@rzvzn` on Discord

**Supported Languages:** American English, British English

**Model SHA256 Hash:** `3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a`
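To check a downloaded checkpoint against the hash above, a minimal sketch using Python's standard `hashlib`; the file name is assumed to match the `kokoro-v0_19.pth` used in the usage example:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks, keeping memory flat for large checkpoints."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Usage: compare against the published digest
# assert sha256_of_file('kokoro-v0_19.pth') == '3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a'
```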
### Releases
- 25 Dec 2024: Model v0.19, `af_bella`, `af_sarah`
- 26 Dec 2024: `am_adam`, `am_michael`
- 28 Dec 2024: `bf_emma`, `bf_isabella`, `bm_george`, `bm_lewis`

### Licenses
- Apache 2.0 weights in this repository
- MIT inference code in [spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS) adapted from [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- GPLv3 dependency in [espeak-ng](https://github.com/espeak-ng/espeak-ng)
### Limitations

Kokoro v0.19 is limited in some specific ways due to its training set and/or architecture:
- [Data] Lacks voice cloning capability, likely due to small <100h training set
- [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
- [Data] Training dataset is mostly long-form reading and narration, not conversation
- [Arch] At 82M params, Kokoro almost certainly falls to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
- [Data] Multilingual capability is architecturally feasible, but training data is mostly English

Refer to the [Philosophy discussion](https://huggingface.co/hexgrad/Kokoro-82M/discussions/5) to better understand these limitations.

**Will the other voicepacks be released?** There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).