hexgrad committed b08fa22 (1 parent: d53ec79)

Upload README.md

Files changed (1): README.md (+31 -16)

README.md CHANGED
@@ -14,14 +14,17 @@ pipeline_tag: text-to-speech

  On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2.0 license.

- At the time of release, Kokoro v0.19 was the #1🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena). With 82M params trained for <20 epochs on <100 total hours of audio, Kokoro achieved higher Elo in this single-voice Arena setting over models such as:
- - XTTS v2: 467M, CPML, >10k hours
- - Edge TTS: Microsoft, proprietary
- - MetaVoice: 1.2B, Apache, 100k hours
- - Parler Mini: 880M, Apache, 45k hours
- - Fish Speech: ~500M, CC-BY-NC-SA, 1M hours
-
- Kokoro's ability to top this Elo ladder using relatively low compute and data suggests that the scaling law for traditional TTS models might have a steeper slope than previously expected.
+ As of 28 Dec 2024, **8 unique Voicepacks have been released**: 2 female and 2 male each for American and British English.
+
+ At the time of release, Kokoro v0.19 was the #1🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena). Kokoro achieved higher Elo in this single-voice Arena setting than other models, using fewer parameters and less data:
+ 1. **Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio for <20 epochs**
+ 2. XTTS v2: 467M, CPML, >10k hours
+ 3. Edge TTS: Microsoft, proprietary
+ 4. MetaVoice: 1.2B, Apache, 100k hours
+ 5. Parler Mini: 880M, Apache, 45k hours
+ 6. Fish Speech: ~500M, CC-BY-NC-SA, 1M hours
+
+ Kokoro's ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.

  You can find a hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).

@@ -40,21 +43,30 @@ from models import build_model
  import torch
  device = 'cuda' if torch.cuda.is_available() else 'cpu'
  MODEL = build_model('kokoro-v0_19.pth', device)
- VOICEPACK = torch.load('voices/af.pt', weights_only=True).to(device)
+ VOICE_NAME = [
+     'af', # Default voice is a 50-50 mix of af_bella & af_sarah
+     'af_bella', 'af_sarah', 'am_adam', 'am_michael',
+     'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
+ ][0]
+ VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
+ print(f'Loaded voice: {VOICE_NAME}')

  # 3️⃣ Call generate, which returns a 24khz audio waveform and a string of output phonemes
  from kokoro import generate
  text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
- audio, out_ps = generate(MODEL, text, VOICEPACK)
+ audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
+ # Language is determined by the first letter of the VOICE_NAME:
+ # 🇺🇸 'a' => American English => en-us
+ # 🇬🇧 'b' => British English => en-gb

  # 4️⃣ Display the 24khz audio and print the output phonemes
  from IPython.display import display, Audio
  display(Audio(data=audio, rate=24000, autoplay=True))
  print(out_ps)
  ```
- This inference code was quickly hacked together on Christmas Day. It is not clean code and leaves a lot of room for improvement. If you'd like to contribute, feel free to open a PR.
+ The inference code was quickly hacked together on Christmas Day. It is not clean code and leaves a lot of room for improvement. If you'd like to contribute, feel free to open a PR.

- ### Model Description
+ ### Model Facts

  No affiliation can be assumed between parties on different lines.

@@ -67,15 +79,16 @@ No affiliation can be assumed between parties on different lines.

  **Trained by**: `@rzvzn` on Discord

- **Supported Languages:** English
+ **Supported Languages:** American English, British English

  **Model SHA256 Hash:** `3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a`

- **Releases:**
+ ### Releases
  - 25 Dec 2024: Model v0.19, `af_bella`, `af_sarah`
  - 26 Dec 2024: `am_adam`, `am_michael`
+ - 28 Dec 2024: `bf_emma`, `bf_isabella`, `bm_george`, `bm_lewis`

- **Licenses:**
+ ### Licenses
  - Apache 2.0 weights in this repository
  - MIT inference code in [spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS) adapted from [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
  - GPLv3 dependency in [espeak-ng](https://github.com/espeak-ng/espeak-ng)
@@ -117,12 +130,14 @@ assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))

  ### Limitations

- Kokoro v0.19 is limited in some ways, in its training set and architecture:
+ Kokoro v0.19 is limited in some specific ways, due to its training set and/or architecture:
  - [Data] Lacks voice cloning capability, likely due to small <100h training set
  - [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
  - [Data] Training dataset is mostly long-form reading and narration, not conversation
  - [Arch] At 82M params, Kokoro almost certainly falls to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
- - [Data] Multilingual capability is architecturally feasible, but training data is almost entirely English
+ - [Data] Multilingual capability is architecturally feasible, but training data is mostly English
+
+ Refer to the [Philosophy discussion](https://huggingface.co/hexgrad/Kokoro-82M/discussions/5) to better understand these limitations.

  **Will the other voicepacks be released?** There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
 
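The usage snippet notes that the default `af` voice is a 50-50 mix of `af_bella` and `af_sarah`, and the last hunk's `@@` context quotes `assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))` from elsewhere in the README. A minimal sketch of rebuilding that default voicepack from the two released ones, assuming the mix is a plain elementwise average (the actual construction is not shown in this commit):

```python
import torch

# Load the two released American English female voicepacks.
bella = torch.load('voices/af_bella.pt', weights_only=True)
sarah = torch.load('voices/af_sarah.pt', weights_only=True)

# Assumption: the "50-50 mix" is a simple elementwise mean of the two tensors.
af = torch.mean(torch.stack([bella, sarah]), dim=0)

# If that assumption holds, the result matches the shipped default voicepack.
assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
```

If the equality check fails, the shipped `af.pt` may have been mixed with different weights or precision; treat this purely as an illustration of blending voicepack tensors.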
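The model card pins the checkpoint with a SHA256 hash. A short sketch for verifying a downloaded `kokoro-v0_19.pth` (file name taken from the usage snippet) against the published value:

```python
import hashlib

EXPECTED = '3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a'

# Hash the checkpoint in 1 MiB chunks so the whole file never has to sit in memory.
sha256 = hashlib.sha256()
with open('kokoro-v0_19.pth', 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        sha256.update(chunk)

assert sha256.hexdigest() == EXPECTED, 'Checkpoint hash mismatch; re-download the weights.'
print('SHA256 OK')
```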
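The usage snippet only plays the generated 24khz audio inline via `IPython.display`, which assumes a notebook. A sketch of writing the waveform to disk instead, assuming `audio` comes back as a NumPy array (as the `Audio(data=audio, rate=24000)` call implies) and that the `soundfile` package is installed; neither assumption is stated in this commit:

```python
import torch
import soundfile as sf
from models import build_model
from kokoro import generate

device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)
VOICEPACK = torch.load('voices/af.pt', weights_only=True).to(device)

# Any English text works here; this sentence is just an illustrative placeholder.
text = "Kokoro is an 82 million parameter text-to-speech model."
audio, out_ps = generate(MODEL, text, VOICEPACK, lang='a')  # 'a' => American English

# Save the 24khz waveform as a WAV file instead of playing it in a notebook.
sf.write('kokoro_output.wav', audio, 24000)
print(out_ps)
```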