Text-to-Speech · English

Korakoe committed
Commit 8e7b84a
1 parent: aa0f5e5

Update README.md

Files changed (1):
  1. README.md (+79 −18)

README.md CHANGED
@@ -1,5 +1,12 @@
 ---
 license: mit
 ---

 <style>
@@ -25,6 +32,16 @@ license: mit
   display: block;
 }

 </style>

 <hr>
@@ -38,29 +55,74 @@ license: mit

 <hr>

- Vokan features:

- - A diverse dataset for a more authentic zero-shot performance
- - Training on 6+ days worth of audio, with 672 diverse and expressive speakers
- - Training on 1x H100 for 300 hours and 1x 3090 for an additional 600 hours

- ### Audio Examples

- <audio controls> <source src="" type="audio/wav"> Your browser does not support the audio embed. </audio>

- ### Demo Spaces
- Coming soon...

- ## This model was made possible thanks to
- - [DagsHub](https://dagshub.com) who sponsored us with their GPU compute (with special thanks to Dean!)
- - And the assistance from [camenduru](https://github.com/camenduru) on cloud infrastructure and model training

 <hr>

- <a href="https://discord.gg/5bq9HqVhsJ"><img src="https://img.shields.io/badge/find_us_at_the-ShoukanLabs_Discord-invite?style=flat-square&logo=discord&logoColor=%23ffffff&labelColor=%235865F2&color=%23ffffff" width="320" alt="discord"></a>
- <!--<a align="left" style="font-size: 1.3rem; font-weight: bold; color: #5662f6;" href="https://discord.gg/5bq9HqVhsJ">find us on Discord</a>-->

- ## Citations

 ```citations
 @misc{li2023styletts,
@@ -87,9 +149,8 @@ The Centre for Speech Technology Research (CSTR),
 University of Edinburgh
 ```

- ## License

 ```
 MIT
- ```
-
- Stay tuned for Vokan V2!
 
 ---
 license: mit
+ datasets:
+ - ShoukanLabs/AniSpeech
+ - vctk
+ - blabble-io/libritts_r
+ language:
+ - en
+ pipeline_tag: text-to-speech
 ---

 <style>
 
   display: block;
 }

+ audio {
+   margin: 0.5rem;
+ }
+
+ .audio-container {
+   display: flex;
+   justify-content: center;
+   align-items: center;
+ }
+
 </style>

 <hr>

 <hr>

+ <a href="https://discord.gg/5bq9HqVhsJ"><img src="https://img.shields.io/badge/find_us_at_the-ShoukanLabs_Discord-invite?style=flat-square&logo=discord&logoColor=%23ffffff&labelColor=%235865F2&color=%23ffffff" width="320" alt="discord"></a>
+ <!--<a align="left" style="font-size: 1.3rem; font-weight: bold; color: #5662f6;" href="https://discord.gg/5bq9HqVhsJ">find us on Discord</a>-->
+
+ **Vokan** is an advanced finetuned **StyleTTS2** model crafted for authentic, expressive zero-shot performance, designed to serve as a better base model for further finetuning.
+ It leverages a diverse dataset and extensive training to generate high-quality synthesized speech.
+ Trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, Vokan stays authentic and natural across a wide variety of accents and contexts.
+ With over 6 days' worth of audio from 672 diverse and expressive speakers, Vokan captures a wide range of vocal characteristics, contributing to its remarkable performance.
+ Although it was trained on less data than the original model, the broad array of accents and speakers enriches the model's vector space.
+ Training required significant computational resources: 300 hours on 1x H100 and an additional 600 hours on 1x 3090.
+
+ You can read more in our article on [DagsHub](https://dagshub.com/blog/styletts2/)!
+
+ <hr>
+ <p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Vokan Samples!</p>
+ <div class='audio-container'>
+   <div>
+     <audio controls>
+       <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%201.wav" type="audio/wav">
+       Your browser does not support the audio element.
+     </audio>
+   </div>
+   <div>
+     <audio controls>
+       <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%202.wav" type="audio/wav">
+       Your browser does not support the audio element.
+     </audio>
+   </div>
+ </div>
+ <div class='audio-container'>
+   <div>
+     <audio controls>
+       <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%203.wav" type="audio/wav">
+       Your browser does not support the audio element.
+     </audio>
+   </div>
+   <div>
+     <audio controls>
+       <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%204.wav" type="audio/wav">
+       Your browser does not support the audio element.
+     </audio>
+   </div>
+ </div>
+ <hr>

+ <p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Acknowledgements</p>

+ - **[DagsHub](https://dagshub.com):** Special thanks to DagsHub for sponsoring GPU compute and providing an excellent versioning service, enabling efficient model training and development. A shoutout to Dean in particular!
+ - **[camenduru](https://github.com/camenduru):** Thanks to camenduru for their expertise in cloud infrastructure and model training, which played a crucial role in the development of Vokan. Please give them a follow!

+ <p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Conclusion</p>

+ V2 is currently in the works, aiming to be bigger and better in every way, including multilingual support!
+ This is where you come in: if you have any large single-speaker datasets you'd like to contribute, in any language, you can add them to our **Vokan dataset**, a large **community dataset** that combines many smaller single-speaker datasets into one big multi-speaker one.
+ You can upload your Uberduck- or [FakeYou](https://fakeyou.com/)-compliant datasets via the **[Vokan](https://huggingface.co/ShoukanLabs/Vokan)** bot on the **[ShoukanLabs Discord Server](https://discord.gg/hdVeretude)**.
+ The more data we have, the better the models we produce will be!
 <hr>

+ <p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Citations</p>

 ```citations
 @misc{li2023styletts,

 University of Edinburgh
 ```

+ <p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">License</p>
+
 ```
 MIT
+ ```