---
license: mit
datasets:
- ShoukanLabs/AniSpeech
- vctk
- blabble-io/libritts_r
language:
- en
pipeline_tag: text-to-speech
base_model: yl4579/StyleTTS2-LibriTTS
---
<style>
.TitleContainer {
background-color: #ffff;
margin-bottom: 0rem;
margin-left: auto;
margin-right: auto;
width: 40%;
height: 30%;
border-radius: 10rem;
border: 0.5vw solid #ff593e;
transition: .6s;
}
.TitleContainer:hover {
transform: scale(1.05);
}
.VokanLogo {
margin: auto;
display: block;
}
audio {
margin: 0.5rem;
}
.audio-container {
display: flex;
justify-content: center;
align-items: center;
}
</style>
<hr>
<div class="TitleContainer" align="center">
<!--<img src="https://huggingface.co/ShoukanLabs/Vokan/resolve/main/Vokan.gif" class="VokanLogo">-->
<img src="Vokan.gif" class="VokanLogo">
</div>
<p align="center" style="font-size: 1vw; font-weight: bold; color: #ff593e;">A StyleTTS2 fine-tune, designed for expressiveness.</p>
<hr>
<div class='audio-container'>
<a align="center" href="https://discord.gg/5bq9HqVhsJ"><img src="https://img.shields.io/badge/find_us_at_the-ShoukanLabs_Discord-invite?style=flat-square&logo=discord&logoColor=%23ffffff&labelColor=%235865F2&color=%23ffffff" width="320" alt="discord"></a>
<!--<a align="left" style="font-size: 1.3rem; font-weight: bold; color: #5662f6;" href="https://discord.gg/5bq9HqVhsJ">find us on Discord</a>-->
</div>
**Vokan** is a fine-tuned **StyleTTS2** model built for authentic, expressive zero-shot speech synthesis. It is designed to serve as a better
base model for further fine-tuning in the future!
It draws on a diverse dataset and extensive training to generate high-quality synthesized speech.
Trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, Vokan aims for authenticity and naturalness across a variety of accents and contexts.
With more than 6 days of audio data and 672 diverse, expressive speakers,
Vokan captures a wide range of vocal characteristics, which contributes to its performance.
Although it was trained on less data than the original StyleTTS2 model, the broad array of accents and speakers enriches the model's vector space.
Training Vokan required significant computational resources: 300 hours on 1x H100 and an additional 600 hours on 1x 3090.
You can read more about it in our article on [DagsHub!](https://dagshub.com/blog/styletts2/)
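
To put the figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in this card; treating "6 days" as a lower bound on the audio, and assuming the two training runs happened one after the other, are our own simplifications rather than documented facts.

```python
# Figures quoted in this model card.
H100_HOURS = 300       # training hours on 1x H100
RTX3090_HOURS = 600    # additional training hours on 1x 3090
AUDIO_DAYS = 6         # the card says "more than 6 days", so this is a lower bound
NUM_SPEAKERS = 672


def training_wall_clock_days(h100_hours, rtx3090_hours):
    """Wall-clock training time in days.

    Assumption (not stated in the card): the two runs were sequential,
    each on a single GPU.
    """
    return (h100_hours + rtx3090_hours) / 24


def avg_minutes_per_speaker(audio_days, num_speakers):
    """Average audio per speaker in minutes, assuming an even split."""
    return audio_days * 24 * 60 / num_speakers


# Note: H100 hours and 3090 hours are not equivalent units of compute;
# the sum below is only a count of single-GPU hours.
total_gpu_hours = H100_HOURS + RTX3090_HOURS

print(f"Total single-GPU hours: {total_gpu_hours}")
print(f"Wall-clock days (sequential): {training_wall_clock_days(H100_HOURS, RTX3090_HOURS):.1f}")
print(f"Average audio per speaker: at least {avg_minutes_per_speaker(AUDIO_DAYS, NUM_SPEAKERS):.1f} min")
```

In practice the audio is unlikely to be spread evenly across speakers, so the per-speaker figure is only an average.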
<hr>
<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Vokan Samples!</p>
<div class='audio-container'>
<div>
<audio controls>
<source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%201.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</div>
<div>
<audio controls>
<source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%202.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</div>
</div>
<div class='audio-container'>
<div>
<audio controls>
<source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%203.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</div>
<div>
<audio controls>
<source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%204.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</div>
</div>
<hr>
<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Acknowledgements!</p>
- **[DagsHub](https://dagshub.com):** Special thanks to DagsHub for sponsoring GPU compute resources as well as offering an amazing versioning service, enabling efficient model training and development. A shoutout to Dean in particular!
- **[camenduru](https://github.com/camenduru):** Thanks to camenduru for their expertise in cloud infrastructure and model training, which played a crucial role in the development of Vokan! Please give them a follow!
<hr>
<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Conclusion!</p>
V2 is currently in the works, aiming to be bigger and better in every way, including multilingual support!
This is where you come in: if you have any large single-speaker datasets you'd like to contribute,
in any language, you can add them to our **Vokan dataset**, a large **community dataset** that combines many
smaller single-speaker datasets into one big multi-speaker one.
You can upload your Uberduck- or FakeYou-compliant datasets via the
**[Vokan](https://huggingface.co/ShoukanLabs/Vokan)** bot on the **[ShoukanLabs Discord Server](https://discord.gg/hdVeretude)**.
The more data we have, the better the models we produce will be!
[This model is also available on DagsHub](https://dagshub.com/ShoukanLabs/Vokan)
<hr>
<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Citations!</p>
```citations
@misc{li2023styletts,
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
year={2023},
eprint={2306.07691},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
@misc{zen2019libritts,
title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
year={2019},
eprint={1904.02882},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
The Centre for Speech Technology Research (CSTR),
University of Edinburgh
```
<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">License!</p>
```
MIT
```