File size: 7,971 Bytes
203c725 02b66d6 203c725 fcb6d91 7524a65 fcb6d91 f355716 203c725 fd4e335 7a3a819 32b4cad 203c725 489f844 8610166 203c725 1432f65 203c725 fcb6d91 203c725 5729d01 203c725 ed14914 203c725 1432f65 203c725 1432f65 203c725 1432f65 203c725 2c67092 c36736b 203c725 ea2e8df 5729d01 1432f65 203c725 b3d8d82 203c725 0ab1623 203c725 ff72b77 25b9f5b ff72b77 203c725 529ca87 203c725 529ca87 203c725 529ca87 96d0e4b 529ca87 203c725 ff72b77 203c725 f0c3744 203c725 fc877d8 203c725 fc877d8 3c4fc5d 203c725 c23b7b2 203c725 28fe4a5 203c725 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 |
---
thumbnail: https://i.postimg.cc/y6gT18Tn/Untitled-design-1.png
license: cc-by-nc-4.0
language:
- ja
pipeline_tag: text-to-speech
tags:
- '#StyleTTS'
- '#Japanese'
- Diffusion
- Prompt
- '#TTS'
- '#TexttoSpeech'
- '#speech'
- '#StyleTTS2'
---
<div style="text-align:center;">
<img src="https://i.postimg.cc/y6gT18Tn/Untitled-design-1.png" alt="Logo" style="width:300px; height:auto;">
</div>
# Tsukasa 司 Speech: Engineering the Naturalness and Rich Expressiveness
**tl;dr** : I made a very cool japanese speech generation model.
日本語のモデルカードは[こちら](https://huggingface.co/Respair/Tsukasa_Speech/blob/main/README_JP.md)。
Part of a [personal project](https://github.com/Respaired/Project-Kanade), focusing on further advancing Japanese speech field.
- Use the HuggingFace Space for **Tsukasa** (24khz): [![huggingface](https://img.shields.io/badge/Interactive_Demo-HuggingFace-yellow)](https://huggingface.co/spaces/Respair/Tsukasa_Speech)
- HuggingFace Space for **Tsumugi** (48khz): [![huggingface](https://img.shields.io/badge/Interactive_Demo-HuggingFace-yellow)](https://huggingface.co/spaces/Respair/Tsumugi_48khz)
- Join Shoukan lab's discord server, a comfy place I frequently visit -> [![Discord](https://img.shields.io/discord/1197679063150637117?logo=discord&logoColor=white&label=Join%20our%20Community)](https://discord.gg/JrPSzdcM)
Github's repo:
[![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/Respaired/Tsukasa-Speech)
## What is this?
*Note*: This model only supports the Japanese language; but you can feed it Romaji if you use the Gradio demo.
This is a speech generation network, aimed at maximizing the expressiveness and Controllability of the generated speech. at its core it uses [StyleTTS 2](https://github.com/yl4579/StyleTTS2)'s architecture with the following changes:
- Incorporating mLSTM Layers instead of regular PyTorch LSTM layers, and increasing the capacity of the text and prosody encoder by using a higher number of parameters
- Retrained PL-Bert, Pitch Extractor, Text Aligner from scratch
- Whisper's Encoder instead of WavLM for the SLM
- 48khz Config
- improved Performance on non-verbal sounds and cues. such as sigh, pauses, etc. and also very slightly on laughter (depends on the speaker)
- a new way of sampling the Style Vectors.
- Promptable Speech Synthesizing.
- a Smart Phonemization algorithm that can handle Romaji inputs or a mixture of Japanese and Romaji.
- Fixed DDP and BF16 Training (mostly!)
There are two checkpoints you can use. Tsukasa & Tsumugi 48khz (placeholder).
Tsukasa was trained on ~800 hours of studio grade, high quality data. sourced mainly from games and novels, part of it from a private dataset.
So the Japanese is going to be the "anime japanese" (it's different than what people usually speak in real-life.)
For Tsumugi (placeholder) a subset of this data was used with a 48khz config; at around ~300 hours but in a more controlled manner with additional manual cleaning & annotations.
**Unfortuantely Tsumugi (48khz)'s context length is capped and that means the model will not have enough information to handle the intonations as good as Tsukasa.
it also only supports the first mode of Kotodama's inference, which means no voice design.**
Brought to you by:
- Soshyant (me)
- [Auto Meta](https://github.com/Alignment-Lab-AI)
- [Cryptowooser](https://github.com/cryptowooser)
- [Buttercream](https://github.com/korakoe)
Special thanks to Yinghao Aaron Li, the Author of StyleTTS which this work is based on top of that. <br> He is one of the most talented Engineers I've ever seen in this field.
Also Karesto and Raven for their help in debugging some of the scripts. wonderful people.
___________________________________________________________________________________
## Why does it matter?
Recently, there's a big trend towards larger models, increasing the scale. We're going the opposite way, trying to see how far we can push the limits by utilizing existing tools.
Maybe, just maybe, scale is not necessarily the answer.
There's also a few things that's related to Japanese (but can have a wider impact on languages that face a similar issue like Arabic). such as how we can improve the intonations for this language. what can be done to accurately annotate a text that can have various spellings depending on the context, etc.
## How to do ...
## Pre-requisites
1. Python >= 3.11
2. Clone this repository:
```bash
git clone https://huggingface.co/Respair/Tsukasa_Speech
cd Tsukasa_Speech
```
3. Install python requirements:
```bash
pip install -r requirements.txt
```
# Inference:
Gradio demo:
```bash
python app_tsuka.py
```
or check the inference notebook. before that, make sure you read the **Important Notes** section down below.
# Training:
**First stage training**:
```bash
accelerate launch train_first.py --config_path ./Configs/config.yml
```
**Second stage training**:
```bash
accelerate launch accelerate_train_second.py --config_path ./Configs/config.yml
```
SLM Joint-Training doesn't work on multigpu. (you don't need it, i didn't use it too.)
or:
```bash
launch train_first.py --config_path ./Configs/config.yml
```
**Third stage training** (Kotodama, prompt encoding, etc.):
```
not planned right now, due to some constraints, but feel free to replicate.
```
## some ideas for future
I can think of a few things that can be improved, not nessarily by me, treat it as some sorts of suggestions:
- [o] changing the decoder ([fregrad](https://github.com/kaistmm/fregrad) looks promising)
- [o] retraining the Pitch Extractor using a different algorithm
- [o] while the quality of non-speech sounds have been improved, it cannot generate an entirely non-speech output, perhaps because of the hard alignement.
- [o] using the Style encoder as another modality in LLMs, since they have a detailed representation of the tone and expression of a speech (similar to Style-Talker).
## Pre-requisites
1. Python >= 3.11
2. Clone this repository:
```bash
git clone https://huggingface.co/Respair/Tsukasa_Speech
cd Tsukasa_Speech
```
3. Install python requirements:
```bash
pip install -r requirements.txt
```
## Training details
- 8x A40s + 2x V100s(32gb each)
- 750 ~ 800 hours of data
- Bfloat16
- Approximately 3 weeks of training, overall 3 months including the work spent on the data pipeline.
- Roughly 66.6 kg of CO2eq. of Carbon emitted if we base it on Google Cloud. (I didn't use Google, but the cluster is located in US, please treat it as a very rough approximation.)
### Important Notes
Check [here](not)
Any questions?
```email
saoshiant@protonmail.com
```
or simply DM me on discord.
## Some cool projects:
[Kokoro]("https://huggingface.co/spaces/hexgrad/Kokoro-TTS") - a very nice and light weight TTS, based on StyleTTS. supports Japanese and English.<br>
[VoPho]("https://github.com/ShoukanLabs/VoPho") - a meta phonemizer to rule them all. it will automatically handle any languages with hand-picked high quality phonemizers.
## References
- [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- [NX-AI/xlstm](https://github.com/NX-AI/xlstm)
- [archinetai/audio-diffusion-pytorch](https://github.com/archinetai/audio-diffusion-pytorch)
- [jik876/hifi-gan](https://github.com/jik876/hifi-gan)
- [rishikksh20/iSTFTNet-pytorch](https://github.com/rishikksh20/iSTFTNet-pytorch)
- [nii-yamagishilab/project-NN-Pytorch-scripts/project/01-nsf](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf)
```
@article{xlstm,
title={xLSTM: Extended Long Short-Term Memory},
author={Beck, Maximilian and P{\"o}ppel, Korbinian and Spanring, Markus and Auer, Andreas and Prudnikova, Oleksandra and Kopp, Michael and Klambauer, G{\"u}nter and Brandstetter, Johannes and Hochreiter, Sepp},
journal={arXiv preprint arXiv:2405.04517},
year={2024}
}
``` |