jbetker committed
Commit 52808b3 · 1 Parent(s): ce42023

Re-push space files

Files changed (2)
  1. README.md +14 -260
  2. app.py +7 -0
README.md CHANGED
@@ -1,260 +1,14 @@
- # TorToiSe
-
- Tortoise is a text-to-speech program built with the following priorities:
-
- 1. Strong multi-voice capabilities.
- 2. Highly realistic prosody and intonation.
-
- This repo contains all the code needed to run Tortoise TTS in inference mode.
-
- ### New features
-
- #### v2.1; 2022/5/2
- - Added ability to produce totally random voices.
- - Added ability to download voice conditioning latent via a script, and then use a user-provided conditioning latent.
- - Added ability to use your own pretrained models.
- - Refactored directory structures.
- - Performance improvements & bug fixes.
-
- ## What's in a name?
-
- I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model
- is insanely slow. It leverages both an autoregressive decoder **and** a diffusion decoder, both known for their low
- sampling rates. On a K80, expect to generate a medium sized sentence every 2 minutes.
-
- ## Demos
-
- See [this page](http://nonint.com/static/tortoise_v2_examples.html) for a large list of example outputs.
-
- ## Usage guide
-
- ### Colab
-
- Colab is the easiest way to try this out. I've put together a notebook you can use here:
- https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing
-
- ### Installation
-
- If you want to use this on your own computer, you must have an NVIDIA GPU. First, install pytorch using these
- instructions: [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)
-
- Then:
-
- ```shell
- git clone https://github.com/neonbjb/tortoise-tts.git
- cd tortoise-tts
- python setup.py install
- ```
-
- ### do_tts.py
-
- This script allows you to speak a single phrase with one or more voices.
- ```shell
- python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
- ```
-
- ### read.py
-
- This script provides tools for reading large amounts of text.
-
- ```shell
- python tortoise/read.py --textfile <your text to be read> --voice random
- ```
-
- This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series
- of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and
- output that as well.
-
- Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running `read.py` with the --regenerate
- argument.
-
- ### API
-
- Tortoise can be used programmatically, like so:
-
- ```python
- reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
- tts = api.TextToSpeech()
- pcm_audio = tts.tts_with_preset("your text here", reference_clips, preset='fast')
- ```
-
- ## Voice customization guide
-
- Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.
-
- These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.
-
- ### Random voice
-
- I've included a feature which randomly generates a voice. These voices don't actually exist and will be random every time you run
- it. The results are quite fascinating and I recommend you play around with it!
-
- You can use the random voice by passing in 'random' as the voice name. Tortoise will take care of the rest.
-
- For those in the ML space: this is created by projecting a random vector onto the voice conditioning latent space.
-
- ### Provided voices
-
- This repo comes with several pre-packaged voices. You will be familiar with many of them. :)
-
- Most of the provided voices were not found in the training set. Experimentally, it seems that voices from the training set
- produce more realistic outputs than those outside of the training set. Any voice prepended with "train" came from the
- training set.
-
- ### Adding a new voice
-
- To add new voices to Tortoise, you will need to do the following:
-
- 1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
- 2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
- 3. Save the clips as WAV files with floating point format and a 22,050 Hz sample rate.
- 4. Create a subdirectory in voices/
- 5. Put your clips in that subdirectory.
- 6. Run tortoise utilities with --voice=<your_subdirectory_name>.
-
- ### Picking good reference clips
-
- As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking
- good clips:
-
- 1. Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
- 2. Avoid speeches. These generally have distortion caused by the amplification system.
- 3. Avoid clips from phone calls.
- 4. Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
- 5. Try to find clips that are spoken the way you want your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
- 6. The text being spoken in the clips does not matter, but diverse text does seem to perform better.
-
- ## Advanced Usage
-
- ### Generation settings
-
- Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs
- that can be turned that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using
- various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've
- set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with
- these settings (and it's very likely that I missed something!)
-
- These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See
- ```api.tts``` for a full list.
-
- ### Prompt engineering
-
- Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion
- by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to
- take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the
- prompt "\[I am really sad,\] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
-
- ### Playing with the voice latent
-
- Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent,
- then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents
- are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.
-
- This lends itself to some neat tricks. For example, you can feed two different voices to tortoise and it will output
- what it thinks the "average" of those two voices sounds like.
-
- #### Generating conditioning latents from voices
-
- Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script
- will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).
-
- Alternatively, use the api.TextToSpeech.get_conditioning_latents() to fetch the latents.
-
- #### Using raw conditioning latents to generate speech
-
- After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single
- ".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).
-
- ### Send me feedback!
-
- Probabilistic models like Tortoise are best thought of as an "augmented search" - in this case, through the space of possible
- utterances of a specific string of text. The impact of community involvement in perusing these spaces (such as is being done with
- GPT-3 or CLIP) has really surprised me. If you find something neat that you can do with Tortoise that isn't documented here,
- please report it to me! I would be glad to publish it to this page.
-
- ## Tortoise-detect
-
- Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip
- came from Tortoise.
-
- This classifier can be run on any computer. Usage is as follows:
-
- ```commandline
- python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
- ```
-
- This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier
- as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false
- positives.
-
- ## Model architecture
-
- Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate
- models that work together. I've assembled a write-up of the system architecture here:
- [https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/)
-
- ## Training
-
- These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of
- ~50k hours of speech data, most of which was transcribed by [ocotillo](http://www.github.com/neonbjb/ocotillo). Training was done on my own
- [DLAS](https://github.com/neonbjb/DL-Art-School) trainer.
-
- I currently do not have plans to release the training configurations or methodology. See the next section.
-
- ## Ethical Considerations
-
- Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began
- wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system
- could be misused are many. It doesn't take much creativity to think up how.
-
- After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:
-
- 1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
- 2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
- 3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
- 4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See `tortoise-detect` above.
- 5. If I, a tinkerer with a BS in computer science with a ~$15k computer can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.
-
- ### Diversity
-
- The diversity expressed by ML models is strongly tied to the datasets they were trained on.
-
- Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to
- balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities
- or of people who speak with strong accents.
-
- ## Looking forward
-
- Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when
- training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training
- of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with
- exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.
-
- I want to mention here
- that I think Tortoise could be a **lot** better. The three major components of Tortoise are either vanilla Transformer Encoder stacks
- or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason
- to believe that the same is not true of TTS.
-
- The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller than the original DALLE transformer.
- Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve.
-
- If you are an ethical organization with computational resources to spare interested in seeing what this model could do
- if properly scaled out, please reach out to me! I would love to collaborate on this.
-
- ## Acknowledgements
-
- This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to
- credit a few of the amazing folks in the community that have helped make this happen:
-
- - Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
- - [Ramesh et al](https://arxiv.org/pdf/2102.12092.pdf) who authored the DALLE paper, which is the inspiration behind Tortoise.
- - [Nichol and Dhariwal](https://arxiv.org/pdf/2102.09672.pdf) who authored (the revision of) the code that drives the diffusion model.
- - [Jang et al](https://arxiv.org/pdf/2106.07889.pdf) who developed and open-sourced univnet, the vocoder this repo uses.
- - [lucidrains](https://github.com/lucidrains) who writes awesome open source pytorch models, many of which are used here.
- - [Patrick von Platen](https://huggingface.co/patrickvonplaten) whose guides on setting up wav2vec were invaluable to building my dataset.
-
- ## Notice
-
- Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.
-
- If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.
 
+ ---
+ title: TorToiSe
+ emoji: 🐢
+ colorFrom: yellow
+ colorTo: green
+ sdk: gradio
+ sdk_version: 2.9.4
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ models: jbetker/tortoise-tts-v2
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
app.py ADDED
@@ -0,0 +1,7 @@
+ import gradio as gr
+
+ def greet(name):
+     return "Hello " + name + "!!"
+
+ iface = gr.Interface(fn=greet, inputs="text", outputs="text")
+ iface.launch()
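
The added `app.py` is the stock Gradio placeholder and does not call Tortoise yet. As a minimal sketch of the same app, assuming only the `gradio` package named by `sdk: gradio` in the front matter, the import and `launch()` call can be moved into a function so that `greet` can be imported and checked without starting a server. The `main` wrapper is a hypothetical name used for illustration and is not part of this commit.

```python
def greet(name: str) -> str:
    """Return the greeting that the Space shows in its output box."""
    return "Hello " + name + "!!"


def main() -> None:
    """Build and launch the interface (hypothetical wrapper, not in the commit)."""
    # Imported lazily so greet() stays usable where gradio is not installed.
    import gradio as gr

    # Same interface as the committed app.py: one text input, one text output.
    iface = gr.Interface(fn=greet, inputs="text", outputs="text")
    iface.launch()
```

On a Space, `app.py` is executed as a script, so a version like this would still need a `main()` call at the bottom of the file; it is left out here so the snippet can be imported without side effects.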