CodingBillionaire committed on
Commit
c5df5e1
·
1 Parent(s): ee04bc2

Update README.md

Files changed (1)
  1. README.md +8 -157
README.md CHANGED
@@ -1,160 +1,11 @@
- # TorToiSe

- Tortoise is a text-to-speech program built with the following priorities:
-
- 1. Strong multi-voice capabilities.
- 2. Highly realistic prosody and intonation.
-
- This repo contains all the code needed to run Tortoise TTS in inference mode.
-
- ### Colab
-
- Colab is the easiest way to try this out. I've put together a notebook you can use here:
- https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing
-
- ### Local Installation
-
- If you want to use this on your own computer, you must have an NVIDIA GPU.
-
- First, install PyTorch using these instructions: [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/).
- On Windows, I **highly** recommend using the Conda installation path. I have been told that if you do not do this, you
- will spend a lot of time chasing dependency problems.
-
- Next, install TorToiSe and its dependencies:
-
- ```shell
- git clone https://github.com/neonbjb/tortoise-tts.git
- cd tortoise-tts
- python setup.py install
- ```
-
- If you are on Windows, you will also need to install pysoundfile: `conda install -c conda-forge pysoundfile`
-
- ### do_tts.py
-
- This script allows you to speak a single phrase with one or more voices.
- ```shell
- python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
- ```
-
- ### read.py
-
- This script provides tools for reading large amounts of text.
-
- ```shell
- python tortoise/read.py --textfile <your text to be read> --voice random
- ```
-
- This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series
- of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and
- output that as well.
-
- Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running `read.py` with the `--regenerate`
- argument.
-
- ### API
-
- Tortoise can be used programmatically, like so:
-
- ```python
- reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
- tts = api.TextToSpeech()
- pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
- ```
-
- ## Voice customization guide
-
- Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.
-
- These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.
-
- ### Random voice
-
- I've included a feature which randomly generates a voice. These voices don't actually exist and will be random every time you run
- it. The results are quite fascinating and I recommend you play around with it!
-
- You can use the random voice by passing in 'random' as the voice name. Tortoise will take care of the rest.
-
- For those in the ML space: this is created by projecting a random vector onto the voice conditioning latent space.
-
- ### Provided voices
-
- This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform
- far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see
- what Tortoise can do for zero-shot mimicking, take a look at the others.
-
- ### Adding a new voice
-
- To add new voices to Tortoise, you will need to do the following:
-
- 1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
- 2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
- 3. Save the clips as WAV files with floating point format and a 22,050 Hz sample rate.
- 4. Create a subdirectory in voices/.
- 5. Put your clips in that subdirectory.
- 6. Run tortoise utilities with `--voice=<your_subdirectory_name>`.
-
- ### Picking good reference clips
-
- As mentioned above, your reference clips have a profound impact on the output of Tortoise. The following are some tips for picking
- good clips:
-
- 1. Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
- 2. Avoid speeches. These generally have distortion caused by the amplification system.
- 3. Avoid clips from phone calls.
- 4. Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
- 5. Try to find clips that are spoken the way you want your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
- 6. The text being spoken in the clips does not matter, but diverse text does seem to perform better.
-
- ## Advanced Usage
-
- ### Generation settings
-
- Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs
- that can be turned that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using
- various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've
- set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with
- these settings (and it's very likely that I missed something!).
-
- These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See
- `api.tts` for a full list.
-
- ### Prompt engineering
-
- Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion
- by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to
- take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the
- prompt "\[I am really sad,\] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
-
- ### Playing with the voice latent
-
- Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent,
- then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents
- are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.
-
- This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output
- what it thinks the "average" of those two voices sounds like.
-
- #### Generating conditioning latents from voices
-
- Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script
- will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).
-
- Alternatively, use `api.TextToSpeech.get_conditioning_latents()` to fetch the latents.
-
- #### Using raw conditioning latents to generate speech
-
- After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single
- ".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).
-
- ## Tortoise-detect
-
- Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip
- came from Tortoise.
-
- This classifier can be run on any computer. Usage is as follows:
-
- ```commandline
  python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
  ```
 
@@ -166,4 +17,4 @@ positives.
 
  Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate
  models that work together:
- [https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/)
 
+ ---
+ title: 'Tortoise TTS And Voice Cloning '
+ sdk: gradio
+ emoji: 📊
+ colorFrom: green
+ colorTo: red
+ ---
 
  python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
  ```
 
 
 
  Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate
  models that work together:
+ [https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/)