jbetker committed
Commit 1cdadeb
2 Parent(s): 7c18fdf ee83259

Merge remote-tracking branch 'github/main'


# Conflicts:
# .gitignore
# README.md
# requirements.txt
# tortoise_tts.ipynb
# tortoise_v2_examples.html

README.md CHANGED
@@ -7,6 +7,15 @@ Tortoise is a text-to-speech program built with the following priorities:
 
 This repo contains all the code needed to run Tortoise TTS in inference mode.
 
+### New features
+
+#### v2.1; 2022/5/2
+- Added ability to produce totally random voices.
+- Added ability to download voice conditioning latent via a script, and then use a user-provided conditioning latent.
+- Added ability to use your own pretrained models.
+- Refactored directory structures.
+- Performance improvements & bug fixes.
+
 ## What's in a name?
 
 I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model
@@ -26,19 +35,22 @@ https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sh
 
 ### Installation
 
-If you want to use this on your own computer, you must have an NVIDIA GPU. Installation:
+If you want to use this on your own computer, you must have an NVIDIA GPU. First, install pytorch using these
+instructions: [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)
+
+Then:
 
 ```shell
 git clone https://github.com/neonbjb/tortoise-tts.git
 cd tortoise-tts
-pip install -r requirements.txt
+python setup.py install
 ```
 
 ### do_tts.py
 
 This script allows you to speak a single phrase with one or more voices.
 ```shell
-python do_tts.py --text "I'm going to speak this" --voice dotrice --preset fast
+python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
 ```
 
 ### read.py
@@ -46,7 +58,7 @@ python do_tts.py --text "I'm going to speak this" --voice dotrice --preset fast
 
 This script provides tools for reading large amounts of text.
 
 ```shell
-python read.py --textfile <your text to be read> --voice dotrice
+python tortoise/read.py --textfile <your text to be read> --voice random
 ```
 
 This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series
@@ -72,6 +84,15 @@ Tortoise was specifically trained to be a multi-speaker model. It accomplishes t
 
 These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.
 
+### Random voice
+
+I've included a feature which randomly generates a voice. These voices don't actually exist and will be random every time you run
+it. The results are quite fascinating and I recommend you play around with it!
+
+You can use the random voice by passing in 'random' as the voice name. Tortoise will take care of the rest.
+
+For those in the ML space: this is created by projecting a random vector onto the voice conditioning latent space.
+
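As a minimal sketch of roughly what `--voice random` does under the hood, assuming the `TextToSpeech` API added in `tortoise/api.py` later in this diff (the output filename and spoken text here are illustrative):

```python
# Sketch: random-voice generation via the Python API. The random latent
# generators (RLGs) project a random vector into the conditioning latent space,
# much as passing --voice random does.
import torchaudio
from tortoise.api import TextToSpeech

tts = TextToSpeech()  # downloads model weights into .models/ on first use
latents = tts.get_random_conditioning_latents()  # (autoregressive, diffusion) tuple
gen = tts.tts_with_preset("This voice does not exist.", conditioning_latents=latents, preset='fast')
torchaudio.save('random_voice.wav', gen.squeeze(0).cpu(), 24000)  # output is 24kHz
```
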
 ### Provided voices
 
 This repo comes with several pre-packaged voices. You will be familiar with many of them. :)
@@ -116,14 +137,33 @@ these settings (and it's very likely that I missed something!)
 These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See
 ```api.tts``` for a full list.
 
+### Prompt engineering
+
+Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion
+by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to
+take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the
+prompt "\[I am really sad,\] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
+
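An illustrative invocation of this feature through the `do_tts.py` script added later in this diff (the bracketed text is rendered for emotion but redacted from the audio):

```shell
# Sketch: the "[...]" prefix steers tonality; the redaction system strips it
# from the spoken output, so only "Please feed me" is heard.
python tortoise/do_tts.py --text "[I am really sad,] Please feed me." --voice random --preset fast
```
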
 ### Playing with the voice latent
 
-Tortoise ingests reference clips by feeding them through individually through a small submodel that produces a point latent, then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents are quite expressive, affecting
-everything from tone to speaking rate to speech abnormalities.
+Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent,
+then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents
+are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.
 
-This lends itself to some neat tricks. For example, you can combine feed two different voices to tortoise and it will output what it thinks the "average" of those two voices sounds like. You could also theoretically build a small extension to Tortoise that gradually shifts the
-latent from one speaker to another, then apply it across a bit of spoken text (something I havent implemented yet, but might
-get to soon!) I am sure there are other interesting things that can be done here. Please let me know what you find!
+This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output
+what it thinks the "average" of those two voices sounds like.
+
+#### Generating conditioning latents from voices
+
+Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script
+will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).
+
+Alternatively, use api.TextToSpeech.get_conditioning_latents() to fetch the latents.
+
+#### Using raw conditioning latents to generate speech
+
+After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single
+".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).
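A minimal sketch of the round trip just described, assuming the tuple layout produced by `get_conditioning_latents.py` and the voices/ convention above; all file and directory names here are placeholders:

```python
# Sketch: load dumped latents, perturb them, and install them as a new voice.
import os
import torch

# A .pth dumped by tortoise/get_conditioning_latents.py holds one tuple.
auto_latent, diffusion_latent = torch.load('myvoice.pth')

# Small shifts to the point latent move tone, rate, and speech quirks.
auto_latent = auto_latent + 0.05 * torch.randn_like(auto_latent)

# A voices/ subdirectory containing a single .pth file is picked up as a voice.
os.makedirs('voices/myvoice_tweaked', exist_ok=True)
torch.save((auto_latent, diffusion_latent), 'voices/myvoice_tweaked/latents.pth')
```
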
 
 ### Send me feedback!
 
@@ -140,7 +180,7 @@ came from Tortoise.
 This classifier can be run on any computer, usage is as follows:
 
 ```commandline
-python is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
+python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
 ```
 
 This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier
 
requirements.txt CHANGED
@@ -1,5 +1,4 @@
-torch
-torchaudio
+tqdm
 rotary_embedding_torch
 transformers
 tokenizers
@@ -7,6 +6,5 @@ inflect
 progressbar
 einops
 unidecode
-entmax
 scipy
 librosa
setup.py ADDED
@@ -0,0 +1,35 @@
+import setuptools
+
+with open("README.md", "r", encoding="utf-8") as fh:
+    long_description = fh.read()
+
+setuptools.setup(
+    name="TorToiSe",
+    packages=setuptools.find_packages(),
+    version="2.1.3",
+    author="James Betker",
+    author_email="james@adamant.ai",
+    description="A high quality multi-voice text-to-speech library",
+    long_description=long_description,
+    long_description_content_type="text/markdown",
+    url="https://github.com/neonbjb/tortoise-tts",
+    project_urls={},
+    install_requires=[
+        'tqdm',
+        'rotary_embedding_torch',
+        'inflect',
+        'progressbar',
+        'einops',
+        'unidecode',
+        'scipy',
+        'librosa',
+        'transformers',
+        'tokenizers',
+    ],
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: Apache Software License",
+        "Operating System :: OS Independent",
+    ],
+    python_requires=">=3.6",
+)
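The README above installs this with `python setup.py install`; an equivalent pip-based install from the repo root (a modern alternative, not shown in this diff) would be:

```shell
pip install .
```
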
tortoise/__init__.py ADDED
File without changes
tortoise/api.py ADDED
@@ -0,0 +1,454 @@
+import os
+import random
+import uuid
+from urllib import request
+
+import torch
+import torch.nn.functional as F
+import progressbar
+import torchaudio
+
+from tortoise.models.classifier import AudioMiniEncoderWithClassifierHead
+from tortoise.models.cvvp import CVVP
+from tortoise.models.diffusion_decoder import DiffusionTts
+from tortoise.models.autoregressive import UnifiedVoice
+from tqdm import tqdm
+
+from tortoise.models.arch_util import TorchMelSpectrogram
+from tortoise.models.clvp import CLVP
+from tortoise.models.random_latent_generator import RandomLatentConverter
+from tortoise.models.vocoder import UnivNetGenerator
+from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel
+from tortoise.utils.diffusion import SpacedDiffusion, space_timesteps, get_named_beta_schedule
+from tortoise.utils.tokenizer import VoiceBpeTokenizer
+from tortoise.utils.wav2vec_alignment import Wav2VecAlignment
+
+pbar = None
+
+
+def download_models(specific_models=None):
+    """
+    Call to download all the models that Tortoise uses.
+    """
+    MODELS = {
+        'autoregressive.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/autoregressive.pth',
+        'classifier.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/classifier.pth',
+        'clvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/clvp.pth',
+        'cvvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/cvvp.pth',
+        'diffusion_decoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/diffusion_decoder.pth',
+        'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/vocoder.pth',
+        'rlg_auto.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/rlg_auto.pth',
+        'rlg_diffuser.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/hf/.models/rlg_diffuser.pth',
+    }
+    os.makedirs('.models', exist_ok=True)
+    def show_progress(block_num, block_size, total_size):
+        global pbar
+        if pbar is None:
+            pbar = progressbar.ProgressBar(maxval=total_size)
+            pbar.start()
+
+        downloaded = block_num * block_size
+        if downloaded < total_size:
+            pbar.update(downloaded)
+        else:
+            pbar.finish()
+            pbar = None
+    for model_name, url in MODELS.items():
+        if specific_models is not None and model_name not in specific_models:
+            continue
+        if os.path.exists(f'.models/{model_name}'):
+            continue
+        print(f'Downloading {model_name} from {url}...')
+        request.urlretrieve(url, f'.models/{model_name}', show_progress)
+        print('Done.')
+
+
+def pad_or_truncate(t, length):
+    """
+    Utility function for forcing <t> to have the specified sequence length, whether by clipping it or padding it with 0s.
+    """
+    if t.shape[-1] == length:
+        return t
+    elif t.shape[-1] < length:
+        return F.pad(t, (0, length-t.shape[-1]))
+    else:
+        return t[..., :length]
+
+
+def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusion_steps=200, cond_free=True, cond_free_k=1):
+    """
+    Helper function to load a GaussianDiffusion instance configured for use as a vocoder.
+    """
+    return SpacedDiffusion(use_timesteps=space_timesteps(trained_diffusion_steps, [desired_diffusion_steps]), model_mean_type='epsilon',
+                           model_var_type='learned_range', loss_type='mse', betas=get_named_beta_schedule('linear', trained_diffusion_steps),
+                           conditioning_free=cond_free, conditioning_free_k=cond_free_k)
+
+
+def format_conditioning(clip, cond_length=132300):
+    """
+    Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models.
+    """
+    gap = clip.shape[-1] - cond_length
+    if gap < 0:
+        clip = F.pad(clip, pad=(0, abs(gap)))
+    elif gap > 0:
+        rand_start = random.randint(0, gap)
+        clip = clip[:, rand_start:rand_start + cond_length]
+    mel_clip = TorchMelSpectrogram()(clip.unsqueeze(0)).squeeze(0)
+    return mel_clip.unsqueeze(0).cuda()
+
+
+def fix_autoregressive_output(codes, stop_token, complain=True):
+    """
+    This function performs some padding on coded audio that fixes a mismatch issue between what the diffusion model was
+    trained on and what the autoregressive code generator creates (which has no padding or end).
+    This is highly specific to the DVAE being used, so this particular coding will not necessarily work if used with
+    a different DVAE. This can be inferred by feeding an audio clip padded with lots of zeros on the end through the DVAE
+    and copying out the last few codes.
+
+    Failing to do this padding will produce speech with a harsh end that sounds like "BLAH" or similar.
+    """
+    # Strip off the autoregressive stop token and add padding.
+    stop_token_indices = (codes == stop_token).nonzero()
+    if len(stop_token_indices) == 0:
+        if complain:
+            print("No stop tokens found in one of the generated voice clips. This typically means the spoken audio is "
+                  "too long. In some cases, the output will still be good, though. Listen to it and if it is missing words, "
+                  "try breaking up your input text.")
+        return codes
+    else:
+        codes[stop_token_indices] = 83
+    stm = stop_token_indices.min().item()
+    codes[stm:] = 83
+    if stm - 3 < codes.shape[0]:
+        codes[-3] = 45
+        codes[-2] = 45
+        codes[-1] = 248
+
+    return codes
+
+
+def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True):
+    """
+    Uses the specified diffusion model to convert discrete codes into a spectrogram.
+    """
+    with torch.no_grad():
+        output_seq_len = latents.shape[1] * 4 * 24000 // 22050  # This diffusion model converts from 22kHz spectrogram codes to a 24kHz spectrogram signal.
+        output_shape = (latents.shape[0], 100, output_seq_len)
+        precomputed_embeddings = diffusion_model.timestep_independent(latents, conditioning_latents, output_seq_len, False)
+
+        noise = torch.randn(output_shape, device=latents.device) * temperature
+        mel = diffuser.p_sample_loop(diffusion_model, output_shape, noise=noise,
+                                     model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
+                                     progress=verbose)
+        return denormalize_tacotron_mel(mel)[:,:,:output_seq_len]
+
+
+def classify_audio_clip(clip):
+    """
+    Returns whether or not Tortoise's classifier thinks the given clip came from Tortoise.
+    :param clip: torch tensor containing audio waveform data (get it from load_audio)
+    :return: True if the clip was classified as coming from Tortoise and false if it was classified as real.
+    """
+    download_models(['classifier.pth'])
+    classifier = AudioMiniEncoderWithClassifierHead(2, spec_dim=1, embedding_dim=512, depth=5, downsample_factor=4,
+                                                    resnet_blocks=2, attn_blocks=4, num_attn_heads=4, base_channels=32,
+                                                    dropout=0, kernel_size=5, distribute_zero_label=False)
+    classifier.load_state_dict(torch.load('.models/classifier.pth', map_location=torch.device('cpu')))
+    clip = clip.cpu().unsqueeze(0)
+    results = F.softmax(classifier(clip), dim=-1)
+    return results[0][0]
+
+
+class TextToSpeech:
+    """
+    Main entry point into Tortoise.
+    """
+
+    def __init__(self, autoregressive_batch_size=16, models_dir='.models', enable_redaction=True):
+        """
+        Constructor
+        :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing
+                                          GPU OOM errors. Larger numbers generate slightly faster.
+        :param models_dir: Where model weights are stored. This should only be specified if you are providing your own
+                           models, otherwise use the defaults.
+        :param enable_redaction: When true, text enclosed in brackets is automatically redacted from the spoken output
+                                 (but is still rendered by the model). This can be used for prompt engineering.
+                                 Default is true.
+        """
+        self.autoregressive_batch_size = autoregressive_batch_size
+        self.enable_redaction = enable_redaction
+        if self.enable_redaction:
+            self.aligner = Wav2VecAlignment()
+
+        self.tokenizer = VoiceBpeTokenizer()
+        download_models()
+
+        if os.path.exists(f'{models_dir}/autoregressive.ptt'):
+            # Assume this is a traced directory.
+            self.autoregressive = torch.jit.load(f'{models_dir}/autoregressive.ptt')
+            self.diffusion = torch.jit.load(f'{models_dir}/diffusion_decoder.ptt')
+        else:
+            self.autoregressive = UnifiedVoice(max_mel_tokens=604, max_text_tokens=402, max_conditioning_inputs=2, layers=30,
+                                               model_dim=1024,
+                                               heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False,
+                                               train_solo_embeddings=False).cpu().eval()
+            self.autoregressive.load_state_dict(torch.load(f'{models_dir}/autoregressive.pth'))
+
+            self.diffusion = DiffusionTts(model_channels=1024, num_layers=10, in_channels=100, out_channels=200,
+                                          in_latent_channels=1024, in_tokens=8193, dropout=0, use_fp16=False, num_heads=16,
+                                          layer_drop=0, unconditioned_percentage=0).cpu().eval()
+            self.diffusion.load_state_dict(torch.load(f'{models_dir}/diffusion_decoder.pth'))
+
+        self.clvp = CLVP(dim_text=512, dim_speech=512, dim_latent=512, num_text_tokens=256, text_enc_depth=12,
+                         text_seq_len=350, text_heads=8,
+                         num_speech_tokens=8192, speech_enc_depth=12, speech_heads=8, speech_seq_len=430,
+                         use_xformers=True).cpu().eval()
+        self.clvp.load_state_dict(torch.load(f'{models_dir}/clvp.pth'))
+
+        self.cvvp = CVVP(model_dim=512, transformer_heads=8, dropout=0, mel_codes=8192, conditioning_enc_depth=8, cond_mask_percentage=0,
+                         speech_enc_depth=8, speech_mask_percentage=0, latent_multiplier=1).cpu().eval()
+        self.cvvp.load_state_dict(torch.load(f'{models_dir}/cvvp.pth'))
+
+        self.vocoder = UnivNetGenerator().cpu()
+        self.vocoder.load_state_dict(torch.load(f'{models_dir}/vocoder.pth')['model_g'])
+        self.vocoder.eval(inference=True)
+
+        # Random latent generators (RLGs) are loaded lazily.
+        self.rlg_auto = None
+        self.rlg_diffusion = None
+
+    def get_conditioning_latents(self, voice_samples, return_mels=False):
+        """
+        Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
+        These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
+        properties.
+        :param voice_samples: List of 2 or more ~10 second reference clips, which should be torch tensors containing 22.05kHz waveform data.
+        """
+        voice_samples = [v.to('cuda') for v in voice_samples]
+
+        auto_conds = []
+        if not isinstance(voice_samples, list):
+            voice_samples = [voice_samples]
+        for vs in voice_samples:
+            auto_conds.append(format_conditioning(vs))
+        auto_conds = torch.stack(auto_conds, dim=1)
+        self.autoregressive = self.autoregressive.cuda()
+        auto_latent = self.autoregressive.get_conditioning(auto_conds)
+        self.autoregressive = self.autoregressive.cpu()
+
+        diffusion_conds = []
+        for sample in voice_samples:
+            # The diffuser operates at a sample rate of 24000 (except for the latent inputs)
+            sample = torchaudio.functional.resample(sample, 22050, 24000)
+            sample = pad_or_truncate(sample, 102400)
+            cond_mel = wav_to_univnet_mel(sample.to('cuda'), do_normalization=False)
+            diffusion_conds.append(cond_mel)
+        diffusion_conds = torch.stack(diffusion_conds, dim=1)
+
+        self.diffusion = self.diffusion.cuda()
+        diffusion_latent = self.diffusion.get_conditioning(diffusion_conds)
+        self.diffusion = self.diffusion.cpu()
+
+        if return_mels:
+            return auto_latent, diffusion_latent, auto_conds, diffusion_conds
+        else:
+            return auto_latent, diffusion_latent
+
+    def get_random_conditioning_latents(self):
+        # Lazy-load the RLG models.
+        if self.rlg_auto is None:
+            self.rlg_auto = RandomLatentConverter(1024).eval()
+            self.rlg_auto.load_state_dict(torch.load('.models/rlg_auto.pth', map_location=torch.device('cpu')))
+            self.rlg_diffusion = RandomLatentConverter(2048).eval()
+            self.rlg_diffusion.load_state_dict(torch.load('.models/rlg_diffuser.pth', map_location=torch.device('cpu')))
+        with torch.no_grad():
+            return self.rlg_auto(torch.tensor([0.0])), self.rlg_diffusion(torch.tensor([0.0]))
+
+    def tts_with_preset(self, text, preset='fast', **kwargs):
+        """
+        Calls TTS with one of a set of preset generation parameters. Options:
+            'ultra_fast': Produces speech at a speed which belies the name of this repo. (Not really, but it's definitely fastest).
+            'fast': Decent quality speech at a decent inference rate. A good choice for mass inference.
+            'standard': Very good quality. This is generally about as good as you are going to get.
+            'high_quality': Use if you want the absolute best. This is not really worth the compute, though.
+        """
+        # Use generally found best tuning knobs for generation.
+        kwargs.update({'temperature': .8, 'length_penalty': 1.0, 'repetition_penalty': 2.0,
+                       'top_p': .8,
+                       'cond_free_k': 2.0, 'diffusion_temperature': 1.0})
+        # Presets are defined here.
+        presets = {
+            'ultra_fast': {'num_autoregressive_samples': 16, 'diffusion_iterations': 30, 'cond_free': False},
+            'fast': {'num_autoregressive_samples': 96, 'diffusion_iterations': 80},
+            'standard': {'num_autoregressive_samples': 256, 'diffusion_iterations': 200},
+            'high_quality': {'num_autoregressive_samples': 256, 'diffusion_iterations': 400},
+        }
+        kwargs.update(presets[preset])
+        return self.tts(text, **kwargs)
+
+    def tts(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True,
+            # autoregressive generation parameters follow
+            num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500,
+            # CLVP & CVVP parameters
+            clvp_cvvp_slider=.5,
+            # diffusion generation parameters follow
+            diffusion_iterations=100, cond_free=True, cond_free_k=2, diffusion_temperature=1.0,
+            **hf_generate_kwargs):
+        """
+        Produces an audio clip of the given text being spoken with the given reference voice.
+        :param text: Text to be spoken.
+        :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data.
+        :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which
+                                     can be provided in lieu of voice_samples. This is ignored unless voice_samples=None.
+                                     Conditioning latents can be retrieved via get_conditioning_latents().
+        :param k: The number of returned clips. The most likely (as determined by Tortoise's CLVP and CVVP models) clips are returned.
+        :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true.
+        ~~AUTOREGRESSIVE KNOBS~~
+        :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP+CVVP.
+                                           As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great".
+        :param temperature: The softmax temperature of the autoregressive model.
+        :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs.
+        :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence
+                                   of long silences or "uhhhhhhs", etc.
+        :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs.
+        :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second.
+        :param typical_sampling: Turns typical sampling on or off. This sampling mode is discussed in this paper: https://arxiv.org/abs/2202.00666
+                                 I was interested in the premise, but the results were not as good as I was hoping. This is off by default, but
+                                 could use some tuning.
+        :param typical_mass: The typical_mass parameter from the typical_sampling algorithm.
+        ~~CLVP-CVVP KNOBS~~
+        :param clvp_cvvp_slider: Controls the influence of the CLVP and CVVP models in selecting the best output from the autoregressive model.
+                                 [0,1]. Values closer to 1 will cause Tortoise to emit clips that follow the text more. Values closer to
+                                 0 will cause Tortoise to emit clips that more closely follow the reference clip (e.g. the voice sounds more
+                                 similar).
+        ~~DIFFUSION KNOBS~~
+        :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine
+                                     the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better,
+                                     however.
+        :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for
+                          each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output
+                          of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and
+                          dramatically improves realism.
+        :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf].
+                            As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
+                            Formula is: output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k
+        :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
+                                      are the "mean" prediction of the diffusion network and will sound bland and smeared.
+        ~~OTHER STUFF~~
+        :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer.
+                                   Extra keyword args fed to this function get forwarded directly to that API. Documentation
+                                   here: https://huggingface.co/docs/transformers/internal/generation_utils
+        :return: Generated audio clip(s) as a torch tensor. Shape (1,S) if k=1, else (k,1,S) where S is the sample length.
+                 Sample rate is 24kHz.
+        """
+        text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).cuda()
+        text_tokens = F.pad(text_tokens, (0, 1))  # This may not be necessary.
+        assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
+
+        auto_conds = None
+        if voice_samples is not None:
+            auto_conditioning, diffusion_conditioning, auto_conds, _ = self.get_conditioning_latents(voice_samples, return_mels=True)
+        elif conditioning_latents is not None:
+            auto_conditioning, diffusion_conditioning = conditioning_latents
+        else:
+            auto_conditioning, diffusion_conditioning = self.get_random_conditioning_latents()
+        auto_conditioning = auto_conditioning.cuda()
+        diffusion_conditioning = diffusion_conditioning.cuda()
+
+        diffuser = load_discrete_vocoder_diffuser(desired_diffusion_steps=diffusion_iterations, cond_free=cond_free, cond_free_k=cond_free_k)
+
+        with torch.no_grad():
+            samples = []
+            num_batches = num_autoregressive_samples // self.autoregressive_batch_size
+            stop_mel_token = self.autoregressive.stop_mel_token
+            calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
+            self.autoregressive = self.autoregressive.cuda()
+            if verbose:
+                print("Generating autoregressive samples..")
+            for b in tqdm(range(num_batches), disable=not verbose):
+                codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
+                                                             do_sample=True,
+                                                             top_p=top_p,
+                                                             temperature=temperature,
+                                                             num_return_sequences=self.autoregressive_batch_size,
+                                                             length_penalty=length_penalty,
+                                                             repetition_penalty=repetition_penalty,
+                                                             max_generate_length=max_mel_tokens,
+                                                             **hf_generate_kwargs)
+                padding_needed = max_mel_tokens - codes.shape[1]
+                codes = F.pad(codes, (0, padding_needed), value=stop_mel_token)
+                samples.append(codes)
+            self.autoregressive = self.autoregressive.cpu()
+
+            clip_results = []
+            self.clvp = self.clvp.cuda()
+            self.cvvp = self.cvvp.cuda()
+            if verbose:
+                print("Computing best candidates using CLVP and CVVP")
+            for batch in tqdm(samples, disable=not verbose):
+                for i in range(batch.shape[0]):
+                    batch[i] = fix_autoregressive_output(batch[i], stop_mel_token)
+                clvp = self.clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False)
+                if auto_conds is not None:
+                    cvvp_accumulator = 0
+                    for cl in range(auto_conds.shape[1]):
+                        cvvp_accumulator = cvvp_accumulator + self.cvvp(auto_conds[:, cl].repeat(batch.shape[0], 1, 1), batch, return_loss=False)
+                    cvvp = cvvp_accumulator / auto_conds.shape[1]
+                    clip_results.append(clvp * clvp_cvvp_slider + cvvp * (1-clvp_cvvp_slider))
+                else:
+                    clip_results.append(clvp)
+            clip_results = torch.cat(clip_results, dim=0)
+            samples = torch.cat(samples, dim=0)
+            best_results = samples[torch.topk(clip_results, k=k).indices]
+            self.clvp = self.clvp.cpu()
+            self.cvvp = self.cvvp.cpu()
+            del samples
+
+            # The diffusion model actually wants the last hidden layer from the autoregressive model as conditioning
+            # inputs. Re-produce those for the top results. This could be made more efficient by storing all of these
+            # results, but will increase memory usage.
+            self.autoregressive = self.autoregressive.cuda()
+            best_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
+                                               torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results,
+                                               torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
+                                               return_latent=True, clip_inputs=False)
+            self.autoregressive = self.autoregressive.cpu()
+            del auto_conditioning
+
+            if verbose:
+                print("Transforming autoregressive outputs into audio..")
+            wav_candidates = []
+            self.diffusion = self.diffusion.cuda()
+            self.vocoder = self.vocoder.cuda()
+            for b in range(best_results.shape[0]):
+                codes = best_results[b].unsqueeze(0)
+                latents = best_latents[b].unsqueeze(0)
+
+                # Find the first occurrence of the "calm" token and trim the codes to that.
+                ctokens = 0
+                for k in range(codes.shape[-1]):
+                    if codes[0, k] == calm_token:
+                        ctokens += 1
+                    else:
+                        ctokens = 0
+                    if ctokens > 8:  # 8 tokens gives the diffusion model some "breathing room" to terminate speech.
+                        latents = latents[:, :k]
+                        break
+
+                mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, diffusion_conditioning,
+                                               temperature=diffusion_temperature, verbose=verbose)
+                wav = self.vocoder.inference(mel)
+                wav_candidates.append(wav.cpu())
+            self.diffusion = self.diffusion.cpu()
+            self.vocoder = self.vocoder.cpu()
+
+            def potentially_redact(clip, text):
+                if self.enable_redaction:
+                    return self.aligner.redact(clip.squeeze(1), text).unsqueeze(1)
+                return clip
+            wav_candidates = [potentially_redact(wav_candidate, text) for wav_candidate in wav_candidates]
+            if len(wav_candidates) > 1:
+                return wav_candidates
+            return wav_candidates[0]
+
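To tie the pieces of this file together, a hypothetical end-to-end sketch of the main flow (reference clips in, latents, audio out); `load_audio` comes from `tortoise/utils/audio.py`, which is not shown in this diff, and the clip paths are placeholders:

```python
# Sketch: voice cloning through TextToSpeech, mirroring what do_tts.py automates.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()
# Two or more ~10 second reference clips, loaded as 22.05kHz tensors.
clips = [load_audio(p, 22050) for p in ['ref1.wav', 'ref2.wav']]
gen = tts.tts_with_preset("Hello from Tortoise.", voice_samples=clips, preset='standard')
torchaudio.save('cloned.wav', gen.squeeze(0).cpu(), 24000)
```
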
tortoise/data/mel_norms.pth ADDED
Binary file (1.07 kB).
 
tortoise/data/riding_hood.txt ADDED
@@ -0,0 +1,54 @@
+Once upon a time there lived in a certain village a little country girl, the prettiest creature who was ever seen. Her mother was excessively fond of her; and her grandmother doted on her still more. This good woman had a little red riding hood made for her. It suited the girl so extremely well that everybody called her Little Red Riding Hood.
+One day her mother, having made some cakes, said to her, "Go, my dear, and see how your grandmother is doing, for I hear she has been very ill. Take her a cake, and this little pot of butter."
+
+Little Red Riding Hood set out immediately to go to her grandmother, who lived in another village.
+
+As she was going through the wood, she met with a wolf, who had a very great mind to eat her up, but he dared not, because of some woodcutters working nearby in the forest. He asked her where she was going. The poor child, who did not know that it was dangerous to stay and talk to a wolf, said to him, "I am going to see my grandmother and carry her a cake and a little pot of butter from my mother."
+
+"Does she live far off?" said the wolf.
+
+"Oh I say," answered Little Red Riding Hood; "it is beyond that mill you see there, at the first house in the village."
+
+"Well," said the wolf, "and I'll go and see her too. I'll go this way and go you that, and we shall see who will be there first."
+
+The wolf ran as fast as he could, taking the shortest path, and the little girl took a roundabout way, entertaining herself by gathering nuts, running after butterflies, and gathering bouquets of little flowers. It was not long before the wolf arrived at the old woman's house. He knocked at the door: tap, tap.
+
+"Who's there?"
+
+"Your grandchild, Little Red Riding Hood," replied the wolf, counterfeiting her voice; "who has brought you a cake and a little pot of butter sent you by mother."
+
+The good grandmother, who was in bed, because she was somewhat ill, cried out, "Pull the bobbin, and the latch will go up."
+
+The wolf pulled the bobbin, and the door opened, and then he immediately fell upon the good woman and ate her up in a moment, for it had been more than three days since he had eaten. He then shut the door and got into the grandmother's bed, expecting Little Red Riding Hood, who came some time afterwards and knocked at the door: tap, tap.
+
+"Who's there?"
+
+Little Red Riding Hood, hearing the big voice of the wolf, was at first afraid; but believing her grandmother had a cold and was hoarse, answered, "It is your grandchild Little Red Riding Hood, who has brought you a cake and a little pot of butter mother sends you."
+
+The wolf cried out to her, softening his voice as much as he could, "Pull the bobbin, and the latch will go up."
+
+Little Red Riding Hood pulled the bobbin, and the door opened.
+
+The wolf, seeing her come in, said to her, hiding himself under the bedclothes, "Put the cake and the little pot of butter upon the stool, and come get into bed with me."
+
+Little Red Riding Hood took off her clothes and got into bed. She was greatly amazed to see how her grandmother looked in her nightclothes, and said to her, "Grandmother, what big arms you have!"
+
+"All the better to hug you with, my dear."
+
+"Grandmother, what big legs you have!"
+
+"All the better to run with, my child."
+
+"Grandmother, what big ears you have!"
+
+"All the better to hear with, my child."
+
+"Grandmother, what big eyes you have!"
+
+"All the better to see with, my child."
+
+"Grandmother, what big teeth you have got!"
+
+"All the better to eat you up with."
+
+And, saying these words, this wicked wolf fell upon Little Red Riding Hood, and ate her all up.
tortoise/data/seal_copypasta.txt ADDED
@@ -0,0 +1 @@
+What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al kayda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire U S armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the U S A and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo.
tortoise/data/tokenizer.json ADDED
@@ -0,0 +1 @@
+{"version":"1.0","truncation":null,"padding":null,"added_tokens":[{"id":0,"special":true,"content":"[STOP]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":1,"special":true,"content":"[UNK]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":2,"special":true,"content":"[SPACE]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false}],"normalizer":null,"pre_tokenizer":{"type":"Whitespace"},"post_processor":null,"decoder":null,"model":{"type":"BPE","dropout":null,"unk_token":"[UNK]","continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"vocab":{"[STOP]":0,"[UNK]":1,"[SPACE]":2,"!":3,"'":4,"(":5,")":6,",":7,"-":8,".":9,"/":10,":":11,";":12,"?":13,"a":14,"b":15,"c":16,"d":17,"e":18,"f":19,"g":20,"h":21,"i":22,"j":23,"k":24,"l":25,"m":26,"n":27,"o":28,"p":29,"q":30,"r":31,"s":32,"t":33,"u":34,"v":35,"w":36,"x":37,"y":38,"z":39,"th":40,"in":41,"the":42,"an":43,"er":44,"ou":45,"re":46,"on":47,"at":48,"ed":49,"en":50,"to":51,"ing":52,"and":53,"is":54,"as":55,"al":56,"or":57,"of":58,"ar":59,"it":60,"es":61,"he":62,"st":63,"le":64,"om":65,"se":66,"be":67,"ad":68,"ow":69,"ly":70,"ch":71,"wh":72,"that":73,"you":74,"li":75,"ve":76,"ac":77,"ti":78,"ld":79,"me":80,"was":81,"gh":82,"id":83,"ll":84,"wi":85,"ent":86,"for":87,"ay":88,"ro":89,"ver":90,"ic":91,"her":92,"ke":93,"his":94,"no":95,"ut":96,"un":97,"ir":98,"lo":99,"we":100,"ri":101,"ha":102,"with":103,"ght":104,"out":105,"im":106,"ion":107,"all":108,"ab":109,"one":110,"ne":111,"ge":112,"ould":113,"ter":114,"mo":115,"had":116,"ce":117,"she":118,"go":119,"sh":120,"ur":121,"am":122,"so":123,"pe":124,"my":125,"de":126,"are":127,"but":128,"ome":129,"fr":130,"ther":131,"fe":132,"su":133,"do":134,"con":135,"te":136,"ain":137,"ere":138,"po":139,"if":140,"they":141,"us":142,"ag":143,"tr":144,"now":145,"oun":146,"this":147,"have":148,"not":149,"sa":150,"il":151,"up":152,"thing":153,"from":154,"ap":155,"him":156,"ack":157,"ation":158,"ant":159,"our":160,"op":161,"like":162,"ust":163,"ess":164,"bo":165,"ok":166,"ul":167,"ind":168,"ex":169,"com":170,"some":171,"there":172,"ers":173,"co":174,"res":175,"man":176,"ard":177,"pl":178,"wor":179,"way":180,"tion":181,"fo":182,"ca":183,"were":184,"by":185,"ate":186,"pro":187,"ted":188,"ound":189,"own":190,"would":191,"ts":192,"what":193,"qu":194,"ally":195,"ight":196,"ck":197,"gr":198,"when":199,"ven":200,"can":201,"ough":202,"ine":203,"end":204,"per":205,"ous":206,"od":207,"ide":208,"know":209,"ty":210,"very":211,"si":212,"ak":213,"who":214,"about":215,"ill":216,"them":217,"est":218,"red":219,"ye":220,"could":221,"ong":222,"your":223,"their":224,"em":225,"just":226,"other":227,"into":228,"any":229,"whi":230,"um":231,"tw":232,"ast":233,"der":234,"did":235,"ie":236,"been":237,"ace":238,"ink":239,"ity":240,"back":241,"ting":242,"br":243,"more":244,"ake":245,"pp":246,"then":247,"sp":248,"el":249,"use":250,"bl":251,"said":252,"over":253,"get":254},"merges":["t h","i n","th e","a n","e r","o u","r e","o n","a t","e d","e n","t o","in g","an d","i s","a s","a l","o r","o f","a r","i t","e s","h e","s t","l e","o m","s e","b e","a d","o w","l y","c h","w h","th at","y ou","l i","v e","a c","t i","l d","m e","w as","g h","i d","l l","w i","en t","f or","a y","r o","v er","i c","h er","k e","h is","n o","u t","u n","i r","l o","w e","r i","h a","wi th","gh t","ou t","i m","i on","al l","a b","on e","n e","g e","ou ld","t er","m o","h ad","c e","s he","g o","s h","u r","a m","s o","p e","m y","d e","a re","b ut","om e","f r","the r","f e","s u","d o","c on","t e","a in","er e","p o","i f","the y","u s","a g","t r","n ow","ou n","th is","ha ve","no t","s a","i l","u p","th ing","fr om","a p","h im","ac k","at ion","an t","ou r","o p","li ke","u st","es s","b o","o k","u l","in d","e x","c om","s ome","the re","er s","c o","re s","m an","ar d","p l","w or","w ay","ti on","f o","c a","w ere","b y","at e","p ro","t ed","oun d","ow n","w ould","t s","wh at","q u","al ly","i ght","c k","g r","wh en","v en","c an","ou gh","in e","en d","p er","ou s","o d","id e","k now","t y","ver y","s i","a k","wh o","ab out","i ll","the m","es t","re d","y e","c ould","on g","you r","the ir","e m","j ust","o ther","in to","an y","wh i","u m","t w","as t","d er","d id","i e","be en","ac e","in k","it y","b ack","t ing","b r","mo re","a ke","p p","the n","s p","e l","u se","b l","sa id","o ver","ge t"]}}
tortoise/do_tts.py ADDED
@@ -0,0 +1,32 @@
+import argparse
+import os
+
+import torchaudio
+
+from api import TextToSpeech
+from tortoise.utils.audio import load_audio, get_voices, load_voice
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--text', type=str, help='Text to speak.', default="The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.")
+    parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
+                                                  'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random')
+    parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='fast')
+    parser.add_argument('--voice_diversity_intelligibility_slider', type=float,
+                        help='How to balance vocal diversity with the quality/intelligibility of the spoken text. 0 means highly diverse voice (not recommended), 1 means maximize intelligibility',
+                        default=.5)
+    parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/')
+    parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this '
+                                                      'should only be specified if you have custom checkpoints.', default='.models')
+    args = parser.parse_args()
+    os.makedirs(args.output_path, exist_ok=True)
+
+    tts = TextToSpeech(models_dir=args.model_dir)
+
+    selected_voices = args.voice.split(',')
+    for k, voice in enumerate(selected_voices):
+        voice_samples, conditioning_latents = load_voice(voice)
+        gen = tts.tts_with_preset(args.text, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
+                                  preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider)
+        torchaudio.save(os.path.join(args.output_path, f'{voice}_{k}.wav'), gen.squeeze(0).cpu(), 24000)
+
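The `--voice` help above allows `&` to average two voices and `,` to batch several; an illustrative invocation with placeholder voice names:

```shell
# Sketch: synthesize once with the average of two installed voices, then again
# with the random voice, writing one wav per entry into results/.
python tortoise/do_tts.py --text "I'm going to speak this" --voice "voiceA&voiceB,random" --preset fast
```
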
tortoise/get_conditioning_latents.py ADDED
@@ -0,0 +1,30 @@
+import argparse
+import os
+import torch
+
+from api import TextToSpeech
+from tortoise.utils.audio import load_audio, get_voices
+
+"""
+Dumps the conditioning latents for the specified voice to disk. These are expressive latents which can be used for
+other ML models, or can be augmented manually and fed back into Tortoise to affect vocal qualities.
+"""
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--voice', type=str, help='Selects the voice to convert to conditioning latents', default='pat2')
+    parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='../results/conditioning_latents')
+    args = parser.parse_args()
+    os.makedirs(args.output_path, exist_ok=True)
+
+    tts = TextToSpeech()
+    voices = get_voices()
+    selected_voices = args.voice.split(',')
+    for voice in selected_voices:
+        cond_paths = voices[voice]
+        conds = []
+        for cond_path in cond_paths:
+            c = load_audio(cond_path, 22050)
+            conds.append(c)
+        conditioning_latents = tts.get_conditioning_latents(conds)
+        torch.save(conditioning_latents, os.path.join(args.output_path, f'{voice}.pth'))
+
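An illustrative run of the script above (the voice name is a placeholder; the default `--voice` is `pat2` and the default output path is `../results/conditioning_latents`):

```shell
# Sketch: writes <voice>.pth containing the (autoregressive_latent, diffusion_latent) tuple.
python tortoise/get_conditioning_latents.py --voice myvoice --output_path results/conditioning_latents
```
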
tortoise/is_this_from_tortoise.py ADDED
@@ -0,0 +1,14 @@
+import argparse
+
+from api import classify_audio_clip
+from tortoise.utils.audio import load_audio
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--clip', type=str, help='Path to an audio clip to classify.', default="../examples/favorite_riding_hood.mp3")
+    args = parser.parse_args()
+
+    clip = load_audio(args.clip, 24000)
+    clip = clip[:, :220000]
+    prob = classify_audio_clip(clip)
+    print(f"This classifier thinks there is a {prob*100}% chance that this clip was generated from Tortoise.")
tortoise/models/__init__.py ADDED
File without changes
tortoise/models/arch_util.py ADDED
@@ -0,0 +1,367 @@
+import functools
+import math
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torchaudio
+from tortoise.models.xtransformers import ContinuousTransformerWrapper, RelativePositionBias
+
+
+def zero_module(module):
+    """
+    Zero out the parameters of a module and return it.
+    """
+    for p in module.parameters():
+        p.detach().zero_()
+    return module
+
+
+class GroupNorm32(nn.GroupNorm):
+    def forward(self, x):
+        return super().forward(x.float()).type(x.dtype)
+
+
+def normalization(channels):
+    """
+    Make a standard normalization layer.
+
+    :param channels: number of input channels.
+    :return: an nn.Module for normalization.
+    """
+    groups = 32
+    if channels <= 16:
+        groups = 8
+    elif channels <= 64:
+        groups = 16
+    while channels % groups != 0:
+        groups = int(groups / 2)
+    assert groups > 2
+    return GroupNorm32(groups, channels)
+
+
+class QKVAttentionLegacy(nn.Module):
+    """
+    A module which performs QKV attention. Matches legacy QKVAttention + input/output heads shaping
+    """
+
+    def __init__(self, n_heads):
+        super().__init__()
+        self.n_heads = n_heads
+
+    def forward(self, qkv, mask=None, rel_pos=None):
+        """
+        Apply QKV attention.
+
+        :param qkv: an [N x (H * 3 * C) x T] tensor of Qs, Ks, and Vs.
+        :return: an [N x (H * C) x T] tensor after attention.
+        """
+        bs, width, length = qkv.shape
+        assert width % (3 * self.n_heads) == 0
+        ch = width // (3 * self.n_heads)
+        q, k, v = qkv.reshape(bs * self.n_heads, ch * 3, length).split(ch, dim=1)
+        scale = 1 / math.sqrt(math.sqrt(ch))
+        weight = torch.einsum(
+            "bct,bcs->bts", q * scale, k * scale
+        )  # More stable with f16 than dividing afterwards
+        if rel_pos is not None:
+            weight = rel_pos(weight.reshape(bs, self.n_heads, weight.shape[-2], weight.shape[-1])).reshape(bs * self.n_heads, weight.shape[-2], weight.shape[-1])
+        weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
+        if mask is not None:
+            # The proper way to do this is to mask before the softmax using -inf, but that doesn't work properly on CPUs.
+            mask = mask.repeat(self.n_heads, 1).unsqueeze(1)
+            weight = weight * mask
+        a = torch.einsum("bts,bcs->bct", weight, v)
+
+        return a.reshape(bs, -1, length)
+
+
+class AttentionBlock(nn.Module):
+    """
+    An attention block that allows spatial positions to attend to each other.
+
+    Originally ported from here, but adapted to the N-d case.
+    https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66.
+    """
+
+    def __init__(
+        self,
+        channels,
+        num_heads=1,
+        num_head_channels=-1,
+        do_checkpoint=True,
+        relative_pos_embeddings=False,
+    ):
+        super().__init__()
+        self.channels = channels
+        self.do_checkpoint = do_checkpoint
+        if num_head_channels == -1:
+            self.num_heads = num_heads
+        else:
+            assert (
+                channels % num_head_channels == 0
+            ), f"q,k,v channels {channels} is not divisible by num_head_channels {num_head_channels}"
+            self.num_heads = channels // num_head_channels
+        self.norm = normalization(channels)
+        self.qkv = nn.Conv1d(channels, channels * 3, 1)
+        # split heads before split qkv
+        self.attention = QKVAttentionLegacy(self.num_heads)
+
+        self.proj_out = zero_module(nn.Conv1d(channels, channels, 1))
+        if relative_pos_embeddings:
+            self.relative_pos_embeddings = RelativePositionBias(scale=(channels // self.num_heads) ** .5, causal=False, heads=num_heads, num_buckets=32, max_distance=64)
+        else:
+            self.relative_pos_embeddings = None
+
+    def forward(self, x, mask=None):
+        b, c, *spatial = x.shape
+        x = x.reshape(b, c, -1)
+        qkv = self.qkv(self.norm(x))
+        h = self.attention(qkv, mask, self.relative_pos_embeddings)
+        h = self.proj_out(h)
+        return (x + h).reshape(b, c, *spatial)
+
+
+class Upsample(nn.Module):
+    """
+    An upsampling layer with an optional convolution.
+
+    :param channels: channels in the inputs and outputs.
+    :param use_conv: a bool determining if a convolution is applied.
+    """
+
+    def __init__(self, channels, use_conv, out_channels=None, factor=4):
+        super().__init__()
+        self.channels = channels
+        self.out_channels = out_channels or channels
+        self.use_conv = use_conv
+        self.factor = factor
+        if use_conv:
+            ksize = 5
+            pad = 2
+            self.conv = nn.Conv1d(self.channels, self.out_channels, ksize, padding=pad)
+
+    def forward(self, x):
+        assert x.shape[1] == self.channels
+        x = F.interpolate(x, scale_factor=self.factor, mode="nearest")
+        if self.use_conv:
+            x = self.conv(x)
+        return x
+
+
+class Downsample(nn.Module):
+    """
+    A downsampling layer with an optional convolution.
+
+    :param channels: channels in the inputs and outputs.
+    :param use_conv: a bool determining if a convolution is applied.
+    """
+
+    def __init__(self, channels, use_conv, out_channels=None, factor=4, ksize=5, pad=2):
+        super().__init__()
+        self.channels = channels
+        self.out_channels = out_channels or channels
+        self.use_conv = use_conv
+
+        stride = factor
+        if use_conv:
+            self.op = nn.Conv1d(
+                self.channels, self.out_channels, ksize, stride=stride, padding=pad
+            )
+        else:
+            assert self.channels == self.out_channels
+            self.op = nn.AvgPool1d(kernel_size=stride, stride=stride)
+
+    def forward(self, x):
+        assert x.shape[1] == self.channels
+        return self.op(x)
+
+
+class ResBlock(nn.Module):
+    def __init__(
+        self,
+        channels,
+        dropout,
+        out_channels=None,
+        use_conv=False,
+        use_scale_shift_norm=False,
+        up=False,
+        down=False,
+        kernel_size=3,
+    ):
+        super().__init__()
+        self.channels = channels
+        self.dropout = dropout
+        self.out_channels = out_channels or channels
+        self.use_conv = use_conv
+        self.use_scale_shift_norm = use_scale_shift_norm
+        padding = 1 if kernel_size == 3 else 2
+
+        self.in_layers = nn.Sequential(
+            normalization(channels),
+            nn.SiLU(),
+            nn.Conv1d(channels, self.out_channels, kernel_size, padding=padding),
+        )
+
+        self.updown = up or down
+
+        if up:
+            self.h_upd = Upsample(channels, False)
+            self.x_upd = Upsample(channels, False)
+        elif down:
+            self.h_upd = Downsample(channels, False)
+            self.x_upd = Downsample(channels, False)
+        else:
+            self.h_upd = self.x_upd = nn.Identity()
+
+        self.out_layers = nn.Sequential(
+            normalization(self.out_channels),
+            nn.SiLU(),
+            nn.Dropout(p=dropout),
+            zero_module(
+                nn.Conv1d(self.out_channels, self.out_channels, kernel_size, padding=padding)
+            ),
+        )
+
+        if self.out_channels == channels:
+            self.skip_connection = nn.Identity()
+        elif use_conv:
+            self.skip_connection = nn.Conv1d(
+                channels, self.out_channels, kernel_size, padding=padding
+            )
+        else:
+            self.skip_connection = nn.Conv1d(channels, self.out_channels, 1)
+
+    def forward(self, x):
+        if self.updown:
+            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
+            h = in_rest(x)
+            h = self.h_upd(h)
+            x = self.x_upd(x)
+            h = in_conv(h)
+        else:
+            h = self.in_layers(x)
+        h = self.out_layers(h)
+        return self.skip_connection(x) + h
+
+
+class AudioMiniEncoder(nn.Module):
+    def __init__(self,
+                 spec_dim,
+                 embedding_dim,
+                 base_channels=128,
+                 depth=2,
+                 resnet_blocks=2,
+                 attn_blocks=4,
+                 num_attn_heads=4,
+                 dropout=0,
+                 downsample_factor=2,
+                 kernel_size=3):
+        super().__init__()
+        self.init = nn.Sequential(
+            nn.Conv1d(spec_dim, base_channels, 3, padding=1)
+        )
+        ch = base_channels
+        res = []
+        for l in range(depth):
+            for r in range(resnet_blocks):
+                res.append(ResBlock(ch, dropout, kernel_size=kernel_size))
+            res.append(Downsample(ch, use_conv=True, out_channels=ch*2, factor=downsample_factor))
+            ch *= 2
+        self.res = nn.Sequential(*res)
+        self.final = nn.Sequential(
+            normalization(ch),
+            nn.SiLU(),
+            nn.Conv1d(ch, embedding_dim, 1)
+        )
+        attn = []
+        for a in range(attn_blocks):
+            attn.append(AttentionBlock(embedding_dim, num_attn_heads,))
+        self.attn = nn.Sequential(*attn)
+        self.dim = embedding_dim
+
+    def forward(self, x):
+        h = self.init(x)
+        h = self.res(h)
+        h = self.final(h)
+        h = self.attn(h)
+        return h[:, :, 0]
+
+
+class TorchMelSpectrogram(nn.Module):
+    def __init__(self, filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80, mel_fmin=0, mel_fmax=8000,
+                 sampling_rate=22050, normalize=False, mel_norm_file='tortoise/data/mel_norms.pth'):
+        super().__init__()
+        # These are the default tacotron values for the MEL spectrogram.
+        self.filter_length = filter_length
+        self.hop_length = hop_length
+        self.win_length = win_length
+        self.n_mel_channels = n_mel_channels
+        self.mel_fmin = mel_fmin
+        self.mel_fmax = mel_fmax
+        self.sampling_rate = sampling_rate
+        self.mel_stft = torchaudio.transforms.MelSpectrogram(n_fft=self.filter_length, hop_length=self.hop_length,
+                                                             win_length=self.win_length, power=2, normalized=normalize,
+                                                             sample_rate=self.sampling_rate, f_min=self.mel_fmin,
+                                                             f_max=self.mel_fmax, n_mels=self.n_mel_channels,
307
+ norm="slaney")
308
+ self.mel_norm_file = mel_norm_file
309
+ if self.mel_norm_file is not None:
310
+ self.mel_norms = torch.load(self.mel_norm_file)
311
+ else:
312
+ self.mel_norms = None
313
+
314
+ def forward(self, inp):
315
+ if len(inp.shape) == 3: # Automatically squeeze out the channels dimension if it is present (assuming mono-audio)
316
+ inp = inp.squeeze(1)
317
+ assert len(inp.shape) == 2
318
+ self.mel_stft = self.mel_stft.to(inp.device)
319
+ mel = self.mel_stft(inp)
320
+ # Perform dynamic range compression
321
+ mel = torch.log(torch.clamp(mel, min=1e-5))
322
+ if self.mel_norms is not None:
323
+ self.mel_norms = self.mel_norms.to(mel.device)
324
+ mel = mel / self.mel_norms.unsqueeze(0).unsqueeze(-1)
325
+ return mel
326
+
327
+
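+ # Usage sketch (illustrative): with mel_norm_file=None (skipping the packaged norms file),
+ # a batch of 22050 Hz mono audio maps to 80-bin log-mels at roughly one frame per 256 samples:
+ #   stft = TorchMelSpectrogram(mel_norm_file=None)
+ #   mel = stft(torch.randn(2, 1, 22050))  # -> approximately (2, 80, 87)
+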
328
+ class CheckpointedLayer(nn.Module):
329
+ """
330
+ Wraps a module. When forward() is called, positional args are passed through torch.utils.checkpoint(),
331
+ while kwargs (which must not require grad) bypass checkpointing via functools.partial.
332
+ """
333
+ def __init__(self, wrap):
334
+ super().__init__()
335
+ self.wrap = wrap
336
+
337
+ def forward(self, x, *args, **kwargs):
338
+ for k, v in kwargs.items():
339
+ assert not (isinstance(v, torch.Tensor) and v.requires_grad) # This would screw up checkpointing.
340
+ partial = functools.partial(self.wrap, **kwargs)
341
+ return torch.utils.checkpoint.checkpoint(partial, x, *args)
342
+
343
+
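+ # Usage sketch (illustrative): positional tensors are recomputed in the backward pass, while
+ # kwargs such as an attention mask must not require grad:
+ #   blk = CheckpointedLayer(some_transformer_block)  # some_transformer_block is hypothetical
+ #   y = blk(x, mask=mask)
+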
344
+ class CheckpointedXTransformerEncoder(nn.Module):
345
+ """
346
+ Wraps a ContinuousTransformerWrapper, applies CheckpointedLayer to each of its layers, and permutes
347
+ inputs from channels-mid to the channels-last layout that XTransformers expects.
348
+ """
349
+ def __init__(self, needs_permute=True, exit_permute=True, checkpoint=True, **xtransformer_kwargs):
350
+ super().__init__()
351
+ self.transformer = ContinuousTransformerWrapper(**xtransformer_kwargs)
352
+ self.needs_permute = needs_permute
353
+ self.exit_permute = exit_permute
354
+
355
+ if not checkpoint:
356
+ return
357
+ for i in range(len(self.transformer.attn_layers.layers)):
358
+ n, b, r = self.transformer.attn_layers.layers[i]
359
+ self.transformer.attn_layers.layers[i] = nn.ModuleList([n, CheckpointedLayer(b), r])
360
+
361
+ def forward(self, x, **kwargs):
362
+ if self.needs_permute:
363
+ x = x.permute(0,2,1)
364
+ h = self.transformer(x, **kwargs)
365
+ if self.exit_permute:
366
+ h = h.permute(0,2,1)
367
+ return h
tortoise/models/autoregressive.py ADDED
@@ -0,0 +1,511 @@
1
+ import functools
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ from transformers import GPT2Config, GPT2PreTrainedModel, LogitsProcessorList
7
+ from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
8
+ from transformers.utils.model_parallel_utils import get_device_map, assert_device_map
9
+ from tortoise.models.arch_util import AttentionBlock
10
+ from tortoise.utils.typical_sampling import TypicalLogitsWarper
11
+
12
+
13
+ def null_position_embeddings(range, dim):
14
+ return torch.zeros((range.shape[0], range.shape[1], dim), device=range.device)
15
+
16
+
17
+ class ResBlock(nn.Module):
18
+ """
19
+ Basic residual convolutional block that uses GroupNorm.
20
+ """
21
+ def __init__(self, chan):
22
+ super().__init__()
23
+ self.net = nn.Sequential(
24
+ nn.Conv1d(chan, chan, kernel_size=3, padding=1),
25
+ nn.GroupNorm(chan//8, chan),
26
+ nn.ReLU(),
27
+ nn.Conv1d(chan, chan, kernel_size=3, padding=1),
28
+ nn.GroupNorm(chan//8, chan)
29
+ )
30
+
31
+ def forward(self, x):
32
+ return F.relu(self.net(x) + x)
33
+
34
+
35
+ class GPT2InferenceModel(GPT2PreTrainedModel):
36
+ def __init__(self, config, gpt, text_pos_emb, embeddings, norm, linear):
37
+ super().__init__(config)
38
+ self.transformer = gpt
39
+ self.text_pos_embedding = text_pos_emb
40
+ self.embeddings = embeddings
41
+ self.lm_head = nn.Sequential(norm, linear)
42
+
43
+ # Model parallel
44
+ self.model_parallel = False
45
+ self.device_map = None
46
+ self.cached_mel_emb = None
47
+
48
+ def parallelize(self, device_map=None):
49
+ self.device_map = (
50
+ get_device_map(len(self.transformer.h), range(torch.cuda.device_count()))
51
+ if device_map is None
52
+ else device_map
53
+ )
54
+ assert_device_map(self.device_map, len(self.transformer.h))
55
+ self.transformer.parallelize(self.device_map)
56
+ self.lm_head = self.lm_head.to(self.transformer.first_device)
57
+ self.model_parallel = True
58
+
59
+ def deparallelize(self):
60
+ self.transformer.deparallelize()
61
+ self.transformer = self.transformer.to("cpu")
62
+ self.lm_head = self.lm_head.to("cpu")
63
+ self.model_parallel = False
64
+ torch.cuda.empty_cache()
65
+
66
+ def get_output_embeddings(self):
67
+ return self.lm_head
68
+
69
+ def set_output_embeddings(self, new_embeddings):
70
+ self.lm_head = new_embeddings
71
+
72
+ def store_mel_emb(self, mel_emb):
73
+ self.cached_mel_emb = mel_emb
74
+
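+ # Note (illustrative): during generate(), placeholder input ids stand in for the cached
+ # conditioning+text embedding stored here; forward() substitutes the cache for those
+ # positions, so only the trailing mel tokens are embedded normally.
+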
75
+ def prepare_inputs_for_generation(self, input_ids, past=None, **kwargs):
76
+
77
+ token_type_ids = kwargs.get("token_type_ids", None)
78
+ # only last token for inputs_ids if past is defined in kwargs
79
+ if past:
80
+ input_ids = input_ids[:, -1].unsqueeze(-1)
81
+ if token_type_ids is not None:
82
+ token_type_ids = token_type_ids[:, -1].unsqueeze(-1)
83
+
84
+ attention_mask = kwargs.get("attention_mask", None)
85
+ position_ids = kwargs.get("position_ids", None)
86
+
87
+ if attention_mask is not None and position_ids is None:
88
+ # create position_ids on the fly for batch generation
89
+ position_ids = attention_mask.long().cumsum(-1) - 1
90
+ position_ids.masked_fill_(attention_mask == 0, 1)
91
+ if past:
92
+ position_ids = position_ids[:, -1].unsqueeze(-1)
93
+ else:
94
+ position_ids = None
95
+ return {
96
+ "input_ids": input_ids,
97
+ "past_key_values": past,
98
+ "use_cache": kwargs.get("use_cache"),
99
+ "position_ids": position_ids,
100
+ "attention_mask": attention_mask,
101
+ "token_type_ids": token_type_ids,
102
+ }
103
+
104
+ def forward(
105
+ self,
106
+ input_ids=None,
107
+ past_key_values=None,
108
+ attention_mask=None,
109
+ token_type_ids=None,
110
+ position_ids=None,
111
+ head_mask=None,
112
+ inputs_embeds=None,
113
+ encoder_hidden_states=None,
114
+ encoder_attention_mask=None,
115
+ labels=None,
116
+ use_cache=None,
117
+ output_attentions=None,
118
+ output_hidden_states=None,
119
+ return_dict=None,
120
+ ):
121
+ assert self.cached_mel_emb is not None
122
+ assert inputs_embeds is None # Not supported by this inference model.
123
+ assert labels is None # Training not supported by this inference model.
124
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
125
+
126
+ # Create embedding
127
+ mel_len = self.cached_mel_emb.shape[1]
128
+ if input_ids.shape[1] != 1:
129
+ text_inputs = input_ids[:, mel_len:]
130
+ text_emb = self.embeddings(text_inputs)
131
+ text_emb = text_emb + self.text_pos_embedding(text_emb)
132
+ if self.cached_mel_emb.shape[0] != text_emb.shape[0]:
133
+ mel_emb = self.cached_mel_emb.repeat_interleave(text_emb.shape[0]//self.cached_mel_emb.shape[0], 0)
134
+ else:
135
+ mel_emb = self.cached_mel_emb
136
+ emb = torch.cat([mel_emb, text_emb], dim=1)
137
+ else:
138
+ emb = self.embeddings(input_ids)
139
+ emb = emb + self.text_pos_embedding.get_fixed_embedding(attention_mask.shape[1]-mel_len, attention_mask.device)
140
+
141
+ transformer_outputs = self.transformer(
142
+ inputs_embeds=emb,
143
+ past_key_values=past_key_values,
144
+ attention_mask=attention_mask,
145
+ token_type_ids=token_type_ids,
146
+ position_ids=position_ids,
147
+ head_mask=head_mask,
148
+ encoder_hidden_states=encoder_hidden_states,
149
+ encoder_attention_mask=encoder_attention_mask,
150
+ use_cache=use_cache,
151
+ output_attentions=output_attentions,
152
+ output_hidden_states=output_hidden_states,
153
+ return_dict=return_dict,
154
+ )
155
+ hidden_states = transformer_outputs[0]
156
+
157
+ # Set device for model parallelism
158
+ if self.model_parallel:
159
+ torch.cuda.set_device(self.transformer.first_device)
160
+ hidden_states = hidden_states.to(self.lm_head.weight.device)
161
+
162
+ lm_logits = self.lm_head(hidden_states)
163
+
164
+ if not return_dict:
165
+ return (lm_logits,) + transformer_outputs[1:]
166
+
167
+ return CausalLMOutputWithCrossAttentions(
168
+ loss=None,
169
+ logits=lm_logits,
170
+ past_key_values=transformer_outputs.past_key_values,
171
+ hidden_states=transformer_outputs.hidden_states,
172
+ attentions=transformer_outputs.attentions,
173
+ cross_attentions=transformer_outputs.cross_attentions,
174
+ )
175
+
176
+ @staticmethod
177
+ def _reorder_cache(past, beam_idx):
178
+ """
179
+ This function is used to re-order the :obj:`past_key_values` cache if
180
+ :meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is
181
+ called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
182
+ """
183
+ return tuple(
184
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
185
+ for layer_past in past
186
+ )
187
+
188
+
189
+ class ConditioningEncoder(nn.Module):
190
+ def __init__(self,
191
+ spec_dim,
192
+ embedding_dim,
193
+ attn_blocks=6,
194
+ num_attn_heads=4,
195
+ do_checkpointing=False,
196
+ mean=False):
197
+ super().__init__()
198
+ attn = []
199
+ self.init = nn.Conv1d(spec_dim, embedding_dim, kernel_size=1)
200
+ for a in range(attn_blocks):
201
+ attn.append(AttentionBlock(embedding_dim, num_attn_heads))
202
+ self.attn = nn.Sequential(*attn)
203
+ self.dim = embedding_dim
204
+ self.do_checkpointing = do_checkpointing
205
+ self.mean = mean
206
+
207
+ def forward(self, x):
208
+ h = self.init(x)
209
+ h = self.attn(h)
210
+ if self.mean:
211
+ return h.mean(dim=2)
212
+ else:
213
+ return h[:, :, 0]
214
+
215
+
216
+ class LearnedPositionEmbeddings(nn.Module):
217
+ def __init__(self, seq_len, model_dim, init=.02):
218
+ super().__init__()
219
+ self.emb = nn.Embedding(seq_len, model_dim)
220
+ # Initializing this way is standard for GPT-2
221
+ self.emb.weight.data.normal_(mean=0.0, std=init)
222
+
223
+ def forward(self, x):
224
+ sl = x.shape[1]
225
+ return self.emb(torch.arange(0, sl, device=x.device))
226
+
227
+ def get_fixed_embedding(self, ind, dev):
228
+ return self.emb(torch.tensor([ind], device=dev)).unsqueeze(0)
229
+
230
+
231
+ def build_hf_gpt_transformer(layers, model_dim, heads, max_mel_seq_len, max_text_seq_len, checkpointing):
232
+ """
233
+ GPT-2 implemented by the HuggingFace library.
234
+ """
235
+ from transformers import GPT2Config, GPT2Model
236
+ gpt_config = GPT2Config(vocab_size=256, # Unused.
237
+ n_positions=max_mel_seq_len+max_text_seq_len,
238
+ n_ctx=max_mel_seq_len+max_text_seq_len,
239
+ n_embd=model_dim,
240
+ n_layer=layers,
241
+ n_head=heads,
242
+ gradient_checkpointing=checkpointing,
243
+ use_cache=not checkpointing)
244
+ gpt = GPT2Model(gpt_config)
245
+ # Override the built in positional embeddings
246
+ del gpt.wpe
247
+ gpt.wpe = functools.partial(null_position_embeddings, dim=model_dim)
248
+ # Built-in token embeddings are unused.
249
+ del gpt.wte
250
+ return gpt, LearnedPositionEmbeddings(max_mel_seq_len, model_dim), LearnedPositionEmbeddings(max_text_seq_len, model_dim),\
251
+ None, None
252
+
253
+
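+ # Note (illustrative): after the override above, gpt.wpe(position_ids) returns zeros of shape
+ # (b, s, model_dim), so positional information comes entirely from the returned
+ # LearnedPositionEmbeddings, which the caller adds to the input embeddings itself.
+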
254
+ class MelEncoder(nn.Module):
255
+ def __init__(self, channels, mel_channels=80, resblocks_per_reduction=2):
256
+ super().__init__()
257
+ self.channels = channels
258
+ self.encoder = nn.Sequential(nn.Conv1d(mel_channels, channels//4, kernel_size=3, padding=1),
259
+ nn.Sequential(*[ResBlock(channels//4) for _ in range(resblocks_per_reduction)]),
260
+ nn.Conv1d(channels//4, channels//2, kernel_size=3, stride=2, padding=1),
261
+ nn.GroupNorm(channels//16, channels//2),
262
+ nn.ReLU(),
263
+ nn.Sequential(*[ResBlock(channels//2) for _ in range(resblocks_per_reduction)]),
264
+ nn.Conv1d(channels//2, channels, kernel_size=3, stride=2, padding=1),
265
+ nn.GroupNorm(channels//8, channels),
266
+ nn.ReLU(),
267
+ nn.Sequential(*[ResBlock(channels) for _ in range(resblocks_per_reduction)]),
268
+ )
269
+ self.reduction = 4
270
+
271
+
272
+ def forward(self, x):
273
+ for e in self.encoder:
274
+ x = e(x)
275
+ return x.permute(0,2,1)
276
+
277
+
278
+ class UnifiedVoice(nn.Module):
279
+ def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=120, max_mel_tokens=250, max_conditioning_inputs=1,
280
+ mel_length_compression=1024, number_text_tokens=256,
281
+ start_text_token=None, number_mel_codes=8194, start_mel_token=8192,
282
+ stop_mel_token=8193, train_solo_embeddings=False, use_mel_codes_as_input=True,
283
+ checkpointing=True, types=1):
284
+ """
285
+ Args:
286
+ layers: Number of layers in transformer stack.
287
+ model_dim: Operating dimensions of the transformer
288
+ heads: Number of transformer heads. Must be divisible by model_dim. Recommend model_dim//64
289
+ max_text_tokens: Maximum number of text tokens that will be encountered by model.
290
+ max_mel_tokens: Maximum number of MEL tokens that will be encountered by model.
291
+ max_conditioning_inputs: Maximum number of conditioning inputs provided to the model. If (1), conditioning input can be of format (b,80,s), otherwise (b,n,80,s).
292
+ mel_length_compression: The factor between <number_input_samples> and <mel_tokens>. Used to compute MEL code padding given wav input length.
293
+ number_text_tokens: The size of the text token vocabulary.
294
+ start_text_token: Token id used to start text sequences; defaults to number_text_tokens * types.
295
+ stop_text_token: Token id used to end text sequences (fixed to 0 in this implementation).
296
+ number_mel_codes: The size of the MEL code vocabulary.
297
+ start_mel_token: Token id used to start MEL code sequences.
298
+ stop_mel_token: Token id used to end MEL code sequences; also written over padding regions.
299
+ train_solo_embeddings: Whether to create trainable mel_solo_embedding/text_solo_embedding parameters.
300
+ use_mel_codes_as_input: If True, MEL inputs are discrete codes fed through an embedding; otherwise raw MELs are encoded with MelEncoder.
301
+ checkpointing: Whether to enable gradient checkpointing in the GPT-2 stack.
302
+ """
303
+ super().__init__()
304
+
305
+ self.number_text_tokens = number_text_tokens
306
+ self.start_text_token = number_text_tokens * types if start_text_token is None else start_text_token
307
+ self.stop_text_token = 0
308
+ self.number_mel_codes = number_mel_codes
309
+ self.start_mel_token = start_mel_token
310
+ self.stop_mel_token = stop_mel_token
311
+ self.layers = layers
312
+ self.heads = heads
313
+ self.max_mel_tokens = max_mel_tokens
314
+ self.max_text_tokens = max_text_tokens
315
+ self.model_dim = model_dim
316
+ self.max_conditioning_inputs = max_conditioning_inputs
317
+ self.mel_length_compression = mel_length_compression
318
+ self.conditioning_encoder = ConditioningEncoder(80, model_dim, num_attn_heads=heads)
319
+ self.text_embedding = nn.Embedding(self.number_text_tokens*types+1, model_dim)
320
+ if use_mel_codes_as_input:
321
+ self.mel_embedding = nn.Embedding(self.number_mel_codes, model_dim)
322
+ else:
323
+ self.mel_embedding = MelEncoder(model_dim, resblocks_per_reduction=1)
324
+ self.gpt, self.mel_pos_embedding, self.text_pos_embedding, self.mel_layer_pos_embedding, self.text_layer_pos_embedding = \
325
+ build_hf_gpt_transformer(layers, model_dim, heads, self.max_mel_tokens+2+self.max_conditioning_inputs, self.max_text_tokens+2, checkpointing)
326
+ if train_solo_embeddings:
327
+ self.mel_solo_embedding = nn.Parameter(torch.randn(1, 1, model_dim) * .02, requires_grad=True)
328
+ self.text_solo_embedding = nn.Parameter(torch.randn(1, 1, model_dim) * .02, requires_grad=True)
329
+ else:
330
+ self.mel_solo_embedding = 0
331
+ self.text_solo_embedding = 0
332
+
333
+ self.final_norm = nn.LayerNorm(model_dim)
334
+ self.text_head = nn.Linear(model_dim, self.number_text_tokens*types+1)
335
+ self.mel_head = nn.Linear(model_dim, self.number_mel_codes)
336
+
337
+ # Initialize the embeddings per the GPT-2 scheme
338
+ embeddings = [self.text_embedding]
339
+ if use_mel_codes_as_input:
340
+ embeddings.append(self.mel_embedding)
341
+ for module in embeddings:
342
+ module.weight.data.normal_(mean=0.0, std=.02)
343
+
344
+ def build_aligned_inputs_and_targets(self, input, start_token, stop_token):
345
+ inp = F.pad(input, (1,0), value=start_token)
346
+ tar = F.pad(input, (0,1), value=stop_token)
347
+ return inp, tar
348
+
349
+ def set_mel_padding(self, mel_input_tokens, wav_lengths):
350
+ """
351
+ Given mel tokens that are derived from a padded audio clip and the actual lengths of each batch element in
352
+ that audio clip, reformats the tokens with STOP_MEL_TOKEN in place of the zero padding. This is required
353
+ pre-formatting to create a working TTS model.
354
+ """
355
+ # Set padding areas within MEL (currently it is coded with the MEL code for <zero>).
356
+ mel_lengths = torch.div(wav_lengths, self.mel_length_compression, rounding_mode='trunc')
357
+ for b in range(len(mel_lengths)):
358
+ actual_end = mel_lengths[b] + 1 # Due to the convolutional nature of how these tokens are generated, it would be best if the model predicts a token past the actual last token.
359
+ if actual_end < mel_input_tokens.shape[-1]:
360
+ mel_input_tokens[b, actual_end:] = self.stop_mel_token
361
+ return mel_input_tokens
362
+
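+ # Worked example (illustrative): with mel_length_compression=1024 and wav_lengths=[4096, 8192],
+ # mel_lengths=[4, 8], so tokens at indices >=5 and >=9 respectively are overwritten with
+ # stop_mel_token, leaving one extra predicted position past each computed end.
+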
363
+ def get_logits(self, speech_conditioning_inputs, first_inputs, first_head, second_inputs=None, second_head=None, get_attns=False, return_latent=False):
364
+ if second_inputs is not None:
365
+ emb = torch.cat([speech_conditioning_inputs, first_inputs, second_inputs], dim=1)
366
+ else:
367
+ emb = torch.cat([speech_conditioning_inputs, first_inputs], dim=1)
368
+
369
+ gpt_out = self.gpt(inputs_embeds=emb, return_dict=True, output_attentions=get_attns)
370
+ if get_attns:
371
+ return gpt_out.attentions
372
+
373
+ enc = gpt_out.last_hidden_state[:, 1:] # The first logit is tied to the speech_conditioning_input
374
+ enc = self.final_norm(enc)
375
+
376
+ if return_latent:
377
+ return enc[:, speech_conditioning_inputs.shape[1]:speech_conditioning_inputs.shape[1]+first_inputs.shape[1]], enc[:, -second_inputs.shape[1]:]
378
+
379
+ first_logits = enc[:, :first_inputs.shape[1]]
380
+ first_logits = first_head(first_logits)
381
+ first_logits = first_logits.permute(0,2,1)
382
+ if second_inputs is not None:
383
+ second_logits = enc[:, -second_inputs.shape[1]:]
384
+ second_logits = second_head(second_logits)
385
+ second_logits = second_logits.permute(0,2,1)
386
+ return first_logits, second_logits
387
+ else:
388
+ return first_logits
389
+
390
+ def get_conditioning(self, speech_conditioning_input):
391
+ speech_conditioning_input = speech_conditioning_input.unsqueeze(1) if len(
392
+ speech_conditioning_input.shape) == 3 else speech_conditioning_input
393
+ conds = []
394
+ for j in range(speech_conditioning_input.shape[1]):
395
+ conds.append(self.conditioning_encoder(speech_conditioning_input[:, j]))
396
+ conds = torch.stack(conds, dim=1)
397
+ conds = conds.mean(dim=1)
398
+ return conds
399
+
400
+ def forward(self, speech_conditioning_latent, text_inputs, text_lengths, mel_codes, wav_lengths, types=None, text_first=True, raw_mels=None, return_attentions=False,
401
+ return_latent=False, clip_inputs=True):
402
+ """
403
+ Forward pass that uses both text and voice in either text conditioning mode or voice conditioning mode
404
+ (actuated by `text_first`).
405
+
406
+ speech_conditioning_latent: float tensor, (b,model_dim); see get_conditioning().
407
+ text_inputs: long tensor, (b,t)
408
+ text_lengths: long tensor, (b,)
409
+ mel_codes: long tensor, (b,m)
410
+ wav_lengths: long tensor, (b,)
411
+ raw_mels: MEL float tensor (b,80,s)
412
+
413
+ If return_attentions is specified, the attention maps (rather than logits) are returned.
414
+ If return_latent is specified, loss & logits are not computed or returned. Only the predicted latents are returned.
415
+ If clip_inputs is True, the inputs will be clipped to the smallest input size across each input modality.
416
+ """
417
+ # Types are expressed by expanding the text embedding space.
418
+ if types is not None:
419
+ text_inputs = text_inputs * (1+types).unsqueeze(-1)
420
+
421
+ if clip_inputs:
422
+ # This model will receive micro-batches with a ton of padding for both the text and MELs. Ameliorate this by
423
+ # chopping the inputs by the maximum actual length.
424
+ max_text_len = text_lengths.max()
425
+ text_inputs = text_inputs[:, :max_text_len]
426
+ max_mel_len = wav_lengths.max() // self.mel_length_compression
427
+ mel_codes = mel_codes[:, :max_mel_len]
428
+ if raw_mels is not None:
429
+ raw_mels = raw_mels[:, :, :max_mel_len*4]
430
+ mel_codes = self.set_mel_padding(mel_codes, wav_lengths)
431
+ text_inputs = F.pad(text_inputs, (0,1), value=self.stop_text_token)
432
+ mel_codes = F.pad(mel_codes, (0,1), value=self.stop_mel_token)
433
+
434
+ conds = speech_conditioning_latent.unsqueeze(1)
435
+ text_inputs, text_targets = self.build_aligned_inputs_and_targets(text_inputs, self.start_text_token, self.stop_text_token)
436
+ text_emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs)
437
+ mel_codes, mel_targets = self.build_aligned_inputs_and_targets(mel_codes, self.start_mel_token, self.stop_mel_token)
438
+ if raw_mels is not None:
439
+ mel_inp = F.pad(raw_mels, (0, 8))
440
+ else:
441
+ mel_inp = mel_codes
442
+ mel_emb = self.mel_embedding(mel_inp)
443
+ mel_emb = mel_emb + self.mel_pos_embedding(mel_codes)
444
+
445
+ if text_first:
446
+ text_logits, mel_logits = self.get_logits(conds, text_emb, self.text_head, mel_emb, self.mel_head, get_attns=return_attentions, return_latent=return_latent)
447
+ if return_latent:
448
+ return mel_logits[:, :-2] # Despite the name, these are not logits. Strip off the two tokens added by this forward pass.
449
+ else:
450
+ mel_logits, text_logits = self.get_logits(conds, mel_emb, self.mel_head, text_emb, self.text_head, get_attns=return_attentions, return_latent=return_latent)
451
+ if return_latent:
452
+ return text_logits[:, :-2] # Despite the name, these are not logits. Strip off the two tokens added by this forward pass.
453
+
454
+ if return_attentions:
455
+ return mel_logits
456
+ loss_text = F.cross_entropy(text_logits, text_targets.long())
457
+ loss_mel = F.cross_entropy(mel_logits, mel_targets.long())
458
+ return loss_text.mean(), loss_mel.mean(), mel_logits
459
+
460
+ def inference_speech(self, speech_conditioning_latent, text_inputs, input_tokens=None, num_return_sequences=1,
461
+ max_generate_length=None, typical_sampling=False, typical_mass=.9, **hf_generate_kwargs):
462
+ seq_length = self.max_mel_tokens + self.max_text_tokens + 2
463
+ if not hasattr(self, 'inference_model'):
464
+ # TODO: Decouple gpt_config from this inference model.
465
+ gpt_config = GPT2Config(vocab_size=self.max_mel_tokens,
466
+ n_positions=seq_length,
467
+ n_ctx=seq_length,
468
+ n_embd=self.model_dim,
469
+ n_layer=self.layers,
470
+ n_head=self.heads,
471
+ gradient_checkpointing=False,
472
+ use_cache=True)
473
+ self.inference_model = GPT2InferenceModel(gpt_config, self.gpt, self.mel_pos_embedding, self.mel_embedding, self.final_norm, self.mel_head)
474
+ self.gpt.wte = self.mel_embedding
475
+
476
+ text_inputs = F.pad(text_inputs, (0, 1), value=self.stop_text_token)
477
+ text_inputs, text_targets = self.build_aligned_inputs_and_targets(text_inputs, self.start_text_token, self.stop_text_token)
478
+ text_emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs)
479
+
480
+ conds = speech_conditioning_latent.unsqueeze(1)
481
+ emb = torch.cat([conds, text_emb], dim=1)
482
+ self.inference_model.store_mel_emb(emb)
483
+
484
+ fake_inputs = torch.full((emb.shape[0], conds.shape[1] + emb.shape[1],), fill_value=1, dtype=torch.long,
485
+ device=text_inputs.device)
486
+ fake_inputs[:, -1] = self.start_mel_token
487
+ trunc_index = fake_inputs.shape[1]
488
+ if input_tokens is None:
489
+ inputs = fake_inputs
490
+ else:
491
+ assert num_return_sequences % input_tokens.shape[0] == 0, "The number of return sequences must be divisible by the number of input sequences"
492
+ fake_inputs = fake_inputs.repeat(num_return_sequences, 1)
493
+ input_tokens = input_tokens.repeat(num_return_sequences // input_tokens.shape[0], 1)
494
+ inputs = torch.cat([fake_inputs, input_tokens], dim=1)
495
+
496
+ logits_processor = LogitsProcessorList([TypicalLogitsWarper(mass=typical_mass)]) if typical_sampling else LogitsProcessorList()
497
+ max_length = trunc_index + self.max_mel_tokens - 1 if max_generate_length is None else trunc_index + max_generate_length
498
+ gen = self.inference_model.generate(inputs, bos_token_id=self.start_mel_token, pad_token_id=self.stop_mel_token, eos_token_id=self.stop_mel_token,
499
+ max_length=max_length, logits_processor=logits_processor,
500
+ num_return_sequences=num_return_sequences, **hf_generate_kwargs)
501
+ return gen[:, trunc_index:]
502
+
503
+
504
+ if __name__ == '__main__':
505
+ gpt = UnifiedVoice(model_dim=256, heads=4, train_solo_embeddings=True, use_mel_codes_as_input=True, max_conditioning_inputs=4)
506
+ # The forward pass expects a precomputed conditioning latent (see get_conditioning()), and
+ # the old text_forward() smoke test no longer exists on this class.
+ cond_latent = gpt.get_conditioning(torch.randn(2, 3, 80, 800))
507
+ l = gpt(cond_latent,
508
+ torch.randint(high=120, size=(2,120)),
509
+ torch.tensor([32, 120]),
510
+ torch.randint(high=8192, size=(2,250)),
511
+ torch.tensor([250*256,195*256]))
tortoise/models/classifier.py ADDED
@@ -0,0 +1,157 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from torch.utils.checkpoint import checkpoint
4
+
5
+ from tortoise.models.arch_util import Upsample, Downsample, normalization, zero_module, AttentionBlock
6
+
7
+
8
+ class ResBlock(nn.Module):
9
+ def __init__(
10
+ self,
11
+ channels,
12
+ dropout,
13
+ out_channels=None,
14
+ use_conv=False,
15
+ use_scale_shift_norm=False,
16
+ dims=2,
17
+ up=False,
18
+ down=False,
19
+ kernel_size=3,
20
+ do_checkpoint=True,
21
+ ):
22
+ super().__init__()
23
+ self.channels = channels
24
+ self.dropout = dropout
25
+ self.out_channels = out_channels or channels
26
+ self.use_conv = use_conv
27
+ self.use_scale_shift_norm = use_scale_shift_norm
28
+ self.do_checkpoint = do_checkpoint
29
+ padding = 1 if kernel_size == 3 else 2
30
+
31
+ self.in_layers = nn.Sequential(
32
+ normalization(channels),
33
+ nn.SiLU(),
34
+ nn.Conv1d(channels, self.out_channels, kernel_size, padding=padding),
35
+ )
36
+
37
+ self.updown = up or down
38
+
39
+ if up:
40
+ self.h_upd = Upsample(channels, False)  # the 1-D Upsample/Downsample take (channels, use_conv), not a dims argument
41
+ self.x_upd = Upsample(channels, False)
42
+ elif down:
43
+ self.h_upd = Downsample(channels, False)
44
+ self.x_upd = Downsample(channels, False)
45
+ else:
46
+ self.h_upd = self.x_upd = nn.Identity()
47
+
48
+ self.out_layers = nn.Sequential(
49
+ normalization(self.out_channels),
50
+ nn.SiLU(),
51
+ nn.Dropout(p=dropout),
52
+ zero_module(
53
+ nn.Conv1d(self.out_channels, self.out_channels, kernel_size, padding=padding)
54
+ ),
55
+ )
56
+
57
+ if self.out_channels == channels:
58
+ self.skip_connection = nn.Identity()
59
+ elif use_conv:
60
+ self.skip_connection = nn.Conv1d(
61
+ dims, channels, self.out_channels, kernel_size, padding=padding
62
+ )
63
+ else:
64
+ self.skip_connection = nn.Conv1d(dims, channels, self.out_channels, 1)
65
+
66
+ def forward(self, x):
67
+ if self.do_checkpoint:
68
+ return checkpoint(
69
+ self._forward, x
70
+ )
71
+ else:
72
+ return self._forward(x)
73
+
74
+ def _forward(self, x):
75
+ if self.updown:
76
+ in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
77
+ h = in_rest(x)
78
+ h = self.h_upd(h)
79
+ x = self.x_upd(x)
80
+ h = in_conv(h)
81
+ else:
82
+ h = self.in_layers(x)
83
+ h = self.out_layers(h)
84
+ return self.skip_connection(x) + h
85
+
86
+
87
+ class AudioMiniEncoder(nn.Module):
88
+ def __init__(self,
89
+ spec_dim,
90
+ embedding_dim,
91
+ base_channels=128,
92
+ depth=2,
93
+ resnet_blocks=2,
94
+ attn_blocks=4,
95
+ num_attn_heads=4,
96
+ dropout=0,
97
+ downsample_factor=2,
98
+ kernel_size=3):
99
+ super().__init__()
100
+ self.init = nn.Sequential(
101
+ nn.Conv1d(spec_dim, base_channels, 3, padding=1)
102
+ )
103
+ ch = base_channels
104
+ res = []
105
+ self.layers = depth
106
+ for l in range(depth):
107
+ for r in range(resnet_blocks):
108
+ res.append(ResBlock(ch, dropout, do_checkpoint=False, kernel_size=kernel_size))
109
+ res.append(Downsample(ch, use_conv=True, out_channels=ch*2, factor=downsample_factor))
110
+ ch *= 2
111
+ self.res = nn.Sequential(*res)
112
+ self.final = nn.Sequential(
113
+ normalization(ch),
114
+ nn.SiLU(),
115
+ nn.Conv1d(ch, embedding_dim, 1)
116
+ )
117
+ attn = []
118
+ for a in range(attn_blocks):
119
+ attn.append(AttentionBlock(embedding_dim, num_attn_heads, do_checkpoint=False))
120
+ self.attn = nn.Sequential(*attn)
121
+ self.dim = embedding_dim
122
+
123
+ def forward(self, x):
124
+ h = self.init(x)
125
+ h = self.res(h)
126
+ h = self.final(h)
127
+ for blk in self.attn:
128
+ h = checkpoint(blk, h)
129
+ return h[:, :, 0]
130
+
131
+
132
+ class AudioMiniEncoderWithClassifierHead(nn.Module):
133
+ def __init__(self, classes, distribute_zero_label=True, **kwargs):
134
+ super().__init__()
135
+ self.enc = AudioMiniEncoder(**kwargs)
136
+ self.head = nn.Linear(self.enc.dim, classes)
137
+ self.num_classes = classes
138
+ self.distribute_zero_label = distribute_zero_label
139
+
140
+ def forward(self, x, labels=None):
141
+ h = self.enc(x)
142
+ logits = self.head(h)
143
+ if labels is None:
144
+ return logits
145
+ else:
146
+ if self.distribute_zero_label:
147
+ oh_labels = nn.functional.one_hot(labels, num_classes=self.num_classes)
148
+ zeros_indices = (labels == 0).unsqueeze(-1)
149
+ # Distribute 20% of the probability mass on all classes when zero is specified, to compensate for dataset noise.
150
+ zero_extra_mass = torch.full_like(oh_labels, dtype=torch.float, fill_value=.2/(self.num_classes-1))
151
+ zero_extra_mass[:, 0] = -.2
152
+ zero_extra_mass = zero_extra_mass * zeros_indices
153
+ oh_labels = oh_labels + zero_extra_mass
154
+ else:
155
+ oh_labels = labels
156
+ loss = nn.functional.cross_entropy(logits, oh_labels)
157
+ return loss
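+
+ # Worked example (illustrative): with num_classes=5 and label 0, the one-hot [1,0,0,0,0]
+ # becomes [.8,.05,.05,.05,.05] -- 20% of the probability mass is spread over the other classes.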
tortoise/models/clvp.py ADDED
@@ -0,0 +1,155 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ from torch import einsum
5
+
6
+ from tortoise.models.arch_util import CheckpointedXTransformerEncoder
7
+ from tortoise.models.transformer import Transformer
8
+ from tortoise.models.xtransformers import Encoder
9
+
10
+
11
+ def exists(val):
12
+ return val is not None
13
+
14
+
15
+ def masked_mean(t, mask, dim = 1):
16
+ t = t.masked_fill(~mask[:, :, None], 0.)
17
+ return t.sum(dim = 1) / mask.sum(dim = 1)[..., None]
18
+
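+ # Shape note (illustrative): masked_mean maps t (b, s, d) with mask (b, s) to (b, d),
+ # averaging only over unmasked positions.
+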
19
+ class CLVP(nn.Module):
20
+ """
21
+ CLIP model retrofitted for performing contrastive evaluation between tokenized audio data and the corresponding
22
+ transcribed text.
23
+
24
+ Originally from https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/dalle_pytorch.py
25
+ """
26
+
27
+ def __init__(
28
+ self,
29
+ *,
30
+ dim_text=512,
31
+ dim_speech=512,
32
+ dim_latent=512,
33
+ num_text_tokens=256,
34
+ text_enc_depth=6,
35
+ text_seq_len=120,
36
+ text_heads=8,
37
+ num_speech_tokens=8192,
38
+ speech_enc_depth=6,
39
+ speech_heads=8,
40
+ speech_seq_len=250,
41
+ text_mask_percentage=0,
42
+ voice_mask_percentage=0,
43
+ wav_token_compression=1024,
44
+ use_xformers=False,
45
+ ):
46
+ super().__init__()
47
+ self.text_emb = nn.Embedding(num_text_tokens, dim_text)
48
+ self.to_text_latent = nn.Linear(dim_text, dim_latent, bias=False)
49
+
50
+ self.speech_emb = nn.Embedding(num_speech_tokens, dim_speech)
51
+ self.to_speech_latent = nn.Linear(dim_speech, dim_latent, bias=False)
52
+
53
+ if use_xformers:
54
+ self.text_transformer = CheckpointedXTransformerEncoder(
55
+ needs_permute=False,
56
+ exit_permute=False,
57
+ max_seq_len=-1,
58
+ attn_layers=Encoder(
59
+ dim=dim_text,
60
+ depth=text_enc_depth,
61
+ heads=text_heads,
62
+ ff_dropout=.1,
63
+ ff_mult=2,
64
+ attn_dropout=.1,
65
+ use_rmsnorm=True,
66
+ ff_glu=True,
67
+ rotary_pos_emb=True,
68
+ ))
69
+ self.speech_transformer = CheckpointedXTransformerEncoder(
70
+ needs_permute=False,
71
+ exit_permute=False,
72
+ max_seq_len=-1,
73
+ attn_layers=Encoder(
74
+ dim=dim_speech,
75
+ depth=speech_enc_depth,
76
+ heads=speech_heads,
77
+ ff_dropout=.1,
78
+ ff_mult=2,
79
+ attn_dropout=.1,
80
+ use_rmsnorm=True,
81
+ ff_glu=True,
82
+ rotary_pos_emb=True,
83
+ ))
84
+ else:
85
+ self.text_transformer = Transformer(causal=False, seq_len=text_seq_len, dim=dim_text, depth=text_enc_depth,
86
+ heads=text_heads)
87
+ self.speech_transformer = Transformer(causal=False, seq_len=speech_seq_len, dim=dim_speech,
88
+ depth=speech_enc_depth, heads=speech_heads)
89
+
90
+ self.temperature = nn.Parameter(torch.tensor(1.))
91
+ self.text_mask_percentage = text_mask_percentage
92
+ self.voice_mask_percentage = voice_mask_percentage
93
+ self.wav_token_compression = wav_token_compression
94
+ self.xformers = use_xformers
95
+ if not use_xformers:
96
+ self.text_pos_emb = nn.Embedding(text_seq_len, dim_text)
97
+ self.speech_pos_emb = nn.Embedding(num_speech_tokens, dim_speech)
98
+
99
+ def forward(
100
+ self,
101
+ text,
102
+ speech_tokens,
103
+ return_loss=False
104
+ ):
105
+ b, device = text.shape[0], text.device
106
+ if self.training:
107
+ text_mask = torch.rand_like(text.float()) > self.text_mask_percentage
108
+ voice_mask = torch.rand_like(speech_tokens.float()) > self.voice_mask_percentage
109
+ else:
110
+ text_mask = torch.ones_like(text.float()).bool()
111
+ voice_mask = torch.ones_like(speech_tokens.float()).bool()
112
+
113
+ text_emb = self.text_emb(text)
114
+ speech_emb = self.speech_emb(speech_tokens)
115
+
116
+ if not self.xformers:
117
+ text_emb += self.text_pos_emb(torch.arange(text.shape[1], device=device))
118
+ speech_emb += self.speech_pos_emb(torch.arange(speech_emb.shape[1], device=device))
119
+
120
+ enc_text = self.text_transformer(text_emb, mask=text_mask)
121
+ enc_speech = self.speech_transformer(speech_emb, mask=voice_mask)
122
+
123
+ text_latents = masked_mean(enc_text, text_mask, dim=1)
124
+ speech_latents = masked_mean(enc_speech, voice_mask, dim=1)
125
+
126
+ text_latents = self.to_text_latent(text_latents)
127
+ speech_latents = self.to_speech_latent(speech_latents)
128
+
129
+ text_latents, speech_latents = map(lambda t: F.normalize(t, p=2, dim=-1), (text_latents, speech_latents))
130
+
131
+ temp = self.temperature.exp()
132
+
133
+ if not return_loss:
134
+ sim = einsum('n d, n d -> n', text_latents, speech_latents) * temp
135
+ return sim
136
+
137
+ sim = einsum('i d, j d -> i j', text_latents, speech_latents) * temp
138
+ labels = torch.arange(b, device=device)
139
+ loss = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
140
+ return loss
141
+
142
+
143
+ if __name__ == '__main__':
144
+ clip = CLVP(text_mask_percentage=.2, voice_mask_percentage=.2)
145
+ # CLVP.forward() takes only (text, speech_tokens, return_loss); the old length tensors are gone.
+ clip(torch.randint(0,256,(2,120)),
146
+ torch.randint(0,8192,(2,250)),
147
+ return_loss=True)
148
+ nonloss = clip(torch.randint(0,256,(2,120)),
149
+ torch.randint(0,8192,(2,250)),
150
+ return_loss=False)
155
+ print(nonloss.shape)
tortoise/models/cvvp.py ADDED
@@ -0,0 +1,133 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ from torch import einsum
5
+ from torch.utils.checkpoint import checkpoint
6
+
7
+ from tortoise.models.arch_util import AttentionBlock
8
+ from tortoise.models.xtransformers import ContinuousTransformerWrapper, Encoder
9
+
10
+
11
+ def exists(val):
12
+ return val is not None
13
+
14
+
15
+ def masked_mean(t, mask):
16
+ t = t.masked_fill(~mask, 0.)
17
+ return t.sum(dim = 1) / mask.sum(dim = 1)
18
+
19
+
20
+ class CollapsingTransformer(nn.Module):
21
+ def __init__(self, model_dim, output_dims, heads, dropout, depth, mask_percentage=0, **encoder_kwargs):
22
+ super().__init__()
23
+ self.transformer = ContinuousTransformerWrapper(
24
+ max_seq_len=-1,
25
+ use_pos_emb=False,
26
+ attn_layers=Encoder(
27
+ dim=model_dim,
28
+ depth=depth,
29
+ heads=heads,
30
+ ff_dropout=dropout,
31
+ ff_mult=1,
32
+ attn_dropout=dropout,
33
+ use_rmsnorm=True,
34
+ ff_glu=True,
35
+ rotary_pos_emb=True,
36
+ **encoder_kwargs,
37
+ ))
38
+ self.pre_combiner = nn.Sequential(nn.Conv1d(model_dim, output_dims, 1),
39
+ AttentionBlock(output_dims, num_heads=heads, do_checkpoint=False),
40
+ nn.Conv1d(output_dims, output_dims, 1))
41
+ self.mask_percentage = mask_percentage
42
+
43
+ def forward(self, x, **transformer_kwargs):
44
+ h = self.transformer(x, **transformer_kwargs)
45
+ h = h.permute(0,2,1)
46
+ h = checkpoint(self.pre_combiner, h).permute(0,2,1)
47
+ if self.training:
48
+ mask = torch.rand_like(h.float()) > self.mask_percentage
49
+ else:
50
+ mask = torch.ones_like(h.float()).bool()
51
+ return masked_mean(h, mask)
52
+
53
+
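+ # Shape sketch (illustrative): (b, s, model_dim) -> transformer -> (b, s, model_dim) ->
+ # pre_combiner -> (b, s, output_dims) -> masked mean over the time axis -> (b, output_dims),
+ # i.e. the sequence is "collapsed" to a single vector per batch element.
+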
54
+ class ConvFormatEmbedding(nn.Module):
55
+ def __init__(self, *args, **kwargs):
56
+ super().__init__()
57
+ self.emb = nn.Embedding(*args, **kwargs)
58
+
59
+ def forward(self, x):
60
+ y = self.emb(x)
61
+ return y.permute(0,2,1)
62
+
63
+
64
+ class CVVP(nn.Module):
65
+ def __init__(
66
+ self,
67
+ model_dim=512,
68
+ transformer_heads=8,
69
+ dropout=.1,
70
+ conditioning_enc_depth=8,
71
+ cond_mask_percentage=0,
72
+ mel_channels=80,
73
+ mel_codes=None,
74
+ speech_enc_depth=8,
75
+ speech_mask_percentage=0,
76
+ latent_multiplier=1,
77
+ ):
78
+ super().__init__()
79
+ latent_dim = latent_multiplier*model_dim
80
+ self.temperature = nn.Parameter(torch.tensor(1.))
81
+
82
+ self.cond_emb = nn.Sequential(nn.Conv1d(mel_channels, model_dim//2, kernel_size=5, stride=2, padding=2),
83
+ nn.Conv1d(model_dim//2, model_dim, kernel_size=3, stride=2, padding=1))
84
+ self.conditioning_transformer = CollapsingTransformer(model_dim, model_dim, transformer_heads, dropout, conditioning_enc_depth, cond_mask_percentage)
85
+ self.to_conditioning_latent = nn.Linear(latent_dim, latent_dim, bias=False)
86
+
87
+ if mel_codes is None:
88
+ self.speech_emb = nn.Conv1d(mel_channels, model_dim, kernel_size=5, padding=2)
89
+ else:
90
+ self.speech_emb = ConvFormatEmbedding(mel_codes, model_dim)
91
+ self.speech_transformer = CollapsingTransformer(model_dim, latent_dim, transformer_heads, dropout, speech_enc_depth, speech_mask_percentage)
92
+ self.to_speech_latent = nn.Linear(latent_dim, latent_dim, bias=False)
93
+
94
+ def get_grad_norm_parameter_groups(self):
95
+ return {
96
+ 'conditioning': list(self.conditioning_transformer.parameters()),
97
+ 'speech': list(self.speech_transformer.parameters()),
98
+ }
99
+
100
+ def forward(
101
+ self,
102
+ mel_cond,
103
+ mel_input,
104
+ return_loss=False
105
+ ):
106
+ cond_emb = self.cond_emb(mel_cond).permute(0,2,1)
107
+ enc_cond = self.conditioning_transformer(cond_emb)
108
+ cond_latents = self.to_conditioning_latent(enc_cond)
109
+
110
+ speech_emb = self.speech_emb(mel_input).permute(0,2,1)
111
+ enc_speech = self.speech_transformer(speech_emb)
112
+ speech_latents = self.to_speech_latent(enc_speech)
113
+
114
+
115
+ cond_latents, speech_latents = map(lambda t: F.normalize(t, p=2, dim=-1), (cond_latents, speech_latents))
116
+ temp = self.temperature.exp()
117
+
118
+ if not return_loss:
119
+ sim = einsum('n d, n d -> n', cond_latents, speech_latents) * temp
120
+ return sim
121
+
122
+ sim = einsum('i d, j d -> i j', cond_latents, speech_latents) * temp
123
+ labels = torch.arange(cond_latents.shape[0], device=mel_input.device)
124
+ loss = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
125
+
126
+ return loss
127
+
128
+
129
+ if __name__ == '__main__':
130
+ cvvp = CVVP()
131
+ cvvp(torch.randn(2,80,100),
132
+ torch.randn(2,80,95),
133
+ return_loss=True)
tortoise/models/diffusion_decoder.py ADDED
@@ -0,0 +1,333 @@
1
+ import math
2
+ import random
3
+ from abc import abstractmethod
4
+
5
+ import torch
6
+ import torch.nn as nn
7
+ import torch.nn.functional as F
8
+ from torch import autocast
9
+
10
+ from tortoise.models.arch_util import normalization, AttentionBlock
11
+
12
+
13
+ def is_latent(t):
14
+ return t.dtype == torch.float
15
+
16
+
17
+ def is_sequence(t):
18
+ return t.dtype == torch.long
19
+
20
+
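+ # Note (illustrative): these dtype checks dispatch the decoder's two conditioning modes --
+ # float tensors are treated as autoregressive latents, long tensors as discrete mel codes.
+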
21
+ def timestep_embedding(timesteps, dim, max_period=10000):
22
+ """
23
+ Create sinusoidal timestep embeddings.
24
+
25
+ :param timesteps: a 1-D Tensor of N indices, one per batch element.
26
+ These may be fractional.
27
+ :param dim: the dimension of the output.
28
+ :param max_period: controls the minimum frequency of the embeddings.
29
+ :return: an [N x dim] Tensor of positional embeddings.
30
+ """
31
+ half = dim // 2
32
+ freqs = torch.exp(
33
+ -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
34
+ ).to(device=timesteps.device)
35
+ args = timesteps[:, None].float() * freqs[None]
36
+ embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
37
+ if dim % 2:
38
+ embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
39
+ return embedding
40
+
41
+
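+ # Usage sketch (illustrative): timestep_embedding(torch.tensor([0, 250, 999]), dim=512)
+ # returns a (3, 512) tensor whose first 256 columns are cosines and last 256 are sines over
+ # frequencies spaced geometrically between 1 and 1/max_period.
+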
42
+ class TimestepBlock(nn.Module):
43
+ @abstractmethod
44
+ def forward(self, x, emb):
45
+ """
46
+ Apply the module to `x` given `emb` timestep embeddings.
47
+ """
48
+
49
+
50
+ class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
51
+ def forward(self, x, emb):
52
+ for layer in self:
53
+ if isinstance(layer, TimestepBlock):
54
+ x = layer(x, emb)
55
+ else:
56
+ x = layer(x)
57
+ return x
58
+
59
+
60
+ class ResBlock(TimestepBlock):
61
+ def __init__(
62
+ self,
63
+ channels,
64
+ emb_channels,
65
+ dropout,
66
+ out_channels=None,
67
+ dims=2,
68
+ kernel_size=3,
69
+ efficient_config=True,
70
+ use_scale_shift_norm=False,
71
+ ):
72
+ super().__init__()
73
+ self.channels = channels
74
+ self.emb_channels = emb_channels
75
+ self.dropout = dropout
76
+ self.out_channels = out_channels or channels
77
+ self.use_scale_shift_norm = use_scale_shift_norm
78
+ padding = {1: 0, 3: 1, 5: 2}[kernel_size]
79
+ eff_kernel = 1 if efficient_config else 3
80
+ eff_padding = 0 if efficient_config else 1
81
+
82
+ self.in_layers = nn.Sequential(
83
+ normalization(channels),
84
+ nn.SiLU(),
85
+ nn.Conv1d(channels, self.out_channels, eff_kernel, padding=eff_padding),
86
+ )
87
+
88
+ self.emb_layers = nn.Sequential(
89
+ nn.SiLU(),
90
+ nn.Linear(
91
+ emb_channels,
92
+ 2 * self.out_channels if use_scale_shift_norm else self.out_channels,
93
+ ),
94
+ )
95
+ self.out_layers = nn.Sequential(
96
+ normalization(self.out_channels),
97
+ nn.SiLU(),
98
+ nn.Dropout(p=dropout),
99
+ nn.Conv1d(self.out_channels, self.out_channels, kernel_size, padding=padding),
100
+ )
101
+
102
+ if self.out_channels == channels:
103
+ self.skip_connection = nn.Identity()
104
+ else:
105
+ self.skip_connection = nn.Conv1d(channels, self.out_channels, eff_kernel, padding=eff_padding)
106
+
107
+ def forward(self, x, emb):
108
+ h = self.in_layers(x)
109
+ emb_out = self.emb_layers(emb).type(h.dtype)
110
+ while len(emb_out.shape) < len(h.shape):
111
+ emb_out = emb_out[..., None]
112
+ if self.use_scale_shift_norm:
113
+ out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
114
+ scale, shift = torch.chunk(emb_out, 2, dim=1)
115
+ h = out_norm(h) * (1 + scale) + shift
116
+ h = out_rest(h)
117
+ else:
118
+ h = h + emb_out
119
+ h = self.out_layers(h)
120
+ return self.skip_connection(x) + h
121
+
122
+
123
+ class DiffusionLayer(TimestepBlock):
124
+ def __init__(self, model_channels, dropout, num_heads):
125
+ super().__init__()
126
+ self.resblk = ResBlock(model_channels, model_channels, dropout, model_channels, dims=1, use_scale_shift_norm=True)
127
+ self.attn = AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True)
128
+
129
+ def forward(self, x, time_emb):
130
+ y = self.resblk(x, time_emb)
131
+ return self.attn(y)
132
+
133
+
134
+ class DiffusionTts(nn.Module):
135
+ def __init__(
136
+ self,
137
+ model_channels=512,
138
+ num_layers=8,
139
+ in_channels=100,
140
+ in_latent_channels=512,
141
+ in_tokens=8193,
142
+ out_channels=200, # mean and variance
143
+ dropout=0,
144
+ use_fp16=False,
145
+ num_heads=16,
146
+ # Parameters for regularization.
147
+ layer_drop=.1,
148
+ unconditioned_percentage=.1, # This implements a mechanism similar to what is used in classifier-free training.
149
+ ):
150
+ super().__init__()
151
+
152
+ self.in_channels = in_channels
153
+ self.model_channels = model_channels
154
+ self.out_channels = out_channels
155
+ self.dropout = dropout
156
+ self.num_heads = num_heads
157
+ self.unconditioned_percentage = unconditioned_percentage
158
+ self.enable_fp16 = use_fp16
159
+ self.layer_drop = layer_drop
160
+
161
+ self.inp_block = nn.Conv1d(in_channels, model_channels, 3, 1, 1)
162
+ self.time_embed = nn.Sequential(
163
+ nn.Linear(model_channels, model_channels),
164
+ nn.SiLU(),
165
+ nn.Linear(model_channels, model_channels),
166
+ )
167
+
168
+ # Either code_converter or latent_conditioner is used, depending on what type of conditioning data is fed.
169
+ # This model is meant to be able to be trained on both for efficiency purposes - it is far less computationally
170
+ # complex to generate tokens, while generating latents will normally mean propagating through a deep autoregressive
171
+ # transformer network.
172
+ self.code_embedding = nn.Embedding(in_tokens, model_channels)
173
+ self.code_converter = nn.Sequential(
174
+ AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
175
+ AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
176
+ AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
177
+ )
178
+ self.code_norm = normalization(model_channels)
179
+ self.latent_conditioner = nn.Sequential(
180
+ nn.Conv1d(in_latent_channels, model_channels, 3, padding=1),
181
+ AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
182
+ AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
183
+ AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
184
+ AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True),
185
+ )
186
+ self.contextual_embedder = nn.Sequential(nn.Conv1d(in_channels,model_channels,3,padding=1,stride=2),
187
+ nn.Conv1d(model_channels, model_channels*2,3,padding=1,stride=2),
188
+ AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False),
189
+ AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False),
190
+ AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False),
191
+ AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False),
192
+ AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False))
193
+ self.unconditioned_embedding = nn.Parameter(torch.randn(1,model_channels,1))
194
+ self.conditioning_timestep_integrator = TimestepEmbedSequential(
195
+ DiffusionLayer(model_channels, dropout, num_heads),
196
+ DiffusionLayer(model_channels, dropout, num_heads),
197
+ DiffusionLayer(model_channels, dropout, num_heads),
198
+ )
199
+
200
+ self.integrating_conv = nn.Conv1d(model_channels*2, model_channels, kernel_size=1)
201
+ self.mel_head = nn.Conv1d(model_channels, in_channels, kernel_size=3, padding=1)
202
+
203
+ self.layers = nn.ModuleList([DiffusionLayer(model_channels, dropout, num_heads) for _ in range(num_layers)] +
204
+ [ResBlock(model_channels, model_channels, dropout, dims=1, use_scale_shift_norm=True) for _ in range(3)])
205
+
206
+ self.out = nn.Sequential(
207
+ normalization(model_channels),
208
+ nn.SiLU(),
209
+ nn.Conv1d(model_channels, out_channels, 3, padding=1),
210
+ )
211
+
212
+ def get_grad_norm_parameter_groups(self):
213
+ groups = {
214
+ 'minicoder': list(self.contextual_embedder.parameters()),
215
+ 'layers': list(self.layers.parameters()),
216
+ 'code_converters': list(self.code_embedding.parameters()) + list(self.code_converter.parameters()) + list(self.latent_conditioner.parameters()),
217
+ 'timestep_integrator': list(self.conditioning_timestep_integrator.parameters()) + list(self.integrating_conv.parameters()),
218
+ 'time_embed': list(self.time_embed.parameters()),
219
+ }
220
+ return groups
221
+
222
+ def get_conditioning(self, conditioning_input):
223
+ speech_conditioning_input = conditioning_input.unsqueeze(1) if len(
224
+ conditioning_input.shape) == 3 else conditioning_input
225
+ conds = []
226
+ for j in range(speech_conditioning_input.shape[1]):
227
+ conds.append(self.contextual_embedder(speech_conditioning_input[:, j]))
228
+ conds = torch.cat(conds, dim=-1)
229
+ conds = conds.mean(dim=-1)
230
+ return conds
231
+
232
+ def timestep_independent(self, aligned_conditioning, conditioning_latent, expected_seq_len, return_code_pred):
233
+ # Shuffle aligned_latent to BxCxS format
234
+ if is_latent(aligned_conditioning):
235
+ aligned_conditioning = aligned_conditioning.permute(0, 2, 1)
236
+
237
+ cond_scale, cond_shift = torch.chunk(conditioning_latent, 2, dim=1)
238
+ if is_latent(aligned_conditioning):
239
+ code_emb = self.latent_conditioner(aligned_conditioning)
240
+ else:
241
+ code_emb = self.code_embedding(aligned_conditioning).permute(0, 2, 1)
242
+ code_emb = self.code_converter(code_emb)
243
+ code_emb = self.code_norm(code_emb) * (1 + cond_scale.unsqueeze(-1)) + cond_shift.unsqueeze(-1)
244
+
245
+ unconditioned_batches = torch.zeros((code_emb.shape[0], 1, 1), device=code_emb.device)
246
+ # Mask out the conditioning branch for whole batch elements, implementing something similar to classifier-free guidance.
247
+ if self.training and self.unconditioned_percentage > 0:
248
+ unconditioned_batches = torch.rand((code_emb.shape[0], 1, 1),
249
+ device=code_emb.device) < self.unconditioned_percentage
250
+ code_emb = torch.where(unconditioned_batches, self.unconditioned_embedding.repeat(aligned_conditioning.shape[0], 1, 1),
251
+ code_emb)
252
+ expanded_code_emb = F.interpolate(code_emb, size=expected_seq_len, mode='nearest')
253
+
254
+ if not return_code_pred:
255
+ return expanded_code_emb
256
+ else:
257
+ mel_pred = self.mel_head(expanded_code_emb)
258
+ # Multiply mel_pred by the logical-not of unconditioned_batches, which drops the gradient on unconditioned branches. This is because we don't want that gradient being used to train parameters through code_embedding, as it unbalances contributions to that network from the MSE loss.
259
+ mel_pred = mel_pred * unconditioned_batches.logical_not()
260
+ return expanded_code_emb, mel_pred
261
+
262
+ def forward(self, x, timesteps, aligned_conditioning=None, conditioning_latent=None, precomputed_aligned_embeddings=None, conditioning_free=False, return_code_pred=False):
263
+ """
264
+ Apply the model to an input batch.
265
+
266
+ :param x: an [N x C x ...] Tensor of inputs.
267
+ :param timesteps: a 1-D batch of timesteps.
268
+ :param aligned_conditioning: an aligned latent or sequence of tokens providing useful data about the sample to be produced.
269
+ :param conditioning_latent: a pre-computed conditioning latent; see get_conditioning().
270
+ :param precomputed_aligned_embeddings: Embeddings returned from self.timestep_independent()
271
+ :param conditioning_free: When set, all conditioning inputs (including tokens and conditioning_input) will not be considered.
272
+ :return: an [N x C x ...] Tensor of outputs.
273
+ """
274
+ assert precomputed_aligned_embeddings is not None or (aligned_conditioning is not None and conditioning_latent is not None)
275
+ assert not (return_code_pred and precomputed_aligned_embeddings is not None) # These two are mutually exclusive.
276
+
277
+ unused_params = []
278
+ if conditioning_free:
279
+ code_emb = self.unconditioned_embedding.repeat(x.shape[0], 1, x.shape[-1])
280
+ unused_params.extend(list(self.code_converter.parameters()) + list(self.code_embedding.parameters()))
281
+ unused_params.extend(list(self.latent_conditioner.parameters()))
282
+ else:
283
+ if precomputed_aligned_embeddings is not None:
284
+ code_emb = precomputed_aligned_embeddings
285
+ else:
286
+ code_emb, mel_pred = self.timestep_independent(aligned_conditioning, conditioning_latent, x.shape[-1], True)
287
+ if is_latent(aligned_conditioning):
288
+ unused_params.extend(list(self.code_converter.parameters()) + list(self.code_embedding.parameters()))
289
+ else:
290
+ unused_params.extend(list(self.latent_conditioner.parameters()))
291
+
292
+ unused_params.append(self.unconditioned_embedding)
293
+
294
+ time_emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))
295
+ code_emb = self.conditioning_timestep_integrator(code_emb, time_emb)
296
+ x = self.inp_block(x)
297
+ x = torch.cat([x, code_emb], dim=1)
298
+ x = self.integrating_conv(x)
299
+ for i, lyr in enumerate(self.layers):
300
+ # Do layer drop where applicable. Do not drop first and last layers.
301
+ if self.training and self.layer_drop > 0 and i != 0 and i != (len(self.layers)-1) and random.random() < self.layer_drop:
302
+ unused_params.extend(list(lyr.parameters()))
303
+ else:
304
+ # The first block runs with autocast disabled, for improved precision.
305
+ with autocast(x.device.type, enabled=self.enable_fp16 and i != 0):
306
+ x = lyr(x, time_emb)
307
+
308
+ x = x.float()
309
+ out = self.out(x)
310
+
311
+ # Involve probabilistic or possibly unused parameters in loss so we don't get DDP errors.
312
+ extraneous_addition = 0
313
+ for p in unused_params:
314
+ extraneous_addition = extraneous_addition + p.mean()
315
+ out = out + extraneous_addition * 0
316
+
317
+ if return_code_pred:
318
+ return out, mel_pred
319
+ return out
320
+
321
+
322
+ if __name__ == '__main__':
323
+ clip = torch.randn(2, 100, 400)
324
+ aligned_latent = torch.randn(2,388,512)
325
+ aligned_sequence = torch.randint(0,8192,(2,100))
326
+ cond = torch.randn(2, 100, 400)
327
+ ts = torch.LongTensor([600, 600])
328
+ model = DiffusionTts(512, layer_drop=.3, unconditioned_percentage=.5)
329
+ # Test with latent aligned conditioning
330
+ #o = model(clip, ts, aligned_latent, cond)
331
+ # Test with sequence aligned conditioning
332
+ o = model(clip, ts, aligned_sequence, cond)
333
+
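The `conditioning_free` path above is what makes classifier-free guidance possible at sampling time: the model is queried once with the real code embeddings and once with the learned unconditioned embedding, and the two predictions are mixed. A minimal sketch of that mixing, assuming a `DiffusionTts` instance and a `code_emb` precomputed via `timestep_independent()`; the `cfg_scale` knob and the function name are hypothetical, and the repo's actual sampler lives in its diffusion utilities:

```python
import torch

@torch.no_grad()
def guided_prediction(model, x, t, code_emb, cfg_scale=2.0):
    # Conditioned pass: uses the precomputed aligned embeddings.
    cond = model(x, t, precomputed_aligned_embeddings=code_emb)
    # Unconditioned pass: conditioning_free=True swaps in the learned
    # unconditioned embedding; code_emb only satisfies the input assert.
    uncond = model(x, t, precomputed_aligned_embeddings=code_emb,
                   conditioning_free=True)
    # Push the prediction away from the unconditioned baseline.
    return uncond + cfg_scale * (cond - uncond)
```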
tortoise/models/random_latent_generator.py ADDED
@@ -0,0 +1,55 @@
1
+ import math
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+
7
+
8
+ def fused_leaky_relu(input, bias=None, negative_slope=0.2, scale=2 ** 0.5):
9
+ if bias is not None:
10
+ rest_dim = [1] * (input.ndim - bias.ndim - 1)
11
+ return (
12
+ F.leaky_relu(
13
+ input + bias.view(1, bias.shape[0], *rest_dim), negative_slope=negative_slope
14
+ )
15
+ * scale
16
+ )
17
+ else:
18
+ return F.leaky_relu(input, negative_slope=negative_slope) * scale
19
+
20
+
21
+ class EqualLinear(nn.Module):
22
+ def __init__(
23
+ self, in_dim, out_dim, bias=True, bias_init=0, lr_mul=1
24
+ ):
25
+ super().__init__()
26
+ self.weight = nn.Parameter(torch.randn(out_dim, in_dim).div_(lr_mul))
27
+ if bias:
28
+ self.bias = nn.Parameter(torch.zeros(out_dim).fill_(bias_init))
29
+ else:
30
+ self.bias = None
31
+ self.scale = (1 / math.sqrt(in_dim)) * lr_mul
32
+ self.lr_mul = lr_mul
33
+
34
+ def forward(self, input):
35
+ out = F.linear(input, self.weight * self.scale)
36
+ out = fused_leaky_relu(out, self.bias * self.lr_mul if self.bias is not None else None)
37
+ return out
38
+
39
+
40
+ class RandomLatentConverter(nn.Module):
41
+ def __init__(self, channels):
42
+ super().__init__()
43
+ self.layers = nn.Sequential(*[EqualLinear(channels, channels, lr_mul=.1) for _ in range(5)],
44
+ nn.Linear(channels, channels))
45
+ self.channels = channels
46
+
47
+ def forward(self, ref):
48
+ r = torch.randn(ref.shape[0], self.channels, device=ref.device)
49
+ y = self.layers(r)
50
+ return y
51
+
52
+
53
+ if __name__ == '__main__':
54
+ model = RandomLatentConverter(512)
55
+ model(torch.randn(5,512))
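`RandomLatentConverter` maps fresh Gaussian noise to the distribution of conditioning latents, which is how Tortoise can synthesize an entirely random voice without any reference clips. A minimal sketch, assuming the 512-channel size from the `__main__` test above (only the shape and device of `ref` are read; its values are ignored):

```python
import torch

converter = RandomLatentConverter(512)
ref = torch.zeros(1, 512)      # placeholder; supplies batch size and device
voice_latent = converter(ref)  # (1, 512) latent drawn from fresh noise
```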
tortoise/models/transformer.py ADDED
@@ -0,0 +1,219 @@
1
+ from functools import partial
2
+
3
+ import torch
4
+ import torch.nn.functional as F
5
+ from einops import rearrange
6
+ from rotary_embedding_torch import RotaryEmbedding, broadcat
7
+ from torch import nn
8
+
9
+
10
+ # helpers
11
+
12
+
13
+ def exists(val):
14
+ return val is not None
15
+
16
+
17
+ def default(val, d):
18
+ return val if exists(val) else d
19
+
20
+
21
+ def cast_tuple(val, depth = 1):
22
+ if isinstance(val, list):
23
+ val = tuple(val)
24
+ return val if isinstance(val, tuple) else (val,) * depth
25
+
26
+
27
+ def max_neg_value(t):
28
+ return -torch.finfo(t.dtype).max
29
+
30
+
31
+ def stable_softmax(t, dim = -1, alpha = 32 ** 2):
32
+ t = t / alpha
33
+ t = t - torch.amax(t, dim = dim, keepdim = True).detach()
34
+ return (t * alpha).softmax(dim = dim)
35
+
36
+
37
+ def route_args(router, args, depth):
38
+ routed_args = [(dict(), dict()) for _ in range(depth)]
39
+ matched_keys = [key for key in args.keys() if key in router]
40
+
41
+ for key in matched_keys:
42
+ val = args[key]
43
+ for ind, ((f_args, g_args), routes) in enumerate(zip(routed_args, router[key])):
44
+ new_f_args, new_g_args = map(lambda route: ({key: val} if route else {}), routes)
45
+ routed_args[ind] = ({**f_args, **new_f_args}, {**g_args, **new_g_args})
46
+ return routed_args
47
+
48
+
49
+ # classes
50
+ class SequentialSequence(nn.Module):
51
+ def __init__(self, layers, args_route = {}, layer_dropout = 0.):
52
+ super().__init__()
53
+ assert all(len(route) == len(layers) for route in args_route.values()), 'each argument route map must have the same depth as the number of sequential layers'
54
+ self.layers = layers
55
+ self.args_route = args_route
56
+ self.layer_dropout = layer_dropout
57
+
58
+ def forward(self, x, **kwargs):
59
+ args = route_args(self.args_route, kwargs, len(self.layers))
60
+ layers_and_args = list(zip(self.layers, args))
61
+
62
+ for (f, g), (f_args, g_args) in layers_and_args:
63
+ x = x + f(x, **f_args)
64
+ x = x + g(x, **g_args)
65
+ return x
66
+
67
+
68
+ class DivideMax(nn.Module):
69
+ def __init__(self, dim):
70
+ super().__init__()
71
+ self.dim = dim
72
+
73
+ def forward(self, x):
74
+ maxes = x.amax(dim = self.dim, keepdim = True).detach()
75
+ return x / maxes
76
+
77
+
78
+ # https://arxiv.org/abs/2103.17239
79
+ class LayerScale(nn.Module):
80
+ def __init__(self, dim, depth, fn):
81
+ super().__init__()
82
+ if depth <= 18:
83
+ init_eps = 0.1
84
+ elif depth > 18 and depth <= 24:
85
+ init_eps = 1e-5
86
+ else:
87
+ init_eps = 1e-6
88
+
89
+ scale = torch.zeros(1, 1, dim).fill_(init_eps)
90
+ self.scale = nn.Parameter(scale)
91
+ self.fn = fn
92
+ def forward(self, x, **kwargs):
93
+ return self.fn(x, **kwargs) * self.scale
94
+
95
+ # layer norm
96
+
97
+
98
+ class PreNorm(nn.Module):
99
+ def __init__(self, dim, fn, sandwich = False):
100
+ super().__init__()
101
+ self.norm = nn.LayerNorm(dim)
102
+ self.norm_out = nn.LayerNorm(dim) if sandwich else nn.Identity()
103
+ self.fn = fn
104
+
105
+ def forward(self, x, **kwargs):
106
+ x = self.norm(x)
107
+ x = self.fn(x, **kwargs)
108
+ return self.norm_out(x)
109
+
110
+ # feed forward
111
+
112
+
113
+ class GEGLU(nn.Module):
114
+ def forward(self, x):
115
+ x, gates = x.chunk(2, dim = -1)
116
+ return x * F.gelu(gates)
117
+
118
+
119
+ class FeedForward(nn.Module):
120
+ def __init__(self, dim, dropout = 0., mult = 4):  # integer mult keeps the nn.Linear sizes integral
121
+ super().__init__()
122
+ self.net = nn.Sequential(
123
+ nn.Linear(dim, dim * mult * 2),
124
+ GEGLU(),
125
+ nn.Dropout(dropout),
126
+ nn.Linear(dim * mult, dim)
127
+ )
128
+
129
+ def forward(self, x):
130
+ return self.net(x)
131
+
132
+ # Attention
133
+
134
+
135
+ class Attention(nn.Module):
136
+ def __init__(self, dim, seq_len, causal = True, heads = 8, dim_head = 64, dropout = 0.):
137
+ super().__init__()
138
+ inner_dim = dim_head * heads
139
+ self.heads = heads
140
+ self.seq_len = seq_len
141
+ self.scale = dim_head ** -0.5
142
+
143
+ self.causal = causal
144
+
145
+ self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
146
+ self.to_out = nn.Sequential(
147
+ nn.Linear(inner_dim, dim),
148
+ nn.Dropout(dropout)
149
+ )
150
+
151
+ def forward(self, x, mask = None):
152
+ b, n, _, h, device = *x.shape, self.heads, x.device
153
+ softmax = torch.softmax
154
+
155
+ qkv = self.to_qkv(x).chunk(3, dim = -1)
156
+ q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)
157
+
158
+ q = q * self.scale
159
+
160
+ dots = torch.einsum('b h i d, b h j d -> b h i j', q, k)
161
+ mask_value = max_neg_value(dots)
162
+
163
+ if exists(mask):
164
+ mask = rearrange(mask, 'b j -> b () () j')
165
+ dots.masked_fill_(~mask, mask_value)
166
+ del mask
167
+
168
+ if self.causal:
169
+ i, j = dots.shape[-2:]
170
+ mask = torch.ones(i, j, device = device).triu_(j - i + 1).bool()
171
+ dots.masked_fill_(mask, mask_value)
172
+
173
+ attn = softmax(dots, dim=-1)
174
+
175
+ out = torch.einsum('b h i j, b h j d -> b h i d', attn, v)
176
+ out = rearrange(out, 'b h n d -> b n (h d)')
177
+ out = self.to_out(out)
178
+ return out
179
+
180
+
181
+ # main transformer class
182
+ class Transformer(nn.Module):
183
+ def __init__(
184
+ self,
185
+ *,
186
+ dim,
187
+ depth,
188
+ seq_len,
189
+ causal = True,
190
+ heads = 8,
191
+ dim_head = 64,
192
+ ff_mult = 4,
193
+ attn_dropout = 0.,
194
+ ff_dropout = 0.,
195
+ sparse_attn = False,
196
+ sandwich_norm = False,
197
+ ):
198
+ super().__init__()
199
+ layers = nn.ModuleList([])
200
+ sparse_layer = cast_tuple(sparse_attn, depth)
201
+
202
+ for ind, sparse_attn in zip(range(depth), sparse_layer):
203
+ attn = Attention(dim, causal = causal, seq_len = seq_len, heads = heads, dim_head = dim_head, dropout = attn_dropout)
204
+
205
+ ff = FeedForward(dim, mult = ff_mult, dropout = ff_dropout)
206
+
207
+ layers.append(nn.ModuleList([
208
+ LayerScale(dim, ind + 1, PreNorm(dim, attn, sandwich = sandwich_norm)),
209
+ LayerScale(dim, ind + 1, PreNorm(dim, ff, sandwich = sandwich_norm))
210
+ ]))
211
+
212
+ execute_type = SequentialSequence
213
+ route_attn = ((True, False),) * depth
214
+ attn_route_map = {'mask': route_attn}
215
+
216
+ self.layers = execute_type(layers, args_route = attn_route_map)
217
+
218
+ def forward(self, x, **kwargs):
219
+ return self.layers(x, **kwargs)
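For reference, a standalone forward pass through this `Transformer`, with hypothetical sizes (the real call sites elsewhere in the repo choose their own `dim`/`depth`/`seq_len`):

```python
import torch

model = Transformer(dim=512, depth=4, seq_len=128, heads=8)
x = torch.randn(2, 128, 512)                 # (batch, seq, dim)
mask = torch.ones(2, 128, dtype=torch.bool)  # routed to the attention layers
y = model(x, mask=mask)                      # (2, 128, 512)
```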
tortoise/models/vocoder.py ADDED
@@ -0,0 +1,325 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+
5
+ MAX_WAV_VALUE = 32768.0
6
+
7
+ class KernelPredictor(torch.nn.Module):
8
+ ''' Kernel predictor for the location-variable convolutions'''
9
+
10
+ def __init__(
11
+ self,
12
+ cond_channels,
13
+ conv_in_channels,
14
+ conv_out_channels,
15
+ conv_layers,
16
+ conv_kernel_size=3,
17
+ kpnet_hidden_channels=64,
18
+ kpnet_conv_size=3,
19
+ kpnet_dropout=0.0,
20
+ kpnet_nonlinear_activation="LeakyReLU",
21
+ kpnet_nonlinear_activation_params={"negative_slope": 0.1},
22
+ ):
23
+ '''
24
+ Args:
25
+ cond_channels (int): number of channels in the conditioning sequence,
26
+ conv_in_channels (int): number of channels in the input sequence,
27
+ conv_out_channels (int): number of channels in the output sequence,
28
+ conv_layers (int): number of layers
29
+ '''
30
+ super().__init__()
31
+
32
+ self.conv_in_channels = conv_in_channels
33
+ self.conv_out_channels = conv_out_channels
34
+ self.conv_kernel_size = conv_kernel_size
35
+ self.conv_layers = conv_layers
36
+
37
+ kpnet_kernel_channels = conv_in_channels * conv_out_channels * conv_kernel_size * conv_layers # l_w
38
+ kpnet_bias_channels = conv_out_channels * conv_layers # l_b
39
+
40
+ self.input_conv = nn.Sequential(
41
+ nn.utils.weight_norm(nn.Conv1d(cond_channels, kpnet_hidden_channels, 5, padding=2, bias=True)),
42
+ getattr(nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
43
+ )
44
+
45
+ self.residual_convs = nn.ModuleList()
46
+ padding = (kpnet_conv_size - 1) // 2
47
+ for _ in range(3):
48
+ self.residual_convs.append(
49
+ nn.Sequential(
50
+ nn.Dropout(kpnet_dropout),
51
+ nn.utils.weight_norm(
52
+ nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding,
53
+ bias=True)),
54
+ getattr(nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
55
+ nn.utils.weight_norm(
56
+ nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding,
57
+ bias=True)),
58
+ getattr(nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params),
59
+ )
60
+ )
61
+ self.kernel_conv = nn.utils.weight_norm(
62
+ nn.Conv1d(kpnet_hidden_channels, kpnet_kernel_channels, kpnet_conv_size, padding=padding, bias=True))
63
+ self.bias_conv = nn.utils.weight_norm(
64
+ nn.Conv1d(kpnet_hidden_channels, kpnet_bias_channels, kpnet_conv_size, padding=padding, bias=True))
65
+
66
+ def forward(self, c):
67
+ '''
68
+ Args:
69
+ c (Tensor): the conditioning sequence (batch, cond_channels, cond_length)
70
+ '''
71
+ batch, _, cond_length = c.shape
72
+ c = self.input_conv(c)
73
+ for residual_conv in self.residual_convs:
74
+ residual_conv.to(c.device)
75
+ c = c + residual_conv(c)
76
+ k = self.kernel_conv(c)
77
+ b = self.bias_conv(c)
78
+ kernels = k.contiguous().view(
79
+ batch,
80
+ self.conv_layers,
81
+ self.conv_in_channels,
82
+ self.conv_out_channels,
83
+ self.conv_kernel_size,
84
+ cond_length,
85
+ )
86
+ bias = b.contiguous().view(
87
+ batch,
88
+ self.conv_layers,
89
+ self.conv_out_channels,
90
+ cond_length,
91
+ )
92
+
93
+ return kernels, bias
94
+
95
+ def remove_weight_norm(self):
96
+ nn.utils.remove_weight_norm(self.input_conv[0])
97
+ nn.utils.remove_weight_norm(self.kernel_conv)
98
+ nn.utils.remove_weight_norm(self.bias_conv)
99
+ for block in self.residual_convs:
100
+ nn.utils.remove_weight_norm(block[1])
101
+ nn.utils.remove_weight_norm(block[3])
102
+
103
+
104
+ class LVCBlock(torch.nn.Module):
105
+ '''the location-variable convolutions'''
106
+
107
+ def __init__(
108
+ self,
109
+ in_channels,
110
+ cond_channels,
111
+ stride,
112
+ dilations=[1, 3, 9, 27],
113
+ lReLU_slope=0.2,
114
+ conv_kernel_size=3,
115
+ cond_hop_length=256,
116
+ kpnet_hidden_channels=64,
117
+ kpnet_conv_size=3,
118
+ kpnet_dropout=0.0,
119
+ ):
120
+ super().__init__()
121
+
122
+ self.cond_hop_length = cond_hop_length
123
+ self.conv_layers = len(dilations)
124
+ self.conv_kernel_size = conv_kernel_size
125
+
126
+ self.kernel_predictor = KernelPredictor(
127
+ cond_channels=cond_channels,
128
+ conv_in_channels=in_channels,
129
+ conv_out_channels=2 * in_channels,
130
+ conv_layers=len(dilations),
131
+ conv_kernel_size=conv_kernel_size,
132
+ kpnet_hidden_channels=kpnet_hidden_channels,
133
+ kpnet_conv_size=kpnet_conv_size,
134
+ kpnet_dropout=kpnet_dropout,
135
+ kpnet_nonlinear_activation_params={"negative_slope": lReLU_slope}
136
+ )
137
+
138
+ self.convt_pre = nn.Sequential(
139
+ nn.LeakyReLU(lReLU_slope),
140
+ nn.utils.weight_norm(nn.ConvTranspose1d(in_channels, in_channels, 2 * stride, stride=stride,
141
+ padding=stride // 2 + stride % 2, output_padding=stride % 2)),
142
+ )
143
+
144
+ self.conv_blocks = nn.ModuleList()
145
+ for dilation in dilations:
146
+ self.conv_blocks.append(
147
+ nn.Sequential(
148
+ nn.LeakyReLU(lReLU_slope),
149
+ nn.utils.weight_norm(nn.Conv1d(in_channels, in_channels, conv_kernel_size,
150
+ padding=dilation * (conv_kernel_size - 1) // 2, dilation=dilation)),
151
+ nn.LeakyReLU(lReLU_slope),
152
+ )
153
+ )
154
+
155
+ def forward(self, x, c):
156
+ ''' forward propagation of the location-variable convolutions.
157
+ Args:
158
+ x (Tensor): the input sequence (batch, in_channels, in_length)
159
+ c (Tensor): the conditioning sequence (batch, cond_channels, cond_length)
160
+
161
+ Returns:
162
+ Tensor: the output sequence (batch, in_channels, in_length)
163
+ '''
164
+ _, in_channels, _ = x.shape # (B, c_g, L')
165
+
166
+ x = self.convt_pre(x) # (B, c_g, stride * L')
167
+ kernels, bias = self.kernel_predictor(c)
168
+
169
+ for i, conv in enumerate(self.conv_blocks):
170
+ output = conv(x) # (B, c_g, stride * L')
171
+
172
+ k = kernels[:, i, :, :, :, :] # (B, 2 * c_g, c_g, kernel_size, cond_length)
173
+ b = bias[:, i, :, :] # (B, 2 * c_g, cond_length)
174
+
175
+ output = self.location_variable_convolution(output, k, b,
176
+ hop_size=self.cond_hop_length) # (B, 2 * c_g, stride * L'): LVC
177
+ x = x + torch.sigmoid(output[:, :in_channels, :]) * torch.tanh(
178
+ output[:, in_channels:, :]) # (B, c_g, stride * L'): GAU
179
+
180
+ return x
181
+
182
+ def location_variable_convolution(self, x, kernel, bias, dilation=1, hop_size=256):
183
+ ''' Perform the location-variable convolution operation on the input sequence (x) using the local convolution kernel.
184
+ Time: 414 μs ± 309 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each), test on NVIDIA V100.
185
+ Args:
186
+ x (Tensor): the input sequence (batch, in_channels, in_length).
187
+ kernel (Tensor): the local convolution kernel (batch, in_channel, out_channels, kernel_size, kernel_length)
188
+ bias (Tensor): the bias for the local convolution (batch, out_channels, kernel_length)
189
+ dilation (int): the dilation of convolution.
190
+ hop_size (int): the hop_size of the conditioning sequence.
191
+ Returns:
192
+ (Tensor): the output sequence after performing local convolution. (batch, out_channels, in_length).
193
+ '''
194
+ batch, _, in_length = x.shape
195
+ batch, _, out_channels, kernel_size, kernel_length = kernel.shape
196
+ assert in_length == (kernel_length * hop_size), "length of (x, kernel) is not matched"
197
+
198
+ padding = dilation * int((kernel_size - 1) / 2)
199
+ x = F.pad(x, (padding, padding), 'constant', 0) # (batch, in_channels, in_length + 2*padding)
200
+ x = x.unfold(2, hop_size + 2 * padding, hop_size) # (batch, in_channels, kernel_length, hop_size + 2*padding)
201
+
202
+ if hop_size < dilation:
203
+ x = F.pad(x, (0, dilation), 'constant', 0)
204
+ x = x.unfold(3, dilation,
205
+ dilation) # (batch, in_channels, kernel_length, (hop_size + 2*padding)/dilation, dilation)
206
+ x = x[:, :, :, :, :hop_size]
207
+ x = x.transpose(3, 4) # (batch, in_channels, kernel_length, dilation, (hop_size + 2*padding)/dilation)
208
+ x = x.unfold(4, kernel_size, 1) # (batch, in_channels, kernel_length, dilation, _, kernel_size)
209
+
210
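+ # einsum index key: b=batch, i=in_channels, o=out_channels,
+ # l=kernel_length (conditioning frames), d=dilation, s=samples per
+ # frame, k=kernel tap. A different predicted kernel is applied at
+ # every conditioning frame l.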
+ o = torch.einsum('bildsk,biokl->bolsd', x, kernel)
211
+ o = o.to(memory_format=torch.channels_last_3d)
212
+ bias = bias.unsqueeze(-1).unsqueeze(-1).to(memory_format=torch.channels_last_3d)
213
+ o = o + bias
214
+ o = o.contiguous().view(batch, out_channels, -1)
215
+
216
+ return o
217
+
218
+ def remove_weight_norm(self):
219
+ self.kernel_predictor.remove_weight_norm()
220
+ nn.utils.remove_weight_norm(self.convt_pre[1])
221
+ for block in self.conv_blocks:
222
+ nn.utils.remove_weight_norm(block[1])
223
+
224
+
225
+ class UnivNetGenerator(nn.Module):
226
+ """UnivNet Generator"""
227
+
228
+ def __init__(self, noise_dim=64, channel_size=32, dilations=[1,3,9,27], strides=[8,8,4], lReLU_slope=.2, kpnet_conv_size=3,
229
+ # Below are MEL configurations options that this generator requires.
230
+ hop_length=256, n_mel_channels=100):
231
+ super(UnivNetGenerator, self).__init__()
232
+ self.mel_channel = n_mel_channels
233
+ self.noise_dim = noise_dim
234
+ self.hop_length = hop_length
237
+
238
+ self.res_stack = nn.ModuleList()
239
+ hop_length = 1
240
+ for stride in strides:
241
+ hop_length = stride * hop_length
242
+ self.res_stack.append(
243
+ LVCBlock(
244
+ channel_size,
245
+ n_mel_channels,
246
+ stride=stride,
247
+ dilations=dilations,
248
+ lReLU_slope=lReLU_slope,
249
+ cond_hop_length=hop_length,
250
+ kpnet_conv_size=kpnet_conv_size
251
+ )
252
+ )
253
+
254
+ self.conv_pre = \
255
+ nn.utils.weight_norm(nn.Conv1d(noise_dim, channel_size, 7, padding=3, padding_mode='reflect'))
256
+
257
+ self.conv_post = nn.Sequential(
258
+ nn.LeakyReLU(lReLU_slope),
259
+ nn.utils.weight_norm(nn.Conv1d(channel_size, 1, 7, padding=3, padding_mode='reflect')),
260
+ nn.Tanh(),
261
+ )
262
+
263
+ def forward(self, c, z):
264
+ '''
265
+ Args:
266
+ c (Tensor): the conditioning sequence of mel-spectrogram (batch, mel_channels, in_length)
267
+ z (Tensor): the noise sequence (batch, noise_dim, in_length)
268
+
269
+ '''
270
+ z = self.conv_pre(z) # (B, c_g, L)
271
+
272
+ for res_block in self.res_stack:
273
+ res_block.to(z.device)
274
+ z = res_block(z, c) # (B, c_g, L * s_0 * ... * s_i)
275
+
276
+ z = self.conv_post(z) # (B, 1, L * 256)
277
+
278
+ return z
279
+
280
+ def eval(self, inference=False):
281
+ super(UnivNetGenerator, self).eval()
282
+ # don't remove weight norm while validation in training loop
283
+ if inference:
284
+ self.remove_weight_norm()
285
+
286
+ def remove_weight_norm(self):
287
+ print('Removing weight norm...')
288
+
289
+ nn.utils.remove_weight_norm(self.conv_pre)
290
+
291
+ for layer in self.conv_post:
292
+ if len(layer.state_dict()) != 0:
293
+ nn.utils.remove_weight_norm(layer)
294
+
295
+ for res_block in self.res_stack:
296
+ res_block.remove_weight_norm()
297
+
298
+ def inference(self, c, z=None):
299
+ # pad input mel with zeros to cut artifact
300
+ # see https://github.com/seungwonpark/melgan/issues/8
301
+ zero = torch.full((c.shape[0], self.mel_channel, 10), -11.5129).to(c.device)
302
+ mel = torch.cat((c, zero), dim=2)
303
+
304
+ if z is None:
305
+ z = torch.randn(c.shape[0], self.noise_dim, mel.size(2)).to(mel.device)
306
+
307
+ audio = self.forward(mel, z)
308
+ audio = audio[:, :, :-(self.hop_length * 10)]
309
+ audio = audio.clamp(min=-1, max=1)
310
+ return audio
311
+
312
+
313
+ if __name__ == '__main__':
314
+ model = UnivNetGenerator()
315
+
316
+ c = torch.randn(3, 100, 10)
317
+ z = torch.randn(3, 64, 10)
318
+ print(c.shape)
319
+
320
+ y = model(c, z)
321
+ print(y.shape)
322
+ assert y.shape == torch.Size([3, 1, 2560])
323
+
324
+ pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
325
+ print(pytorch_total_params)
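And a minimal inference sketch for the generator above; the 10-frame pad at -11.5129 (roughly log(1e-5), i.e. near-silence) and the 256x hop come from the file itself, while the mel input here is random stand-in data:

```python
import torch

vocoder = UnivNetGenerator()
vocoder.eval(inference=True)      # also strips weight norm
mel = torch.randn(1, 100, 80)     # (batch, n_mel_channels, frames)
with torch.no_grad():
    wav = vocoder.inference(mel)  # (1, 1, frames * 256)
```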
tortoise/models/xtransformers.py ADDED
@@ -0,0 +1,1252 @@
1
+ import functools
2
+ import math
3
+ import torch
4
+ from torch import nn, einsum
5
+ import torch.nn.functional as F
6
+ from functools import partial
7
+ from inspect import isfunction
8
+ from collections import namedtuple
9
+
10
+ from einops import rearrange, repeat, reduce
11
+ from einops.layers.torch import Rearrange
12
+
13
+ from torch.utils.checkpoint import checkpoint
14
+
15
+ DEFAULT_DIM_HEAD = 64
16
+
17
+ Intermediates = namedtuple('Intermediates', [
18
+ 'pre_softmax_attn',
19
+ 'post_softmax_attn'
20
+ ])
21
+
22
+ LayerIntermediates = namedtuple('LayerIntermediates', [
23
+ 'hiddens',
24
+ 'attn_intermediates',
25
+ 'past_key_values',
26
+ ])
27
+
28
+
29
+ # helpers
30
+
31
+ def exists(val):
32
+ return val is not None
33
+
34
+
35
+ def default(val, d):
36
+ if exists(val):
37
+ return val
38
+ return d() if isfunction(d) else d
39
+
40
+
41
+ def cast_tuple(val, depth):
42
+ return val if isinstance(val, tuple) else (val,) * depth
43
+
44
+
45
+ class always():
46
+ def __init__(self, val):
47
+ self.val = val
48
+
49
+ def __call__(self, *args, **kwargs):
50
+ return self.val
51
+
52
+
53
+ class not_equals():
54
+ def __init__(self, val):
55
+ self.val = val
56
+
57
+ def __call__(self, x, *args, **kwargs):
58
+ return x != self.val
59
+
60
+
61
+ class equals():
62
+ def __init__(self, val):
63
+ self.val = val
64
+
65
+ def __call__(self, x, *args, **kwargs):
66
+ return x == self.val
67
+
68
+
69
+ def max_neg_value(tensor):
70
+ return -torch.finfo(tensor.dtype).max
71
+
72
+
73
+ def l2norm(t):
74
+ return F.normalize(t, p=2, dim=-1)
75
+
76
+
77
+ # init helpers
78
+
79
+ def init_zero_(layer):
80
+ nn.init.constant_(layer.weight, 0.)
81
+ if exists(layer.bias):
82
+ nn.init.constant_(layer.bias, 0.)
83
+
84
+
85
+ # keyword argument helpers
86
+
87
+ def pick_and_pop(keys, d):
88
+ values = list(map(lambda key: d.pop(key), keys))
89
+ return dict(zip(keys, values))
90
+
91
+
92
+ def group_dict_by_key(cond, d):
93
+ return_val = [dict(), dict()]
94
+ for key in d.keys():
95
+ match = bool(cond(key))
96
+ ind = int(not match)
97
+ return_val[ind][key] = d[key]
98
+ return (*return_val,)
99
+
100
+
101
+ def string_begins_with(prefix, s):  # `s` avoids shadowing the builtin str
102
+ return s.startswith(prefix)
103
+
104
+
105
+ def group_by_key_prefix(prefix, d):
106
+ return group_dict_by_key(partial(string_begins_with, prefix), d)
107
+
108
+
109
+ def groupby_prefix_and_trim(prefix, d):
110
+ kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d)
111
+ kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
112
+ return kwargs_without_prefix, kwargs
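+ # Example of the prefix convention used throughout this file:
+ # groupby_prefix_and_trim('attn_', {'attn_dropout': 0.1, 'dim': 8})
+ # -> ({'dropout': 0.1}, {'dim': 8}). AttentionLayers uses this to split
+ # its **kwargs between Attention ('attn_') and FeedForward ('ff_').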
113
+
114
+
115
+ # activations
116
+
117
+ class ReluSquared(nn.Module):
118
+ def forward(self, x):
119
+ return F.relu(x) ** 2
120
+
121
+
122
+ # positional embeddings
123
+
124
+ class AbsolutePositionalEmbedding(nn.Module):
125
+ def __init__(self, dim, max_seq_len):
126
+ super().__init__()
127
+ self.scale = dim ** -0.5
128
+ self.emb = nn.Embedding(max_seq_len, dim)
129
+
130
+ def forward(self, x):
131
+ n = torch.arange(x.shape[1], device=x.device)
132
+ pos_emb = self.emb(n)
133
+ pos_emb = rearrange(pos_emb, 'n d -> () n d')
134
+ return pos_emb * self.scale
135
+
136
+
137
+ class FixedPositionalEmbedding(nn.Module):
138
+ def __init__(self, dim):
139
+ super().__init__()
140
+ inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim))
141
+ self.register_buffer('inv_freq', inv_freq)
142
+
143
+ def forward(self, x, seq_dim=1, offset=0):
144
+ t = torch.arange(x.shape[seq_dim], device=x.device).type_as(self.inv_freq) + offset
145
+ sinusoid_inp = torch.einsum('i , j -> i j', t, self.inv_freq)
146
+ emb = torch.cat((sinusoid_inp.sin(), sinusoid_inp.cos()), dim=-1)
147
+ return rearrange(emb, 'n d -> () n d')
148
+
149
+
150
+ class RelativePositionBias(nn.Module):
151
+ def __init__(self, scale, causal=False, num_buckets=32, max_distance=128, heads=8):
152
+ super().__init__()
153
+ self.scale = scale
154
+ self.causal = causal
155
+ self.num_buckets = num_buckets
156
+ self.max_distance = max_distance
157
+ self.relative_attention_bias = nn.Embedding(num_buckets, heads)
158
+
159
+ @staticmethod
160
+ def _relative_position_bucket(relative_position, causal=True, num_buckets=32, max_distance=128):
161
+ ret = 0
162
+ n = -relative_position
163
+ if not causal:
164
+ num_buckets //= 2
165
+ ret += (n < 0).long() * num_buckets
166
+ n = torch.abs(n)
167
+ else:
168
+ n = torch.max(n, torch.zeros_like(n))
169
+
170
+ max_exact = num_buckets // 2
171
+ is_small = n < max_exact
172
+
173
+ val_if_large = max_exact + (
174
+ torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)
175
+ ).long()
176
+ val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))
177
+
178
+ ret += torch.where(is_small, n, val_if_large)
179
+ return ret
180
+
181
+ def forward(self, qk_dots):
182
+ i, j, device = *qk_dots.shape[-2:], qk_dots.device
183
+ q_pos = torch.arange(i, dtype=torch.long, device=device)
184
+ k_pos = torch.arange(j, dtype=torch.long, device=device)
185
+ rel_pos = k_pos[None, :] - q_pos[:, None]
186
+ rp_bucket = self._relative_position_bucket(rel_pos, causal=self.causal, num_buckets=self.num_buckets,
187
+ max_distance=self.max_distance)
188
+ values = self.relative_attention_bias(rp_bucket)
189
+ bias = rearrange(values, 'i j h -> () h i j')
190
+ return qk_dots + (bias * self.scale)
191
+
192
+
193
+ class AlibiPositionalBias(nn.Module):
194
+ def __init__(self, heads, **kwargs):
195
+ super().__init__()
196
+ self.heads = heads
197
+ slopes = torch.Tensor(self._get_slopes(heads))
198
+ slopes = rearrange(slopes, 'h -> () h () ()')
199
+ self.register_buffer('slopes', slopes, persistent=False)
200
+ self.register_buffer('bias', None, persistent=False)
201
+
202
+ @staticmethod
203
+ def _get_slopes(heads):
204
+ def get_slopes_power_of_2(n):
205
+ start = (2 ** (-2 ** -(math.log2(n) - 3)))
206
+ ratio = start
207
+ return [start * ratio ** i for i in range(n)]
208
+
209
+ if math.log2(heads).is_integer():
210
+ return get_slopes_power_of_2(heads)
211
+
212
+ closest_power_of_2 = 2 ** math.floor(math.log2(heads))
213
+ return get_slopes_power_of_2(closest_power_of_2) + get_slopes_power_of_2(2 * closest_power_of_2)[0::2][
214
+ :heads - closest_power_of_2]
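+ # With heads=8 this yields geometric slopes [1/2, 1/4, ..., 1/256];
+ # non-power-of-two head counts interleave slopes from the next power
+ # of two, per the ALiBi paper.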
215
+
216
+ def forward(self, qk_dots):
217
+ h, i, j, device = *qk_dots.shape[-3:], qk_dots.device
218
+
219
+ if exists(self.bias) and self.bias.shape[-1] >= j:
220
+ return qk_dots + self.bias[..., :j]
221
+
222
+ bias = torch.arange(j, device=device)
223
+ bias = rearrange(bias, 'j -> () () () j')
224
+ bias = bias * self.slopes
225
+
226
+ num_heads_unalibied = h - bias.shape[1]
227
+ bias = F.pad(bias, (0, 0, 0, 0, 0, num_heads_unalibied))
228
+
229
+ self.register_buffer('bias', bias, persistent=False)
230
+ return qk_dots + self.bias
231
+
232
+
233
+ class LearnedAlibiPositionalBias(AlibiPositionalBias):
234
+ def __init__(self, heads, bidirectional=False):
235
+ super().__init__(heads)
236
+ los_slopes = torch.log(self.slopes)
237
+ self.learned_logslopes = nn.Parameter(los_slopes)
238
+
239
+ self.bidirectional = bidirectional
240
+ if self.bidirectional:
241
+ self.learned_logslopes_future = nn.Parameter(los_slopes)
242
+
243
+ def forward(self, qk_dots):
244
+ h, i, j, device = *qk_dots.shape[-3:], qk_dots.device
245
+
246
+ def get_slopes(param):
247
+ return F.pad(param.exp(), (0, 0, 0, 0, 0, h - param.shape[1]))
248
+
249
+ if exists(self.bias) and self.bias.shape[-1] >= j:
250
+ bias = self.bias[..., :i, :j]
251
+ else:
252
+ i_arange = torch.arange(i, device=device)
253
+ j_arange = torch.arange(j, device=device)
254
+ bias = rearrange(j_arange, 'j -> 1 1 1 j') - rearrange(i_arange, 'i -> 1 1 i 1')
255
+ self.register_buffer('bias', bias, persistent=False)
256
+
257
+ if self.bidirectional:
258
+ past_slopes = get_slopes(self.learned_logslopes)
259
+ future_slopes = get_slopes(self.learned_logslopes_future)
260
+ bias = torch.tril(bias * past_slopes) + torch.triu(bias * future_slopes)
261
+ else:
262
+ slopes = get_slopes(self.learned_logslopes)
263
+ bias = bias * slopes
264
+
265
+ return qk_dots + bias
266
+
267
+
268
+ class RotaryEmbedding(nn.Module):
269
+ def __init__(self, dim):
270
+ super().__init__()
271
+ inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim))
272
+ self.register_buffer('inv_freq', inv_freq)
273
+
274
+ def forward(self, max_seq_len, device):
275
+ t = torch.arange(max_seq_len, device=device).type_as(self.inv_freq)
276
+ freqs = torch.einsum('i , j -> i j', t, self.inv_freq)
277
+ emb = torch.cat((freqs, freqs), dim=-1)
278
+ return rearrange(emb, 'n d -> () () n d')
279
+
280
+
281
+ def rotate_half(x):
282
+ x = rearrange(x, '... (j d) -> ... j d', j=2)
283
+ x1, x2 = x.unbind(dim=-2)
284
+ return torch.cat((-x2, x1), dim=-1)
285
+
286
+
287
+ def apply_rotary_pos_emb(t, freqs):
288
+ seq_len = t.shape[-2]
289
+ freqs = freqs[:, :, -seq_len:]
290
+ return (t * freqs.cos()) + (rotate_half(t) * freqs.sin())
291
+
292
+
293
+ # norms
294
+
295
+ class Scale(nn.Module):
296
+ def __init__(self, value, fn):
297
+ super().__init__()
298
+ self.value = value
299
+ self.fn = fn
300
+
301
+ def forward(self, x, **kwargs):
302
+ out = self.fn(x, **kwargs)
303
+ scale_fn = lambda t: t * self.value
304
+
305
+ if not isinstance(out, tuple):
306
+ return scale_fn(out)
307
+
308
+ return (scale_fn(out[0]), *out[1:])
309
+
310
+
311
+ class Rezero(nn.Module):
312
+ def __init__(self, fn):
313
+ super().__init__()
314
+ self.fn = fn
315
+ self.g = nn.Parameter(torch.zeros(1))
316
+
317
+ def forward(self, x, **kwargs):
318
+ out = self.fn(x, **kwargs)
319
+ rezero_fn = lambda t: t * self.g
320
+
321
+ if not isinstance(out, tuple):
322
+ return rezero_fn(out)
323
+
324
+ return (rezero_fn(out[0]), *out[1:])
325
+
326
+
327
+ class ScaleNorm(nn.Module):
328
+ def __init__(self, dim, eps=1e-5):
329
+ super().__init__()
330
+ self.scale = dim ** -0.5
331
+ self.eps = eps
332
+ self.g = nn.Parameter(torch.ones(1))
333
+
334
+ def forward(self, x):
335
+ norm = torch.norm(x, dim=-1, keepdim=True) * self.scale
336
+ return x / norm.clamp(min=self.eps) * self.g
337
+
338
+
339
+ class RMSNorm(nn.Module):
340
+ def __init__(self, dim, eps=1e-8):
341
+ super().__init__()
342
+ self.scale = dim ** -0.5
343
+ self.eps = eps
344
+ self.g = nn.Parameter(torch.ones(dim))
345
+
346
+ def forward(self, x):
347
+ norm = torch.norm(x, dim=-1, keepdim=True) * self.scale
348
+ return x / norm.clamp(min=self.eps) * self.g
349
+
350
+
351
+ class RMSScaleShiftNorm(nn.Module):
352
+ def __init__(self, dim, eps=1e-8):
353
+ super().__init__()
354
+ self.scale = dim ** -0.5
355
+ self.eps = eps
356
+ self.g = nn.Parameter(torch.ones(dim))
357
+ self.scale_shift_process = nn.Linear(dim * 2, dim * 2)
358
+
359
+ def forward(self, x, norm_scale_shift_inp):
360
+ norm = torch.norm(x, dim=-1, keepdim=True) * self.scale
361
+ norm = x / norm.clamp(min=self.eps) * self.g
362
+
363
+ ss_emb = self.scale_shift_process(norm_scale_shift_inp)
364
+ scale, shift = torch.chunk(ss_emb, 2, dim=1)
365
+ h = norm * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
366
+ return h
367
+
368
+
369
+ # residual and residual gates
370
+
371
+ class Residual(nn.Module):
372
+ def __init__(self, dim, scale_residual=False):
373
+ super().__init__()
374
+ self.residual_scale = nn.Parameter(torch.ones(dim)) if scale_residual else None
375
+
376
+ def forward(self, x, residual):
377
+ if exists(self.residual_scale):
378
+ residual = residual * self.residual_scale
379
+
380
+ return x + residual
381
+
382
+
383
+ class GRUGating(nn.Module):
384
+ def __init__(self, dim, scale_residual=False):
385
+ super().__init__()
386
+ self.gru = nn.GRUCell(dim, dim)
387
+ self.residual_scale = nn.Parameter(torch.ones(dim)) if scale_residual else None
388
+
389
+ def forward(self, x, residual):
390
+ if exists(self.residual_scale):
391
+ residual = residual * self.residual_scale
392
+
393
+ gated_output = self.gru(
394
+ rearrange(x, 'b n d -> (b n) d'),
395
+ rearrange(residual, 'b n d -> (b n) d')
396
+ )
397
+
398
+ return gated_output.reshape_as(x)
399
+
400
+
401
+ # token shifting
402
+
403
+ def shift(t, amount, mask=None):
404
+ if amount == 0:
405
+ return t
406
+
407
+ if exists(mask):
408
+ t = t.masked_fill(~mask[..., None], 0.)
409
+
410
+ return F.pad(t, (0, 0, amount, -amount), value=0.)
411
+
412
+
413
+ class ShiftTokens(nn.Module):
414
+ def __init__(self, shifts, fn):
415
+ super().__init__()
416
+ self.fn = fn
417
+ self.shifts = tuple(shifts)
418
+
419
+ def forward(self, x, **kwargs):
420
+ mask = kwargs.get('mask', None)
421
+ shifts = self.shifts
422
+ segments = len(shifts)
423
+ feats_per_shift = x.shape[-1] // segments
424
+ splitted = x.split(feats_per_shift, dim=-1)
425
+ segments_to_shift, rest = splitted[:segments], splitted[segments:]
426
+ segments_to_shift = list(map(lambda args: shift(*args, mask=mask), zip(segments_to_shift, shifts)))
427
+ x = torch.cat((*segments_to_shift, *rest), dim=-1)
428
+ return self.fn(x, **kwargs)
429
+
430
+
431
+ # feedforward
432
+
433
+ class GLU(nn.Module):
434
+ def __init__(self, dim_in, dim_out, activation):
435
+ super().__init__()
436
+ self.act = activation
437
+ self.proj = nn.Linear(dim_in, dim_out * 2)
438
+
439
+ def forward(self, x):
440
+ x, gate = self.proj(x).chunk(2, dim=-1)
441
+ return x * self.act(gate)
442
+
443
+
444
+ class FeedForward(nn.Module):
445
+ def __init__(
446
+ self,
447
+ dim,
448
+ dim_out=None,
449
+ mult=4,
450
+ glu=False,
451
+ relu_squared=False,
452
+ post_act_ln=False,
453
+ dropout=0.,
454
+ zero_init_output=False
455
+ ):
456
+ super().__init__()
457
+ inner_dim = int(dim * mult)
458
+ dim_out = default(dim_out, dim)
459
+ activation = ReluSquared() if relu_squared else nn.GELU()
460
+
461
+ project_in = nn.Sequential(
462
+ nn.Linear(dim, inner_dim),
463
+ activation
464
+ ) if not glu else GLU(dim, inner_dim, activation)
465
+
466
+ self.net = nn.Sequential(
467
+ project_in,
468
+ nn.LayerNorm(inner_dim) if post_act_ln else nn.Identity(),
469
+ nn.Dropout(dropout),
470
+ nn.Linear(inner_dim, dim_out)
471
+ )
472
+
473
+ # init last linear layer to 0
474
+ if zero_init_output:
475
+ init_zero_(self.net[-1])
476
+
477
+ def forward(self, x):
478
+ return self.net(x)
479
+
480
+
481
+ # attention.
482
+
483
+ class Attention(nn.Module):
484
+ def __init__(
485
+ self,
486
+ dim,
487
+ dim_head=DEFAULT_DIM_HEAD,
488
+ heads=8,
489
+ causal=False,
490
+ talking_heads=False,
491
+ head_scale=False,
492
+ collab_heads=False,
493
+ collab_compression=.3,
494
+ sparse_topk=None,
495
+ use_entmax15=False,
496
+ num_mem_kv=0,
497
+ dropout=0.,
498
+ on_attn=False,
499
+ gate_values=False,
500
+ zero_init_output=False,
501
+ max_attend_past=None,
502
+ qk_norm=False,
503
+ scale_init_value=None,
504
+ rel_pos_bias=False,
505
+ rel_pos_num_buckets=32,
506
+ rel_pos_max_distance=128,
507
+ ):
508
+ super().__init__()
509
+ self.scale = dim_head ** -0.5
510
+
511
+ self.heads = heads
512
+ self.causal = causal
513
+ self.max_attend_past = max_attend_past
514
+
515
+ qk_dim = v_dim = dim_head * heads
516
+
517
+ # collaborative heads
518
+ self.collab_heads = collab_heads
519
+ if self.collab_heads:
520
+ qk_dim = int(collab_compression * qk_dim)
521
+ self.collab_mixing = nn.Parameter(torch.randn(heads, qk_dim))
522
+
523
+ self.to_q = nn.Linear(dim, qk_dim, bias=False)
524
+ self.to_k = nn.Linear(dim, qk_dim, bias=False)
525
+ self.to_v = nn.Linear(dim, v_dim, bias=False)
526
+
527
+ self.dropout = nn.Dropout(dropout)
528
+
529
+ # add GLU gating for aggregated values, from alphafold2
530
+ self.to_v_gate = None
531
+ if gate_values:
532
+ self.to_v_gate = nn.Linear(dim, v_dim)
533
+ nn.init.constant_(self.to_v_gate.weight, 0)
534
+ nn.init.constant_(self.to_v_gate.bias, 1)
535
+
536
+ # cosine sim attention
537
+ self.qk_norm = qk_norm
538
+ if qk_norm:
539
+ scale_init_value = default(scale_init_value,
540
+ -3) # if not provided, initialize as though it were sequence length of 1024
541
+ self.scale = nn.Parameter(torch.ones(1, heads, 1, 1) * scale_init_value)
542
+
543
+ # talking heads
544
+ self.talking_heads = talking_heads
545
+ if talking_heads:
546
+ self.pre_softmax_proj = nn.Parameter(torch.randn(heads, heads))
547
+ self.post_softmax_proj = nn.Parameter(torch.randn(heads, heads))
548
+
549
+ # head scaling
550
+ self.head_scale = head_scale
551
+ if head_scale:
552
+ self.head_scale_params = nn.Parameter(torch.ones(1, heads, 1, 1))
553
+
554
+ # explicit topk sparse attention
555
+ self.sparse_topk = sparse_topk
556
+
557
+ # entmax
558
+ self.attn_fn = F.softmax
559
+
560
+ # add memory key / values
561
+ self.num_mem_kv = num_mem_kv
562
+ if num_mem_kv > 0:
563
+ self.mem_k = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head))
564
+ self.mem_v = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head))
565
+
566
+ # attention on attention
567
+ self.attn_on_attn = on_attn
568
+ self.to_out = nn.Sequential(nn.Linear(v_dim, dim * 2), nn.GLU()) if on_attn else nn.Linear(v_dim, dim)
569
+
570
+ self.rel_pos_bias = rel_pos_bias
571
+ if rel_pos_bias:
572
+ assert rel_pos_num_buckets <= rel_pos_max_distance, 'number of relative position buckets must be less than the relative position max distance'
573
+ self.rel_pos = RelativePositionBias(scale=dim_head ** 0.5, causal=causal, heads=heads,
574
+ num_buckets=rel_pos_num_buckets, max_distance=rel_pos_max_distance)
575
+
576
+ # init output projection 0
577
+ if zero_init_output:
578
+ init_zero_(self.to_out)
579
+
580
+ def forward(
581
+ self,
582
+ x,
583
+ context=None,
584
+ mask=None,
585
+ context_mask=None,
586
+ attn_mask=None,
587
+ sinusoidal_emb=None,
588
+ rotary_pos_emb=None,
589
+ prev_attn=None,
590
+ mem=None,
591
+ layer_past=None,
592
+ ):
593
+ b, n, _, h, talking_heads, collab_heads, head_scale, scale, device, has_context = *x.shape, self.heads, self.talking_heads, self.collab_heads, self.head_scale, self.scale, x.device, exists(
594
+ context)
595
+ kv_input = default(context, x)
596
+
597
+ q_input = x
598
+ k_input = kv_input
599
+ v_input = kv_input
600
+
601
+ if exists(mem):
602
+ k_input = torch.cat((mem, k_input), dim=-2)
603
+ v_input = torch.cat((mem, v_input), dim=-2)
604
+
605
+ if exists(sinusoidal_emb):
606
+ # in shortformer, the query would start at a position offset depending on the past cached memory
607
+ offset = k_input.shape[-2] - q_input.shape[-2]
608
+ q_input = q_input + sinusoidal_emb(q_input, offset=offset)
609
+ k_input = k_input + sinusoidal_emb(k_input)
610
+
611
+ q = self.to_q(q_input)
612
+ k = self.to_k(k_input)
613
+ v = self.to_v(v_input)
614
+
615
+ if not collab_heads:
616
+ q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=h), (q, k, v))
617
+ else:
618
+ q = einsum('b i d, h d -> b h i d', q, self.collab_mixing)
619
+ k = rearrange(k, 'b n d -> b () n d')
620
+ v = rearrange(v, 'b n (h d) -> b h n d', h=h)
621
+
622
+ if layer_past is not None:
623
+ past_key, past_value = layer_past
624
+ k = torch.cat([past_key, k], dim=-2)
625
+ v = torch.cat([past_value, v], dim=-2)
626
+ k_cache = k
627
+ v_cache = v
628
+
629
+ if exists(rotary_pos_emb) and not has_context:
630
+ l = rotary_pos_emb.shape[-1]
631
+ (ql, qr), (kl, kr), (vl, vr) = map(lambda t: (t[..., :l], t[..., l:]), (q, k, v))
632
+ ql, kl, vl = map(lambda t: apply_rotary_pos_emb(t, rotary_pos_emb), (ql, kl, vl))
633
+ q, k, v = map(lambda t: torch.cat(t, dim=-1), ((ql, qr), (kl, kr), (vl, vr)))
634
+
635
+ input_mask = None
636
+ if any(map(exists, (mask, context_mask))):
637
+ q_mask = default(mask, lambda: torch.ones((b, n), device=device).bool())
638
+ k_mask = q_mask if not exists(context) else context_mask
639
+ k_mask = default(k_mask, lambda: torch.ones((b, k.shape[-2]), device=device).bool())
640
+ q_mask = rearrange(q_mask, 'b i -> b () i ()')
641
+ k_mask = rearrange(k_mask, 'b j -> b () () j')
642
+ input_mask = q_mask * k_mask
643
+
644
+ if self.num_mem_kv > 0:
645
+ mem_k, mem_v = map(lambda t: repeat(t, 'h n d -> b h n d', b=b), (self.mem_k, self.mem_v))
646
+ k = torch.cat((mem_k, k), dim=-2)
647
+ v = torch.cat((mem_v, v), dim=-2)
648
+ if exists(input_mask):
649
+ input_mask = F.pad(input_mask, (self.num_mem_kv, 0), value=True)
650
+
651
+ if collab_heads:
652
+ k = k.expand(-1, h, -1, -1)
653
+
654
+ if self.qk_norm:
655
+ q, k = map(l2norm, (q, k))
656
+ scale = 1 / (self.scale.exp().clamp(min=1e-2))
657
+
658
+ dots = einsum('b h i d, b h j d -> b h i j', q, k) * scale
659
+ mask_value = max_neg_value(dots)
660
+
661
+ if exists(prev_attn):
662
+ dots = dots + prev_attn
663
+
664
+ pre_softmax_attn = dots.clone()
665
+
666
+ if talking_heads:
667
+ dots = einsum('b h i j, h k -> b k i j', dots, self.pre_softmax_proj).contiguous()
668
+
669
+ if self.rel_pos_bias:
670
+ dots = self.rel_pos(dots)
671
+
672
+ if exists(input_mask):
673
+ dots.masked_fill_(~input_mask, mask_value)
674
+ del input_mask
675
+
676
+ if exists(attn_mask):
677
+ assert 2 <= attn_mask.ndim <= 4, 'attention mask must have greater than 2 dimensions but less than or equal to 4'
678
+ if attn_mask.ndim == 2:
679
+ attn_mask = rearrange(attn_mask, 'i j -> () () i j')
680
+ elif attn_mask.ndim == 3:
681
+ attn_mask = rearrange(attn_mask, 'h i j -> () h i j')
682
+ dots.masked_fill_(~attn_mask, mask_value)
683
+
684
+ if exists(self.max_attend_past):
685
+ i, j = dots.shape[-2:]
686
+ range_q = torch.arange(j - i, j, device=device)
687
+ range_k = torch.arange(j, device=device)
688
+ dist = rearrange(range_q, 'i -> () () i ()') - rearrange(range_k, 'j -> () () () j')
689
+ mask = dist > self.max_attend_past
690
+ dots.masked_fill_(mask, mask_value)
691
+ del mask
692
+
693
+ if self.causal:
694
+ i, j = dots.shape[-2:]
695
+ r = torch.arange(i, device=device)
696
+ mask = rearrange(r, 'i -> () () i ()') < rearrange(r, 'j -> () () () j')
697
+ mask = F.pad(mask, (j - i, 0), value=False)
698
+ dots.masked_fill_(mask, mask_value)
699
+ del mask
700
+
701
+ if exists(self.sparse_topk) and self.sparse_topk < dots.shape[-1]:
702
+ top, _ = dots.topk(self.sparse_topk, dim=-1)
703
+ vk = top[..., -1].unsqueeze(-1).expand_as(dots)
704
+ mask = dots < vk
705
+ dots.masked_fill_(mask, mask_value)
706
+ del mask
707
+
708
+ attn = self.attn_fn(dots, dim=-1)
709
+ post_softmax_attn = attn.clone()
710
+
711
+ attn = self.dropout(attn)
712
+
713
+ if talking_heads:
714
+ attn = einsum('b h i j, h k -> b k i j', attn, self.post_softmax_proj).contiguous()
715
+
716
+ out = einsum('b h i j, b h j d -> b h i d', attn, v)
717
+
718
+ if head_scale:
719
+ out = out * self.head_scale_params
720
+
721
+ out = rearrange(out, 'b h n d -> b n (h d)')
722
+
723
+ if exists(self.to_v_gate):
724
+ gates = self.to_v_gate(x)
725
+ out = out * gates.sigmoid()
726
+
727
+ intermediates = Intermediates(
728
+ pre_softmax_attn=pre_softmax_attn,
729
+ post_softmax_attn=post_softmax_attn
730
+ )
731
+
732
+ return self.to_out(out), intermediates, k_cache, v_cache
733
+
734
+
735
+ class AttentionLayers(nn.Module):
736
+ def __init__(
737
+ self,
738
+ dim,
739
+ depth,
740
+ heads=8,
741
+ causal=False,
742
+ cross_attend=False,
743
+ only_cross=False,
744
+ use_scalenorm=False,
745
+ use_rms_scaleshift_norm=False,
746
+ use_rmsnorm=False,
747
+ use_rezero=False,
748
+ alibi_pos_bias=False,
749
+ alibi_num_heads=None,
750
+ alibi_learned=False,
751
+ position_infused_attn=False,
752
+ rotary_pos_emb=False,
753
+ rotary_emb_dim=None,
754
+ custom_layers=None,
755
+ sandwich_coef=None,
756
+ par_ratio=None,
757
+ residual_attn=False,
758
+ cross_residual_attn=False,
759
+ macaron=False,
760
+ pre_norm=True,
761
+ gate_residual=False,
762
+ scale_residual=False,
763
+ shift_tokens=0,
764
+ sandwich_norm=False,
765
+ use_qk_norm_attn=False,
766
+ qk_norm_attn_seq_len=None,
767
+ zero_init_branch_output=False,
768
+ **kwargs
769
+ ):
770
+ super().__init__()
771
+ ff_kwargs, kwargs = groupby_prefix_and_trim('ff_', kwargs)
772
+ attn_kwargs, _ = groupby_prefix_and_trim('attn_', kwargs)
773
+
774
+ dim_head = attn_kwargs.get('dim_head', DEFAULT_DIM_HEAD)
775
+
776
+ self.dim = dim
777
+ self.depth = depth
778
+ self.layers = nn.ModuleList([])
779
+ self.causal = causal
780
+
781
+ rel_pos_bias = 'rel_pos_bias' in attn_kwargs
782
+ self.has_pos_emb = position_infused_attn or rel_pos_bias or rotary_pos_emb
783
+ self.pia_pos_emb = FixedPositionalEmbedding(dim) if position_infused_attn else None
784
+
785
+ rotary_emb_dim = max(default(rotary_emb_dim, dim_head // 2), 32)
786
+ self.rotary_pos_emb = RotaryEmbedding(rotary_emb_dim) if rotary_pos_emb else None
787
+
788
+ assert not (
789
+ alibi_pos_bias and rel_pos_bias), 'you can only choose Alibi positional bias or T5 relative positional bias, not both'
790
+
791
+ if alibi_pos_bias:
792
+ alibi_num_heads = default(alibi_num_heads, heads)
793
+ assert alibi_num_heads <= heads, 'number of ALiBi heads must be less than the total number of heads'
794
+ alibi_pos_klass = LearnedAlibiPositionalBias if alibi_learned or not causal else AlibiPositionalBias
795
+ self.rel_pos = alibi_pos_klass(heads=alibi_num_heads, bidirectional=not causal)
796
+ else:
797
+ self.rel_pos = None
798
+
799
+ assert not (not pre_norm and sandwich_norm), 'sandwich norm cannot be used when not using prenorm'
800
+ self.pre_norm = pre_norm
801
+ self.sandwich_norm = sandwich_norm
802
+
803
+ self.residual_attn = residual_attn
804
+ self.cross_residual_attn = cross_residual_attn
805
+ self.cross_attend = cross_attend
806
+
807
+ norm_class = ScaleNorm if use_scalenorm else nn.LayerNorm
808
+ norm_class = RMSNorm if use_rmsnorm else norm_class
809
+ norm_class = RMSScaleShiftNorm if use_rms_scaleshift_norm else norm_class
810
+ norm_fn = partial(norm_class, dim)
811
+
812
+ norm_fn = nn.Identity if use_rezero else norm_fn
813
+ branch_fn = Rezero if use_rezero else None
814
+
815
+ if cross_attend and not only_cross:
816
+ default_block = ('a', 'c', 'f')
817
+ elif cross_attend and only_cross:
818
+ default_block = ('c', 'f')
819
+ else:
820
+ default_block = ('a', 'f')
821
+
822
+ if macaron:
823
+ default_block = ('f',) + default_block
824
+
825
+ # qk normalization
826
+
827
+ if use_qk_norm_attn:
828
+ attn_scale_init_value = -math.log(math.log2(qk_norm_attn_seq_len ** 2 - qk_norm_attn_seq_len)) if exists(
829
+ qk_norm_attn_seq_len) else None
830
+ attn_kwargs = {**attn_kwargs, 'qk_norm': True, 'scale_init_value': attn_scale_init_value}
831
+
832
+ # zero init
833
+
834
+ if zero_init_branch_output:
835
+ attn_kwargs = {**attn_kwargs, 'zero_init_output': True}
836
+ ff_kwargs = {**ff_kwargs, 'zero_init_output': True}
837
+
838
+ # calculate layer block order
839
+
840
+ if exists(custom_layers):
841
+ layer_types = custom_layers
842
+ elif exists(par_ratio):
843
+ par_depth = depth * len(default_block)
844
+ assert 1 < par_ratio <= par_depth, 'par ratio out of range'
845
+ default_block = tuple(filter(not_equals('f'), default_block))
846
+ par_attn = par_depth // par_ratio
847
+ depth_cut = par_depth * 2 // 3 # 2 / 3 attention layer cutoff suggested by PAR paper
848
+ par_width = (depth_cut + depth_cut // par_attn) // par_attn
849
+ assert len(default_block) <= par_width, 'default block is too large for par_ratio'
850
+ par_block = default_block + ('f',) * (par_width - len(default_block))
851
+ par_head = par_block * par_attn
852
+ layer_types = par_head + ('f',) * (par_depth - len(par_head))
853
+ elif exists(sandwich_coef):
854
+ assert sandwich_coef > 0 and sandwich_coef <= depth, 'sandwich coefficient should be less than the depth'
855
+ layer_types = ('a',) * sandwich_coef + default_block * (depth - sandwich_coef) + ('f',) * sandwich_coef
856
+ else:
857
+ layer_types = default_block * depth
858
+
859
+ self.layer_types = layer_types
860
+ self.num_attn_layers = len(list(filter(equals('a'), layer_types)))
861
+
862
+ # calculate token shifting
863
+
864
+ shift_tokens = cast_tuple(shift_tokens, len(layer_types))
865
+
866
+ # iterate and construct layers
867
+
868
+ for ind, (layer_type, layer_shift_tokens) in enumerate(zip(self.layer_types, shift_tokens)):
869
+ is_last_layer = ind == (len(self.layer_types) - 1)
870
+
871
+ if layer_type == 'a':
872
+ layer = Attention(dim, heads=heads, causal=causal, **attn_kwargs)
873
+ elif layer_type == 'c':
874
+ layer = Attention(dim, heads=heads, **attn_kwargs)
875
+ elif layer_type == 'f':
876
+ layer = FeedForward(dim, **ff_kwargs)
877
+ layer = layer if not macaron else Scale(0.5, layer)
878
+ else:
879
+ raise Exception(f'invalid layer type {layer_type}')
880
+
881
+ if layer_shift_tokens > 0:
882
+ shift_range_upper = layer_shift_tokens + 1
883
+ shift_range_lower = -layer_shift_tokens if not causal else 0
884
+ layer = ShiftTokens(range(shift_range_lower, shift_range_upper), layer)
885
+
886
+ if exists(branch_fn):
887
+ layer = branch_fn(layer)
888
+
889
+ residual_fn = GRUGating if gate_residual else Residual
890
+ residual = residual_fn(dim, scale_residual=scale_residual)
891
+
892
+ layer_uses_qk_norm = use_qk_norm_attn and layer_type in ('a', 'c')
893
+
894
+ pre_branch_norm = norm_fn() if pre_norm and not layer_uses_qk_norm else None
895
+ post_branch_norm = norm_fn() if sandwich_norm or layer_uses_qk_norm else None
896
+ post_main_norm = norm_fn() if not pre_norm and not is_last_layer else None
897
+
898
+ norms = nn.ModuleList([
899
+ pre_branch_norm,
900
+ post_branch_norm,
901
+ post_main_norm
902
+ ])
903
+
904
+ self.layers.append(nn.ModuleList([
905
+ norms,
906
+ layer,
907
+ residual
908
+ ]))
909
+
910
+ def forward(
911
+ self,
912
+ x,
913
+ context=None,
914
+ full_context=None, # for passing a list of hidden states from an encoder
915
+ mask=None,
916
+ context_mask=None,
917
+ attn_mask=None,
918
+ mems=None,
919
+ return_hiddens=False,
920
+ norm_scale_shift_inp=None,
921
+ past_key_values=None,
922
+ expected_seq_len=None,
923
+ ):
924
+
925
+ assert not (self.cross_attend ^ (exists(context) or exists(
926
+ full_context))), 'context must be passed in if cross_attend is set to True'
927
+ assert context is None or full_context is None, 'only one of full_context or context can be provided'
928
+
929
+ hiddens = []
930
+ intermediates = []
931
+ prev_attn = None
932
+ prev_cross_attn = None
933
+
934
+ mems = mems.copy() if exists(mems) else [None] * self.num_attn_layers
935
+ norm_args = {}
936
+ if exists(norm_scale_shift_inp):
937
+ norm_args['norm_scale_shift_inp'] = norm_scale_shift_inp
938
+
939
+ rotary_pos_emb = None
940
+ if exists(self.rotary_pos_emb):
941
+ if not self.training and self.causal:
942
+ assert expected_seq_len is not None, "To decode a transformer with rotary embeddings, you must specify an `expected_seq_len`"
943
+ elif expected_seq_len is None:
944
+ expected_seq_len = 0
945
+ seq_len = x.shape[1]
946
+ if past_key_values is not None:
947
+ seq_len += past_key_values[0][0].shape[-2]
948
+ max_rotary_emb_length = max(list(map(lambda m: (m.shape[1] if exists(m) else 0) + seq_len, mems)) + [expected_seq_len])
949
+ rotary_pos_emb = self.rotary_pos_emb(max_rotary_emb_length, x.device)
950
+
951
+ present_key_values = []
952
+ cross_attn_count = 0
953
+ for ind, (layer_type, (norm, block, residual_fn)) in enumerate(zip(self.layer_types, self.layers)):
954
+ if layer_type == 'a':
955
+ layer_mem = mems.pop(0) if mems else None
956
+
957
+ residual = x
958
+
959
+ pre_branch_norm, post_branch_norm, post_main_norm = norm
960
+
961
+ if exists(pre_branch_norm):
962
+ x = pre_branch_norm(x, **norm_args)
963
+
964
+ if layer_type == 'a' or layer_type == 'c':
965
+ if past_key_values is not None:
966
+ layer_kv = past_key_values.pop(0)
967
+ layer_past = tuple(s.to(x.device) for s in layer_kv)
968
+ else:
969
+ layer_past = None
970
+
971
+ if layer_type == 'a':
972
+ out, inter, k, v = checkpoint(block, x, None, mask, None, attn_mask, self.pia_pos_emb, rotary_pos_emb,
973
+ prev_attn, layer_mem, layer_past)
974
+ elif layer_type == 'c':
975
+ if exists(full_context):
976
+ out, inter, k, v = checkpoint(block, x, full_context[cross_attn_count], mask, context_mask, None, None,
977
+ None, prev_cross_attn, None, layer_past)
978
+ else:
979
+ out, inter, k, v = checkpoint(block, x, context, mask, context_mask, None, None, None, prev_cross_attn, None, layer_past)
980
+ elif layer_type == 'f':
981
+ out = checkpoint(block, x)
982
+
983
+ if layer_type in ('a', 'c') and present_key_values is not None:
984
+ present_key_values.append((k.detach(), v.detach()))
985
+
986
+ if exists(post_branch_norm):
987
+ out = post_branch_norm(out, **norm_args)
988
+
989
+ x = residual_fn(out, residual)
990
+
991
+ if layer_type in ('a', 'c'):
992
+ intermediates.append(inter)
993
+
994
+ if layer_type == 'a' and self.residual_attn:
995
+ prev_attn = inter.pre_softmax_attn
996
+ elif layer_type == 'c' and self.cross_residual_attn:
997
+ prev_cross_attn = inter.pre_softmax_attn
998
+
999
+ if exists(post_main_norm):
1000
+ x = post_main_norm(x, **norm_args)
1001
+
1002
+ if layer_type == 'c':
1003
+ cross_attn_count += 1
1004
+
1005
+ if layer_type == 'f':
1006
+ hiddens.append(x)
1007
+
1008
+ if return_hiddens:
1009
+ intermediates = LayerIntermediates(
1010
+ hiddens=hiddens,
1011
+ attn_intermediates=intermediates,
1012
+ past_key_values=present_key_values
1013
+ )
1014
+
1015
+ return x, intermediates
1016
+
1017
+ return x
1018
+
1019
+
1020
+ class Encoder(AttentionLayers):
1021
+ def __init__(self, **kwargs):
1022
+ assert 'causal' not in kwargs, 'cannot set causality on encoder'
1023
+ super().__init__(causal=False, **kwargs)
1024
+
1025
+
1026
+ class Decoder(AttentionLayers):
1027
+ def __init__(self, **kwargs):
1028
+ assert 'causal' not in kwargs, 'cannot set causality on decoder'
1029
+ super().__init__(causal=True, **kwargs)
1030
+
1031
+
1032
+ class CrossAttender(AttentionLayers):
1033
+ def __init__(self, **kwargs):
1034
+ super().__init__(cross_attend=True, only_cross=True, **kwargs)
1035
+
1036
+
1037
+ class ViTransformerWrapper(nn.Module):
1038
+ def __init__(
1039
+ self,
1040
+ *,
1041
+ image_size,
1042
+ patch_size,
1043
+ attn_layers,
1044
+ num_classes=None,
1045
+ dropout=0.,
1046
+ emb_dropout=0.
1047
+ ):
1048
+ super().__init__()
1049
+ assert isinstance(attn_layers, Encoder), 'attention layers must be an Encoder'
1050
+ assert image_size % patch_size == 0, 'image dimensions must be divisible by the patch size'
1051
+ dim = attn_layers.dim
1052
+ num_patches = (image_size // patch_size) ** 2
1053
+ patch_dim = 3 * patch_size ** 2
1054
+
1055
+ self.patch_size = patch_size
1056
+
1057
+ self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
1058
+ self.patch_to_embedding = nn.Linear(patch_dim, dim)
1059
+ self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
1060
+ self.dropout = nn.Dropout(emb_dropout)
1061
+
1062
+ self.attn_layers = attn_layers
1063
+ self.norm = nn.LayerNorm(dim)
1064
+ self.mlp_head = FeedForward(dim, dim_out=num_classes, dropout=dropout) if exists(num_classes) else None
1065
+
1066
+ def forward(
1067
+ self,
1068
+ img,
1069
+ return_embeddings=False
1070
+ ):
1071
+ p = self.patch_size
1072
+
1073
+ x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
1074
+ x = self.patch_to_embedding(x)
1075
+ b, n, _ = x.shape
1076
+
1077
+ cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b)
1078
+ x = torch.cat((cls_tokens, x), dim=1)
1079
+ x = x + self.pos_embedding[:, :(n + 1)]
1080
+ x = self.dropout(x)
1081
+
1082
+ x = self.attn_layers(x)
1083
+ x = self.norm(x)
1084
+
1085
+ if not exists(self.mlp_head) or return_embeddings:
1086
+ return x
1087
+
1088
+ return self.mlp_head(x[:, 0])
1089
+
1090
+
1091
+ class TransformerWrapper(nn.Module):
1092
+ def __init__(
1093
+ self,
1094
+ *,
1095
+ num_tokens,
1096
+ max_seq_len,
1097
+ attn_layers,
1098
+ emb_dim=None,
1099
+ max_mem_len=0.,
1100
+ shift_mem_down=0,
1101
+ emb_dropout=0.,
1102
+ num_memory_tokens=None,
1103
+ tie_embedding=False,
1104
+ use_pos_emb=True
1105
+ ):
1106
+ super().__init__()
1107
+ assert isinstance(attn_layers, AttentionLayers), 'attention layers must be one of Encoder or Decoder'
1108
+
1109
+ dim = attn_layers.dim
1110
+ emb_dim = default(emb_dim, dim)
1111
+
1112
+ self.max_seq_len = max_seq_len
1113
+ self.max_mem_len = max_mem_len
1114
+ self.shift_mem_down = shift_mem_down
1115
+
1116
+ self.token_emb = nn.Embedding(num_tokens, emb_dim)
1117
+ self.pos_emb = AbsolutePositionalEmbedding(emb_dim, max_seq_len) if (
1118
+ use_pos_emb and not attn_layers.has_pos_emb) else always(0)
1119
+ self.emb_dropout = nn.Dropout(emb_dropout)
1120
+
1121
+ self.project_emb = nn.Linear(emb_dim, dim) if emb_dim != dim else nn.Identity()
1122
+ self.attn_layers = attn_layers
1123
+ self.norm = nn.LayerNorm(dim)
1124
+
1125
+ self.init_()
1126
+
1127
+ self.to_logits = nn.Linear(dim, num_tokens) if not tie_embedding else lambda t: t @ self.token_emb.weight.t()
1128
+
1129
+ # memory tokens (like [cls]) from Memory Transformers paper
1130
+ num_memory_tokens = default(num_memory_tokens, 0)
1131
+ self.num_memory_tokens = num_memory_tokens
1132
+ if num_memory_tokens > 0:
1133
+ self.memory_tokens = nn.Parameter(torch.randn(num_memory_tokens, dim))
1134
+
1135
+ def init_(self):
1136
+ nn.init.kaiming_normal_(self.token_emb.weight)
1137
+
1138
+ def forward(
1139
+ self,
1140
+ x,
1141
+ return_embeddings=False,
1142
+ mask=None,
1143
+ return_hiddens=False,
1144
+ return_attn=False,
1145
+ mems=None,
1146
+ use_cache=False,
1147
+ **kwargs
1148
+ ):
1149
+ b, n, device, num_mem = *x.shape, x.device, self.num_memory_tokens
1150
+ x = self.token_emb(x)
1151
+ x = x + self.pos_emb(x)
1152
+ x = self.emb_dropout(x)
1153
+
1154
+ x = self.project_emb(x)
1155
+
1156
+ if num_mem > 0:
1157
+ mem = repeat(self.memory_tokens, 'n d -> b n d', b=b)
1158
+ x = torch.cat((mem, x), dim=1)
1159
+
1160
+ # auto-handle masking after appending memory tokens
1161
+ if exists(mask):
1162
+ mask = F.pad(mask, (num_mem, 0), value=True)
1163
+
1164
+ if self.shift_mem_down and exists(mems):
1165
+ mems_l, mems_r = mems[:self.shift_mem_down], mems[self.shift_mem_down:]
1166
+ mems = [*mems_r, *mems_l]
1167
+
1168
+ x, intermediates = self.attn_layers(x, mask=mask, mems=mems, return_hiddens=True, **kwargs)
1169
+ x = self.norm(x)
1170
+
1171
+ mem, x = x[:, :num_mem], x[:, num_mem:]
1172
+
1173
+ out = self.to_logits(x) if not return_embeddings else x
1174
+
1175
+ if return_hiddens:
1176
+ hiddens = intermediates.hiddens
1177
+ return out, hiddens
1178
+
1179
+ res = [out]
1180
+ if return_attn:
1181
+ attn_maps = list(map(lambda t: t.post_softmax_attn, intermediates.attn_intermediates))
1182
+ res.append(attn_maps)
1183
+ if use_cache:
1184
+ res.append(intermediates.past_key_values)
1185
+
1186
+ if len(res) > 1:
1187
+ return tuple(res)
1188
+ return res[0]
1189
+
1190
+
1191
+ class ContinuousTransformerWrapper(nn.Module):
1192
+ def __init__(
1193
+ self,
1194
+ *,
1195
+ max_seq_len,
1196
+ attn_layers,
1197
+ dim_in=None,
1198
+ dim_out=None,
1199
+ emb_dim=None,
1200
+ emb_dropout=0.,
1201
+ use_pos_emb=True
1202
+ ):
1203
+ super().__init__()
1204
+ assert isinstance(attn_layers, AttentionLayers), 'attention layers must be one of Encoder or Decoder'
1205
+
1206
+ dim = attn_layers.dim
1207
+
1208
+ self.max_seq_len = max_seq_len
1209
+
1210
+ self.pos_emb = AbsolutePositionalEmbedding(dim, max_seq_len) if (
1211
+ use_pos_emb and not attn_layers.has_pos_emb) else always(0)
1212
+ self.emb_dropout = nn.Dropout(emb_dropout)
1213
+
1214
+ self.project_in = nn.Linear(dim_in, dim) if exists(dim_in) else nn.Identity()
1215
+
1216
+ self.attn_layers = attn_layers
1217
+ self.norm = nn.LayerNorm(dim)
1218
+
1219
+ self.project_out = nn.Linear(dim, dim_out) if exists(dim_out) else nn.Identity()
1220
+
1221
+ def forward(
1222
+ self,
1223
+ x,
1224
+ return_embeddings=False,
1225
+ mask=None,
1226
+ return_attn=False,
1227
+ mems=None,
1228
+ use_cache=False,
1229
+ **kwargs
1230
+ ):
1231
+ b, n, _, device = *x.shape, x.device
1232
+
1233
+ x = self.project_in(x)
1234
+ x = x + self.pos_emb(x)
1235
+ x = self.emb_dropout(x)
1236
+
1237
+ x, intermediates = self.attn_layers(x, mask=mask, mems=mems, return_hiddens=True, **kwargs)
1238
+ x = self.norm(x)
1239
+
1240
+ out = self.project_out(x) if not return_embeddings else x
1241
+
1242
+ res = [out]
1243
+ if return_attn:
1244
+ attn_maps = list(map(lambda t: t.post_softmax_attn, intermediates.attn_intermediates))
1245
+ res.append(attn_maps)
1246
+ if use_cache:
1247
+ res.append(intermediates.past_key_values)
1248
+
1249
+ if len(res) > 1:
1250
+ return tuple(res)
1251
+ return res[0]
1252
+
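The wrappers above follow the x-transformers API: an `AttentionLayers` stack (constructed via `Encoder` or `Decoder`) is handed to `TransformerWrapper`, which adds the token and positional embeddings and the logit head. A minimal sketch of how the pieces fit together, assuming the classes above are importable from this module; the hyperparameters are illustrative, not the values Tortoise uses:

```python
import torch

# A small causal language model built from the classes above (illustrative sizes).
model = TransformerWrapper(
    num_tokens=256,                                  # vocabulary size
    max_seq_len=1024,
    attn_layers=Decoder(dim=512, depth=8, heads=8),
)

tokens = torch.randint(0, 256, (1, 128))             # (batch, seq)
logits = model(tokens)                               # (1, 128, 256)
```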
tortoise/read.py ADDED
@@ -0,0 +1,77 @@
1
+ import argparse
2
+ import os
3
+
4
+ import torch
5
+ import torchaudio
6
+
7
+ from tortoise.api import TextToSpeech
8
+ from tortoise.utils.audio import load_audio, get_voices, load_voices
9
+
10
+
11
+ def split_and_recombine_text(text, desired_length=200, max_len=300):
12
+ # TODO: also split across '!' and '?'. Attempt to keep quotations together.
13
+ texts = [s.strip() + "." for s in text.split('.')]
14
+
15
+ i = 0
16
+ while i < len(texts):
17
+ ltxt = texts[i]
18
+ if len(ltxt) >= desired_length or i == len(texts)-1:
19
+ i += 1
20
+ continue
21
+ if len(ltxt) + len(texts[i+1]) > max_len:
22
+ i += 1
23
+ continue
24
+ texts[i] = f'{ltxt} {texts[i+1]}'
25
+ texts.pop(i+1)
26
+ return texts
27
+
28
+
29
+ if __name__ == '__main__':
30
+ parser = argparse.ArgumentParser()
31
+ parser.add_argument('--textfile', type=str, help='A file containing the text to read.', default="tortoise/data/riding_hood.txt")
32
+ parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
33
+ 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='pat')
34
+ parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/')
35
+ parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard')
36
+ parser.add_argument('--regenerate', type=str, help='Comma-separated list of clip numbers to re-generate, or nothing.', default=None)
37
+ parser.add_argument('--voice_diversity_intelligibility_slider', type=float,
38
+ help='How to balance vocal diversity with the quality/intelligibility of the spoken text. 0 means highly diverse voice (not recommended), 1 means maximize intelligibility',
39
+ default=.5)
40
+ parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this '
41
+ 'should only be specified if you have custom checkpoints.', default='.models')
42
+ args = parser.parse_args()
43
+ tts = TextToSpeech(models_dir=args.model_dir)
44
+
45
+ outpath = args.output_path
46
+ selected_voices = args.voice.split(',')
47
+ regenerate = args.regenerate
48
+ if regenerate is not None:
49
+ regenerate = [int(e) for e in regenerate.split(',')]
50
+
51
+ for selected_voice in selected_voices:
52
+ voice_outpath = os.path.join(outpath, selected_voice)
53
+ os.makedirs(voice_outpath, exist_ok=True)
54
+
55
+ with open(args.textfile, 'r', encoding='utf-8') as f:
56
+ text = f.read()
57
+ texts = split_and_recombine_text(text)
58
+
59
+ if '&' in selected_voice:
60
+ voice_sel = selected_voice.split('&')
61
+ else:
62
+ voice_sel = [selected_voice]
63
+
64
+ voice_samples, conditioning_latents = load_voices(voice_sel)
65
+ all_parts = []
66
+ for j, text in enumerate(texts):
67
+ if regenerate is not None and j not in regenerate:
68
+ all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000))
69
+ continue
70
+ gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
71
+ preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider)
72
+ gen = gen.squeeze(0).cpu()
73
+ torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), gen, 24000)
74
+ all_parts.append(gen)
75
+ full_audio = torch.cat(all_parts, dim=-1)
76
+ torchaudio.save(os.path.join(voice_outpath, 'combined.wav'), full_audio, 24000)
77
+
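`read.py` relies on `split_and_recombine_text` above to chunk the input into sentence groups of roughly `desired_length` characters before synthesizing each chunk. A quick standalone sketch of its behavior (the input text is illustrative):

```python
text = (
    "Once upon a time there lived a little girl. Everyone called her "
    "Little Red Riding Hood. One day her mother baked some cakes."
)
# Sentences are greedily merged until a chunk reaches desired_length,
# but a merge is skipped if it would push the chunk past max_len.
for chunk in split_and_recombine_text(text, desired_length=80, max_len=120):
    print(repr(chunk))
```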
tortoise/utils/__init__.py ADDED
File without changes
tortoise/utils/audio.py ADDED
@@ -0,0 +1,179 @@
1
+ import os
2
+ from glob import glob
3
+
4
+ import librosa
5
+ import torch
6
+ import torchaudio
7
+ import numpy as np
8
+ from scipy.io.wavfile import read
9
+
10
+ from tortoise.utils.stft import STFT
11
+
12
+
13
+ def load_wav_to_torch(full_path):
14
+ sampling_rate, data = read(full_path)
15
+ if data.dtype == np.int32:
16
+ norm_fix = 2 ** 31
17
+ elif data.dtype == np.int16:
18
+ norm_fix = 2 ** 15
19
+ elif data.dtype == np.float16 or data.dtype == np.float32:
20
+ norm_fix = 1.
21
+ else:
22
+ raise NotImplementedError(f"Provided data dtype not supported: {data.dtype}")
23
+ return (torch.FloatTensor(data.astype(np.float32)) / norm_fix, sampling_rate)
24
+
25
+
26
+ def load_audio(audiopath, sampling_rate):
27
+ if audiopath[-4:] == '.wav':
28
+ audio, lsr = load_wav_to_torch(audiopath)
29
+ elif audiopath[-4:] == '.mp3':
30
+ audio, lsr = librosa.load(audiopath, sr=sampling_rate)
31
+ audio = torch.FloatTensor(audio)
+ else:
+ assert False, f"Unsupported audio format provided: {audiopath[-4:]}"
32
+
33
+ # Remove any channel data.
34
+ if len(audio.shape) > 1:
35
+ if audio.shape[0] < 5:
36
+ audio = audio[0]
37
+ else:
38
+ assert audio.shape[1] < 5
39
+ audio = audio[:, 0]
40
+
41
+ if lsr != sampling_rate:
42
+ audio = torchaudio.functional.resample(audio, lsr, sampling_rate)
43
+
44
+ # Check some assumptions about audio range. This should be automatically fixed in load_wav_to_torch, but might not be in some edge cases, where we should squawk.
45
+ # '2' is arbitrarily chosen since it seems like audio will often "overdrive" the [-1,1] bounds.
46
+ if torch.any(audio > 2) or torch.any(audio < -2):
47
+ print(f"Error with {audiopath}. Max={audio.max()} min={audio.min()}")
48
+ audio.clip_(-1, 1)
49
+
50
+ return audio.unsqueeze(0)
51
+
52
+
53
+ TACOTRON_MEL_MAX = 2.3143386840820312
54
+ TACOTRON_MEL_MIN = -11.512925148010254
55
+
56
+
57
+ def denormalize_tacotron_mel(norm_mel):
58
+ return ((norm_mel+1)/2)*(TACOTRON_MEL_MAX-TACOTRON_MEL_MIN)+TACOTRON_MEL_MIN
59
+
60
+
61
+ def normalize_tacotron_mel(mel):
62
+ return 2 * ((mel - TACOTRON_MEL_MIN) / (TACOTRON_MEL_MAX - TACOTRON_MEL_MIN)) - 1
63
+
64
+
65
+ def dynamic_range_compression(x, C=1, clip_val=1e-5):
66
+ """
67
+ PARAMS
68
+ ------
69
+ C: compression factor
70
+ """
71
+ return torch.log(torch.clamp(x, min=clip_val) * C)
72
+
73
+
74
+ def dynamic_range_decompression(x, C=1):
75
+ """
76
+ PARAMS
77
+ ------
78
+ C: compression factor used to compress
79
+ """
80
+ return torch.exp(x) / C
81
+
82
+
83
+ def get_voices():
84
+ subs = os.listdir('tortoise/voices')
85
+ voices = {}
86
+ for sub in subs:
87
+ subj = os.path.join('tortoise/voices', sub)
88
+ if os.path.isdir(subj):
89
+ voices[sub] = list(glob(f'{subj}/*.wav')) + list(glob(f'{subj}/*.mp3')) + list(glob(f'{subj}/*.pth'))
90
+ return voices
91
+
92
+
93
+ def load_voice(voice):
94
+ if voice == 'random':
95
+ return None, None
96
+
97
+ voices = get_voices()
98
+ paths = voices[voice]
99
+ if len(paths) == 1 and paths[0].endswith('.pth'):
100
+ return None, torch.load(paths[0])
101
+ else:
102
+ conds = []
103
+ for cond_path in paths:
104
+ c = load_audio(cond_path, 22050)
105
+ conds.append(c)
106
+ return conds, None
107
+
108
+
109
+ def load_voices(voices):
110
+ latents = []
111
+ clips = []
112
+ for voice in voices:
113
+ if voice == 'random':
114
+ print("Cannot combine a random voice with a non-random voice. Just using a random voice.")
115
+ return None, None
116
+ clip, latent = load_voice(voice)
117
+ if latent is None:
118
+ assert len(latents) == 0, "Can only combine raw audio voices or latent voices, not both. Do it yourself if you want this."
119
+ clips.extend(clip)
120
+ elif clip is None:
121
+ assert len(clips) == 0, "Can only combine raw audio voices or latent voices, not both. Do it yourself if you want this."
122
+ latents.append(latent)
123
+ if len(latents) == 0:
124
+ return clips, None
125
+ else:
126
+ latents = torch.stack(latents, dim=0)
127
+ return None, latents.mean(dim=0)
128
+
129
+
130
+ class TacotronSTFT(torch.nn.Module):
131
+ def __init__(self, filter_length=1024, hop_length=256, win_length=1024,
132
+ n_mel_channels=80, sampling_rate=22050, mel_fmin=0.0,
133
+ mel_fmax=8000.0):
134
+ super(TacotronSTFT, self).__init__()
135
+ self.n_mel_channels = n_mel_channels
136
+ self.sampling_rate = sampling_rate
137
+ self.stft_fn = STFT(filter_length, hop_length, win_length)
138
+ from librosa.filters import mel as librosa_mel_fn
139
+ mel_basis = librosa_mel_fn(
140
+ sr=sampling_rate, n_fft=filter_length, n_mels=n_mel_channels, fmin=mel_fmin, fmax=mel_fmax)
141
+ mel_basis = torch.from_numpy(mel_basis).float()
142
+ self.register_buffer('mel_basis', mel_basis)
143
+
144
+ def spectral_normalize(self, magnitudes):
145
+ output = dynamic_range_compression(magnitudes)
146
+ return output
147
+
148
+ def spectral_de_normalize(self, magnitudes):
149
+ output = dynamic_range_decompression(magnitudes)
150
+ return output
151
+
152
+ def mel_spectrogram(self, y):
153
+ """Computes mel-spectrograms from a batch of waves
154
+ PARAMS
155
+ ------
156
+ y: Variable(torch.FloatTensor) with shape (B, T) in range [-1, 1]
157
+
158
+ RETURNS
159
+ -------
160
+ mel_output: torch.FloatTensor of shape (B, n_mel_channels, T)
161
+ """
162
+ assert(torch.min(y.data) >= -10)
163
+ assert(torch.max(y.data) <= 10)
164
+ y = torch.clip(y, min=-1, max=1)
165
+
166
+ magnitudes, phases = self.stft_fn.transform(y)
167
+ magnitudes = magnitudes.data
168
+ mel_output = torch.matmul(self.mel_basis, magnitudes)
169
+ mel_output = self.spectral_normalize(mel_output)
170
+ return mel_output
171
+
172
+
173
+ def wav_to_univnet_mel(wav, do_normalization=False):
174
+ stft = TacotronSTFT(1024, 256, 1024, 100, 24000, 0, 12000)
175
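+ # Note: assumes a CUDA device is available and that wav already lives on the GPU.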
+ stft = stft.cuda()
176
+ mel = stft.mel_spectrogram(wav)
177
+ if do_normalization:
178
+ mel = normalize_tacotron_mel(mel)
179
+ return mel
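The `normalize_tacotron_mel`/`denormalize_tacotron_mel` pair above maps log-mel values from the empirical Tacotron range `[TACOTRON_MEL_MIN, TACOTRON_MEL_MAX]` into `[-1, 1]` (the range the diffusion code below clamps to when `clip_denoised` is set) and back. A small round-trip sanity check, assuming the functions above are importable:

```python
import torch

mel = torch.linspace(TACOTRON_MEL_MIN, TACOTRON_MEL_MAX, steps=8)
norm = normalize_tacotron_mel(mel)         # values now span [-1, 1]
restored = denormalize_tacotron_mel(norm)  # inverse mapping
assert torch.allclose(mel, restored, atol=1e-6)
```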
tortoise/utils/diffusion.py ADDED
@@ -0,0 +1,1250 @@
1
+ """
2
+ This is an almost carbon copy of gaussian_diffusion.py from OpenAI's ImprovedDiffusion repo, which itself notes:
3
+
4
+ This code started out as a PyTorch port of Ho et al's diffusion models:
5
+ https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/diffusion_utils_2.py
6
+
7
+ Docstrings have been added, as well as DDIM sampling and a new collection of beta schedules.
8
+ """
9
+
10
+ import enum
11
+ import math
12
+
13
+ import numpy as np
14
+ import torch
15
+ import torch as th
16
+ from tqdm import tqdm
17
+
18
+
19
+ def normal_kl(mean1, logvar1, mean2, logvar2):
20
+ """
21
+ Compute the KL divergence between two gaussians.
22
+
23
+ Shapes are automatically broadcasted, so batches can be compared to
24
+ scalars, among other use cases.
25
+ """
26
+ tensor = None
27
+ for obj in (mean1, logvar1, mean2, logvar2):
28
+ if isinstance(obj, th.Tensor):
29
+ tensor = obj
30
+ break
31
+ assert tensor is not None, "at least one argument must be a Tensor"
32
+
33
+ # Force variances to be Tensors. Broadcasting helps convert scalars to
34
+ # Tensors, but it does not work for th.exp().
35
+ logvar1, logvar2 = [
36
+ x if isinstance(x, th.Tensor) else th.tensor(x).to(tensor)
37
+ for x in (logvar1, logvar2)
38
+ ]
39
+
40
+ return 0.5 * (
41
+ -1.0
42
+ + logvar2
43
+ - logvar1
44
+ + th.exp(logvar1 - logvar2)
45
+ + ((mean1 - mean2) ** 2) * th.exp(-logvar2)
46
+ )
47
+
48
+
49
+ def approx_standard_normal_cdf(x):
50
+ """
51
+ A fast approximation of the cumulative distribution function of the
52
+ standard normal.
53
+ """
54
+ return 0.5 * (1.0 + th.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * th.pow(x, 3))))
55
+
56
+
57
+ def discretized_gaussian_log_likelihood(x, *, means, log_scales):
58
+ """
59
+ Compute the log-likelihood of a Gaussian distribution discretizing to a
60
+ given image.
61
+
62
+ :param x: the target images. It is assumed that this was uint8 values,
63
+ rescaled to the range [-1, 1].
64
+ :param means: the Gaussian mean Tensor.
65
+ :param log_scales: the Gaussian log stddev Tensor.
66
+ :return: a tensor like x of log probabilities (in nats).
67
+ """
68
+ assert x.shape == means.shape == log_scales.shape
69
+ centered_x = x - means
70
+ inv_stdv = th.exp(-log_scales)
71
+ plus_in = inv_stdv * (centered_x + 1.0 / 255.0)
72
+ cdf_plus = approx_standard_normal_cdf(plus_in)
73
+ min_in = inv_stdv * (centered_x - 1.0 / 255.0)
74
+ cdf_min = approx_standard_normal_cdf(min_in)
75
+ log_cdf_plus = th.log(cdf_plus.clamp(min=1e-12))
76
+ log_one_minus_cdf_min = th.log((1.0 - cdf_min).clamp(min=1e-12))
77
+ cdf_delta = cdf_plus - cdf_min
78
+ log_probs = th.where(
79
+ x < -0.999,
80
+ log_cdf_plus,
81
+ th.where(x > 0.999, log_one_minus_cdf_min, th.log(cdf_delta.clamp(min=1e-12))),
82
+ )
83
+ assert log_probs.shape == x.shape
84
+ return log_probs
85
+
86
+
87
+ def mean_flat(tensor):
88
+ """
89
+ Take the mean over all non-batch dimensions.
90
+ """
91
+ return tensor.mean(dim=list(range(1, len(tensor.shape))))
92
+
93
+
94
+ def get_named_beta_schedule(schedule_name, num_diffusion_timesteps):
95
+ """
96
+ Get a pre-defined beta schedule for the given name.
97
+
98
+ The beta schedule library consists of beta schedules which remain similar
99
+ in the limit of num_diffusion_timesteps.
100
+ Beta schedules may be added, but should not be removed or changed once
101
+ they are committed to maintain backwards compatibility.
102
+ """
103
+ if schedule_name == "linear":
104
+ # Linear schedule from Ho et al, extended to work for any number of
105
+ # diffusion steps.
106
+ scale = 1000 / num_diffusion_timesteps
107
+ beta_start = scale * 0.0001
108
+ beta_end = scale * 0.02
109
+ return np.linspace(
110
+ beta_start, beta_end, num_diffusion_timesteps, dtype=np.float64
111
+ )
112
+ elif schedule_name == "cosine":
113
+ return betas_for_alpha_bar(
114
+ num_diffusion_timesteps,
115
+ lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
116
+ )
117
+ else:
118
+ raise NotImplementedError(f"unknown beta schedule: {schedule_name}")
119
+
120
+
121
+ def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
122
+ """
123
+ Create a beta schedule that discretizes the given alpha_t_bar function,
124
+ which defines the cumulative product of (1-beta) over time from t = [0,1].
125
+
126
+ :param num_diffusion_timesteps: the number of betas to produce.
127
+ :param alpha_bar: a lambda that takes an argument t from 0 to 1 and
128
+ produces the cumulative product of (1-beta) up to that
129
+ part of the diffusion process.
130
+ :param max_beta: the maximum beta to use; use values lower than 1 to
131
+ prevent singularities.
132
+ """
133
+ betas = []
134
+ for i in range(num_diffusion_timesteps):
135
+ t1 = i / num_diffusion_timesteps
136
+ t2 = (i + 1) / num_diffusion_timesteps
137
+ betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
138
+ return np.array(betas)
139
+
140
+
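To make the schedule construction concrete, here is a small sketch (not part of the diff) that builds the cosine schedule exactly as `get_named_beta_schedule` does above and checks two properties the samplers rely on:

```python
import math
import numpy as np

betas = betas_for_alpha_bar(
    1000,
    lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
)
alphas_cumprod = np.cumprod(1.0 - betas)

assert np.all(betas > 0) and np.all(betas <= 0.999)  # max_beta caps each step
assert np.all(np.diff(alphas_cumprod) < 0)           # noise accumulates monotonically
```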
141
+ class ModelMeanType(enum.Enum):
142
+ """
143
+ Which type of output the model predicts.
144
+ """
145
+
146
+ PREVIOUS_X = 'previous_x' # the model predicts x_{t-1}
147
+ START_X = 'start_x' # the model predicts x_0
148
+ EPSILON = 'epsilon' # the model predicts epsilon
149
+
150
+
151
+ class ModelVarType(enum.Enum):
152
+ """
153
+ What is used as the model's output variance.
154
+
155
+ The LEARNED_RANGE option has been added to allow the model to predict
156
+ values between FIXED_SMALL and FIXED_LARGE, making its job easier.
157
+ """
158
+
159
+ LEARNED = 'learned'
160
+ FIXED_SMALL = 'fixed_small'
161
+ FIXED_LARGE = 'fixed_large'
162
+ LEARNED_RANGE = 'learned_range'
163
+
164
+
165
+ class LossType(enum.Enum):
166
+ MSE = 'mse' # use raw MSE loss (and KL when learning variances)
167
+ RESCALED_MSE = 'rescaled_mse' # use raw MSE loss (with RESCALED_KL when learning variances)
168
+ KL = 'kl' # use the variational lower-bound
169
+ RESCALED_KL = 'rescaled_kl' # like KL, but rescale to estimate the full VLB
170
+
171
+ def is_vb(self):
172
+ return self == LossType.KL or self == LossType.RESCALED_KL
173
+
174
+
175
+ class GaussianDiffusion:
176
+ """
177
+ Utilities for training and sampling diffusion models.
178
+
179
+ Ported directly from here, and then adapted over time to further experimentation.
180
+ https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/diffusion_utils_2.py#L42
181
+
182
+ :param betas: a 1-D numpy array of betas for each diffusion timestep,
183
+ starting at T and going to 1.
184
+ :param model_mean_type: a ModelMeanType determining what the model outputs.
185
+ :param model_var_type: a ModelVarType determining how variance is output.
186
+ :param loss_type: a LossType determining the loss function to use.
187
+ :param rescale_timesteps: if True, pass floating point timesteps into the
188
+ model so that they are always scaled like in the
189
+ original paper (0 to 1000).
190
+ """
191
+
192
+ def __init__(
193
+ self,
194
+ *,
195
+ betas,
196
+ model_mean_type,
197
+ model_var_type,
198
+ loss_type,
199
+ rescale_timesteps=False,
200
+ conditioning_free=False,
201
+ conditioning_free_k=1,
202
+ ramp_conditioning_free=True,
203
+ ):
204
+ self.model_mean_type = ModelMeanType(model_mean_type)
205
+ self.model_var_type = ModelVarType(model_var_type)
206
+ self.loss_type = LossType(loss_type)
207
+ self.rescale_timesteps = rescale_timesteps
208
+ self.conditioning_free = conditioning_free
209
+ self.conditioning_free_k = conditioning_free_k
210
+ self.ramp_conditioning_free = ramp_conditioning_free
211
+
212
+ # Use float64 for accuracy.
213
+ betas = np.array(betas, dtype=np.float64)
214
+ self.betas = betas
215
+ assert len(betas.shape) == 1, "betas must be 1-D"
216
+ assert (betas > 0).all() and (betas <= 1).all()
217
+
218
+ self.num_timesteps = int(betas.shape[0])
219
+
220
+ alphas = 1.0 - betas
221
+ self.alphas_cumprod = np.cumprod(alphas, axis=0)
222
+ self.alphas_cumprod_prev = np.append(1.0, self.alphas_cumprod[:-1])
223
+ self.alphas_cumprod_next = np.append(self.alphas_cumprod[1:], 0.0)
224
+ assert self.alphas_cumprod_prev.shape == (self.num_timesteps,)
225
+
226
+ # calculations for diffusion q(x_t | x_{t-1}) and others
227
+ self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod)
228
+ self.sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - self.alphas_cumprod)
229
+ self.log_one_minus_alphas_cumprod = np.log(1.0 - self.alphas_cumprod)
230
+ self.sqrt_recip_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod)
231
+ self.sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod - 1)
232
+
233
+ # calculations for posterior q(x_{t-1} | x_t, x_0)
234
+ self.posterior_variance = (
235
+ betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
236
+ )
237
+ # log calculation clipped because the posterior variance is 0 at the
238
+ # beginning of the diffusion chain.
239
+ self.posterior_log_variance_clipped = np.log(
240
+ np.append(self.posterior_variance[1], self.posterior_variance[1:])
241
+ )
242
+ self.posterior_mean_coef1 = (
243
+ betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
244
+ )
245
+ self.posterior_mean_coef2 = (
246
+ (1.0 - self.alphas_cumprod_prev)
247
+ * np.sqrt(alphas)
248
+ / (1.0 - self.alphas_cumprod)
249
+ )
250
+
251
+ def q_mean_variance(self, x_start, t):
252
+ """
253
+ Get the distribution q(x_t | x_0).
254
+
255
+ :param x_start: the [N x C x ...] tensor of noiseless inputs.
256
+ :param t: the number of diffusion steps (minus 1). Here, 0 means one step.
257
+ :return: A tuple (mean, variance, log_variance), all of x_start's shape.
258
+ """
259
+ mean = (
260
+ _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
261
+ )
262
+ variance = _extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape)
263
+ log_variance = _extract_into_tensor(
264
+ self.log_one_minus_alphas_cumprod, t, x_start.shape
265
+ )
266
+ return mean, variance, log_variance
267
+
268
+ def q_sample(self, x_start, t, noise=None):
269
+ """
270
+ Diffuse the data for a given number of diffusion steps.
271
+
272
+ In other words, sample from q(x_t | x_0).
273
+
274
+ :param x_start: the initial data batch.
275
+ :param t: the number of diffusion steps (minus 1). Here, 0 means one step.
276
+ :param noise: if specified, the split-out normal noise.
277
+ :return: A noisy version of x_start.
278
+ """
279
+ if noise is None:
280
+ noise = th.randn_like(x_start)
281
+ assert noise.shape == x_start.shape
282
+ return (
283
+ _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
284
+ + _extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape)
285
+ * noise
286
+ )
287
+
288
+ def q_posterior_mean_variance(self, x_start, x_t, t):
289
+ """
290
+ Compute the mean and variance of the diffusion posterior:
291
+
292
+ q(x_{t-1} | x_t, x_0)
293
+
294
+ """
295
+ assert x_start.shape == x_t.shape
296
+ posterior_mean = (
297
+ _extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start
298
+ + _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
299
+ )
300
+ posterior_variance = _extract_into_tensor(self.posterior_variance, t, x_t.shape)
301
+ posterior_log_variance_clipped = _extract_into_tensor(
302
+ self.posterior_log_variance_clipped, t, x_t.shape
303
+ )
304
+ assert (
305
+ posterior_mean.shape[0]
306
+ == posterior_variance.shape[0]
307
+ == posterior_log_variance_clipped.shape[0]
308
+ == x_start.shape[0]
309
+ )
310
+ return posterior_mean, posterior_variance, posterior_log_variance_clipped
311
+
312
+ def p_mean_variance(
313
+ self, model, x, t, clip_denoised=True, denoised_fn=None, model_kwargs=None
314
+ ):
315
+ """
316
+ Apply the model to get p(x_{t-1} | x_t), as well as a prediction of
317
+ the initial x, x_0.
318
+
319
+ :param model: the model, which takes a signal and a batch of timesteps
320
+ as input.
321
+ :param x: the [N x C x ...] tensor at time t.
322
+ :param t: a 1-D Tensor of timesteps.
323
+ :param clip_denoised: if True, clip the denoised signal into [-1, 1].
324
+ :param denoised_fn: if not None, a function which applies to the
325
+ x_start prediction before it is used to sample. Applies before
326
+ clip_denoised.
327
+ :param model_kwargs: if not None, a dict of extra keyword arguments to
328
+ pass to the model. This can be used for conditioning.
329
+ :return: a dict with the following keys:
330
+ - 'mean': the model mean output.
331
+ - 'variance': the model variance output.
332
+ - 'log_variance': the log of 'variance'.
333
+ - 'pred_xstart': the prediction for x_0.
334
+ """
335
+ if model_kwargs is None:
336
+ model_kwargs = {}
337
+
338
+ B, C = x.shape[:2]
339
+ assert t.shape == (B,)
340
+ model_output = model(x, self._scale_timesteps(t), **model_kwargs)
341
+ if self.conditioning_free:
342
+ model_output_no_conditioning = model(x, self._scale_timesteps(t), conditioning_free=True, **model_kwargs)
343
+
344
+ if self.model_var_type in [ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE]:
345
+ assert model_output.shape == (B, C * 2, *x.shape[2:])
346
+ model_output, model_var_values = th.split(model_output, C, dim=1)
347
+ if self.conditioning_free:
348
+ model_output_no_conditioning, _ = th.split(model_output_no_conditioning, C, dim=1)
349
+ if self.model_var_type == ModelVarType.LEARNED:
350
+ model_log_variance = model_var_values
351
+ model_variance = th.exp(model_log_variance)
352
+ else:
353
+ min_log = _extract_into_tensor(
354
+ self.posterior_log_variance_clipped, t, x.shape
355
+ )
356
+ max_log = _extract_into_tensor(np.log(self.betas), t, x.shape)
357
+ # The model_var_values is [-1, 1] for [min_var, max_var].
358
+ frac = (model_var_values + 1) / 2
359
+ model_log_variance = frac * max_log + (1 - frac) * min_log
360
+ model_variance = th.exp(model_log_variance)
361
+ else:
362
+ model_variance, model_log_variance = {
363
+ # for fixedlarge, we set the initial (log-)variance like so
364
+ # to get a better decoder log likelihood.
365
+ ModelVarType.FIXED_LARGE: (
366
+ np.append(self.posterior_variance[1], self.betas[1:]),
367
+ np.log(np.append(self.posterior_variance[1], self.betas[1:])),
368
+ ),
369
+ ModelVarType.FIXED_SMALL: (
370
+ self.posterior_variance,
371
+ self.posterior_log_variance_clipped,
372
+ ),
373
+ }[self.model_var_type]
374
+ model_variance = _extract_into_tensor(model_variance, t, x.shape)
375
+ model_log_variance = _extract_into_tensor(model_log_variance, t, x.shape)
376
+
377
+ if self.conditioning_free:
378
+ if self.ramp_conditioning_free:
379
+ assert t.shape[0] == 1 # This should only be used in inference.
380
+ cfk = self.conditioning_free_k * (1 - self._scale_timesteps(t)[0].item() / self.num_timesteps)
381
+ else:
382
+ cfk = self.conditioning_free_k
383
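+ # Classifier-free guidance: extrapolate the conditioned prediction away from the unconditioned one by weight cfk.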
+ model_output = (1 + cfk) * model_output - cfk * model_output_no_conditioning
384
+
385
+ def process_xstart(x):
386
+ if denoised_fn is not None:
387
+ x = denoised_fn(x)
388
+ if clip_denoised:
389
+ return x.clamp(-1, 1)
390
+ return x
391
+
392
+ if self.model_mean_type == ModelMeanType.PREVIOUS_X:
393
+ pred_xstart = process_xstart(
394
+ self._predict_xstart_from_xprev(x_t=x, t=t, xprev=model_output)
395
+ )
396
+ model_mean = model_output
397
+ elif self.model_mean_type in [ModelMeanType.START_X, ModelMeanType.EPSILON]:
398
+ if self.model_mean_type == ModelMeanType.START_X:
399
+ pred_xstart = process_xstart(model_output)
400
+ else:
401
+ pred_xstart = process_xstart(
402
+ self._predict_xstart_from_eps(x_t=x, t=t, eps=model_output)
403
+ )
404
+ model_mean, _, _ = self.q_posterior_mean_variance(
405
+ x_start=pred_xstart, x_t=x, t=t
406
+ )
407
+ else:
408
+ raise NotImplementedError(self.model_mean_type)
409
+
410
+ assert (
411
+ model_mean.shape == model_log_variance.shape == pred_xstart.shape == x.shape
412
+ )
413
+ return {
414
+ "mean": model_mean,
415
+ "variance": model_variance,
416
+ "log_variance": model_log_variance,
417
+ "pred_xstart": pred_xstart,
418
+ }
419
+
420
+ def _predict_xstart_from_eps(self, x_t, t, eps):
421
+ assert x_t.shape == eps.shape
422
+ return (
423
+ _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t
424
+ - _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * eps
425
+ )
426
+
427
+ def _predict_xstart_from_xprev(self, x_t, t, xprev):
428
+ assert x_t.shape == xprev.shape
429
+ return ( # (xprev - coef2*x_t) / coef1
430
+ _extract_into_tensor(1.0 / self.posterior_mean_coef1, t, x_t.shape) * xprev
431
+ - _extract_into_tensor(
432
+ self.posterior_mean_coef2 / self.posterior_mean_coef1, t, x_t.shape
433
+ )
434
+ * x_t
435
+ )
436
+
437
+ def _predict_eps_from_xstart(self, x_t, t, pred_xstart):
438
+ return (
439
+ _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t
440
+ - pred_xstart
441
+ ) / _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape)
442
+
443
+ def _scale_timesteps(self, t):
444
+ if self.rescale_timesteps:
445
+ return t.float() * (1000.0 / self.num_timesteps)
446
+ return t
447
+
448
+ def condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
449
+ """
450
+ Compute the mean for the previous step, given a function cond_fn that
451
+ computes the gradient of a conditional log probability with respect to
452
+ x. In particular, cond_fn computes grad(log(p(y|x))), and we want to
453
+ condition on y.
454
+
455
+ This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
456
+ """
457
+ gradient = cond_fn(x, self._scale_timesteps(t), **model_kwargs)
458
+ new_mean = (
459
+ p_mean_var["mean"].float() + p_mean_var["variance"] * gradient.float()
460
+ )
461
+ return new_mean
462
+
463
+ def condition_score(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
464
+ """
465
+ Compute what the p_mean_variance output would have been, should the
466
+ model's score function be conditioned by cond_fn.
467
+
468
+ See condition_mean() for details on cond_fn.
469
+
470
+ Unlike condition_mean(), this instead uses the conditioning strategy
471
+ from Song et al (2020).
472
+ """
473
+ alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
474
+
475
+ eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
476
+ eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
477
+ x, self._scale_timesteps(t), **model_kwargs
478
+ )
479
+
480
+ out = p_mean_var.copy()
481
+ out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
482
+ out["mean"], _, _ = self.q_posterior_mean_variance(
483
+ x_start=out["pred_xstart"], x_t=x, t=t
484
+ )
485
+ return out
486
+
487
+ def p_sample(
488
+ self,
489
+ model,
490
+ x,
491
+ t,
492
+ clip_denoised=True,
493
+ denoised_fn=None,
494
+ cond_fn=None,
495
+ model_kwargs=None,
496
+ ):
497
+ """
498
+ Sample x_{t-1} from the model at the given timestep.
499
+
500
+ :param model: the model to sample from.
501
+ :param x: the current tensor at x_{t-1}.
502
+ :param t: the value of t, starting at 0 for the first diffusion step.
503
+ :param clip_denoised: if True, clip the x_start prediction to [-1, 1].
504
+ :param denoised_fn: if not None, a function which applies to the
505
+ x_start prediction before it is used to sample.
506
+ :param cond_fn: if not None, this is a gradient function that acts
507
+ similarly to the model.
508
+ :param model_kwargs: if not None, a dict of extra keyword arguments to
509
+ pass to the model. This can be used for conditioning.
510
+ :return: a dict containing the following keys:
511
+ - 'sample': a random sample from the model.
512
+ - 'pred_xstart': a prediction of x_0.
513
+ """
514
+ out = self.p_mean_variance(
515
+ model,
516
+ x,
517
+ t,
518
+ clip_denoised=clip_denoised,
519
+ denoised_fn=denoised_fn,
520
+ model_kwargs=model_kwargs,
521
+ )
522
+ noise = th.randn_like(x)
523
+ nonzero_mask = (
524
+ (t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
525
+ ) # no noise when t == 0
526
+ if cond_fn is not None:
527
+ out["mean"] = self.condition_mean(
528
+ cond_fn, out, x, t, model_kwargs=model_kwargs
529
+ )
530
+ sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
531
+ return {"sample": sample, "pred_xstart": out["pred_xstart"]}
532
+
533
+ def p_sample_loop(
534
+ self,
535
+ model,
536
+ shape,
537
+ noise=None,
538
+ clip_denoised=True,
539
+ denoised_fn=None,
540
+ cond_fn=None,
541
+ model_kwargs=None,
542
+ device=None,
543
+ progress=False,
544
+ ):
545
+ """
546
+ Generate samples from the model.
547
+
548
+ :param model: the model module.
549
+ :param shape: the shape of the samples, (N, C, H, W).
550
+ :param noise: if specified, the noise from the encoder to sample.
551
+ Should be of the same shape as `shape`.
552
+ :param clip_denoised: if True, clip x_start predictions to [-1, 1].
553
+ :param denoised_fn: if not None, a function which applies to the
554
+ x_start prediction before it is used to sample.
555
+ :param cond_fn: if not None, this is a gradient function that acts
556
+ similarly to the model.
557
+ :param model_kwargs: if not None, a dict of extra keyword arguments to
558
+ pass to the model. This can be used for conditioning.
559
+ :param device: if specified, the device to create the samples on.
560
+ If not specified, use a model parameter's device.
561
+ :param progress: if True, show a tqdm progress bar.
562
+ :return: a non-differentiable batch of samples.
563
+ """
564
+ final = None
565
+ for sample in self.p_sample_loop_progressive(
566
+ model,
567
+ shape,
568
+ noise=noise,
569
+ clip_denoised=clip_denoised,
570
+ denoised_fn=denoised_fn,
571
+ cond_fn=cond_fn,
572
+ model_kwargs=model_kwargs,
573
+ device=device,
574
+ progress=progress,
575
+ ):
576
+ final = sample
577
+ return final["sample"]
578
+
579
+ def p_sample_loop_progressive(
580
+ self,
581
+ model,
582
+ shape,
583
+ noise=None,
584
+ clip_denoised=True,
585
+ denoised_fn=None,
586
+ cond_fn=None,
587
+ model_kwargs=None,
588
+ device=None,
589
+ progress=False,
590
+ ):
591
+ """
592
+ Generate samples from the model and yield intermediate samples from
593
+ each timestep of diffusion.
594
+
595
+ Arguments are the same as p_sample_loop().
596
+ Returns a generator over dicts, where each dict is the return value of
597
+ p_sample().
598
+ """
599
+ if device is None:
600
+ device = next(model.parameters()).device
601
+ assert isinstance(shape, (tuple, list))
602
+ if noise is not None:
603
+ img = noise
604
+ else:
605
+ img = th.randn(*shape, device=device)
606
+ indices = list(range(self.num_timesteps))[::-1]
607
+
608
+ for i in tqdm(indices, disable=not progress):
609
+ t = th.tensor([i] * shape[0], device=device)
610
+ with th.no_grad():
611
+ out = self.p_sample(
612
+ model,
613
+ img,
614
+ t,
615
+ clip_denoised=clip_denoised,
616
+ denoised_fn=denoised_fn,
617
+ cond_fn=cond_fn,
618
+ model_kwargs=model_kwargs,
619
+ )
620
+ yield out
621
+ img = out["sample"]
622
+
623
+ def ddim_sample(
624
+ self,
625
+ model,
626
+ x,
627
+ t,
628
+ clip_denoised=True,
629
+ denoised_fn=None,
630
+ cond_fn=None,
631
+ model_kwargs=None,
632
+ eta=0.0,
633
+ ):
634
+ """
635
+ Sample x_{t-1} from the model using DDIM.
636
+
637
+ Same usage as p_sample().
638
+ """
639
+ out = self.p_mean_variance(
640
+ model,
641
+ x,
642
+ t,
643
+ clip_denoised=clip_denoised,
644
+ denoised_fn=denoised_fn,
645
+ model_kwargs=model_kwargs,
646
+ )
647
+ if cond_fn is not None:
648
+ out = self.condition_score(cond_fn, out, x, t, model_kwargs=model_kwargs)
649
+
650
+ # Usually our model outputs epsilon, but we re-derive it
651
+ # in case we used x_start or x_prev prediction.
652
+ eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"])
653
+
654
+ alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
655
+ alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape)
656
+ sigma = (
657
+ eta
658
+ * th.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar))
659
+ * th.sqrt(1 - alpha_bar / alpha_bar_prev)
660
+ )
661
+ # Equation 12.
662
+ noise = th.randn_like(x)
663
+ mean_pred = (
664
+ out["pred_xstart"] * th.sqrt(alpha_bar_prev)
665
+ + th.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps
666
+ )
667
+ nonzero_mask = (
668
+ (t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
669
+ ) # no noise when t == 0
670
+ sample = mean_pred + nonzero_mask * sigma * noise
671
+ return {"sample": sample, "pred_xstart": out["pred_xstart"]}
672
+
673
+ def ddim_reverse_sample(
674
+ self,
675
+ model,
676
+ x,
677
+ t,
678
+ clip_denoised=True,
679
+ denoised_fn=None,
680
+ model_kwargs=None,
681
+ eta=0.0,
682
+ ):
683
+ """
684
+ Sample x_{t+1} from the model using DDIM reverse ODE.
685
+ """
686
+ assert eta == 0.0, "Reverse ODE only for deterministic path"
687
+ out = self.p_mean_variance(
688
+ model,
689
+ x,
690
+ t,
691
+ clip_denoised=clip_denoised,
692
+ denoised_fn=denoised_fn,
693
+ model_kwargs=model_kwargs,
694
+ )
695
+ # Usually our model outputs epsilon, but we re-derive it
696
+ # in case we used x_start or x_prev prediction.
697
+ eps = (
698
+ _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x.shape) * x
699
+ - out["pred_xstart"]
700
+ ) / _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x.shape)
701
+ alpha_bar_next = _extract_into_tensor(self.alphas_cumprod_next, t, x.shape)
702
+
703
+ # Equation 12. reversed
704
+ mean_pred = (
705
+ out["pred_xstart"] * th.sqrt(alpha_bar_next)
706
+ + th.sqrt(1 - alpha_bar_next) * eps
707
+ )
708
+
709
+ return {"sample": mean_pred, "pred_xstart": out["pred_xstart"]}
710
+
711
+ def ddim_sample_loop(
712
+ self,
713
+ model,
714
+ shape,
715
+ noise=None,
716
+ clip_denoised=True,
717
+ denoised_fn=None,
718
+ cond_fn=None,
719
+ model_kwargs=None,
720
+ device=None,
721
+ progress=False,
722
+ eta=0.0,
723
+ ):
724
+ """
725
+ Generate samples from the model using DDIM.
726
+
727
+ Same usage as p_sample_loop().
728
+ """
729
+ final = None
730
+ for sample in self.ddim_sample_loop_progressive(
731
+ model,
732
+ shape,
733
+ noise=noise,
734
+ clip_denoised=clip_denoised,
735
+ denoised_fn=denoised_fn,
736
+ cond_fn=cond_fn,
737
+ model_kwargs=model_kwargs,
738
+ device=device,
739
+ progress=progress,
740
+ eta=eta,
741
+ ):
742
+ final = sample
743
+ return final["sample"]
744
+
745
+ def ddim_sample_loop_progressive(
746
+ self,
747
+ model,
748
+ shape,
749
+ noise=None,
750
+ clip_denoised=True,
751
+ denoised_fn=None,
752
+ cond_fn=None,
753
+ model_kwargs=None,
754
+ device=None,
755
+ progress=False,
756
+ eta=0.0,
757
+ ):
758
+ """
759
+ Use DDIM to sample from the model and yield intermediate samples from
760
+ each timestep of DDIM.
761
+
762
+ Same usage as p_sample_loop_progressive().
763
+ """
764
+ if device is None:
765
+ device = next(model.parameters()).device
766
+ assert isinstance(shape, (tuple, list))
767
+ if noise is not None:
768
+ img = noise
769
+ else:
770
+ img = th.randn(*shape, device=device)
771
+ indices = list(range(self.num_timesteps))[::-1]
772
+
773
+ if progress:
774
+ # Lazy import so that we don't depend on tqdm.
775
+ from tqdm.auto import tqdm
776
+
777
+ indices = tqdm(indices, disable=not progress)
778
+
779
+ for i in indices:
780
+ t = th.tensor([i] * shape[0], device=device)
781
+ with th.no_grad():
782
+ out = self.ddim_sample(
783
+ model,
784
+ img,
785
+ t,
786
+ clip_denoised=clip_denoised,
787
+ denoised_fn=denoised_fn,
788
+ cond_fn=cond_fn,
789
+ model_kwargs=model_kwargs,
790
+ eta=eta,
791
+ )
792
+ yield out
793
+ img = out["sample"]
794
+
795
+ def _vb_terms_bpd(
796
+ self, model, x_start, x_t, t, clip_denoised=True, model_kwargs=None
797
+ ):
798
+ """
799
+ Get a term for the variational lower-bound.
800
+
801
+ The resulting units are bits (rather than nats, as one might expect).
802
+ This allows for comparison to other papers.
803
+
804
+ :return: a dict with the following keys:
805
+ - 'output': a shape [N] tensor of NLLs or KLs.
806
+ - 'pred_xstart': the x_0 predictions.
807
+ """
808
+ true_mean, _, true_log_variance_clipped = self.q_posterior_mean_variance(
809
+ x_start=x_start, x_t=x_t, t=t
810
+ )
811
+ out = self.p_mean_variance(
812
+ model, x_t, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs
813
+ )
814
+ kl = normal_kl(
815
+ true_mean, true_log_variance_clipped, out["mean"], out["log_variance"]
816
+ )
817
+ kl = mean_flat(kl) / np.log(2.0)
818
+
819
+ decoder_nll = -discretized_gaussian_log_likelihood(
820
+ x_start, means=out["mean"], log_scales=0.5 * out["log_variance"]
821
+ )
822
+ assert decoder_nll.shape == x_start.shape
823
+ decoder_nll = mean_flat(decoder_nll) / np.log(2.0)
824
+
825
+ # At the first timestep return the decoder NLL,
826
+ # otherwise return KL(q(x_{t-1}|x_t,x_0) || p(x_{t-1}|x_t))
827
+ output = th.where((t == 0), decoder_nll, kl)
828
+ return {"output": output, "pred_xstart": out["pred_xstart"]}
829
+
830
+ def training_losses(self, model, x_start, t, model_kwargs=None, noise=None):
831
+ """
832
+ Compute training losses for a single timestep.
833
+
834
+ :param model: the model to evaluate loss on.
835
+ :param x_start: the [N x C x ...] tensor of inputs.
836
+ :param t: a batch of timestep indices.
837
+ :param model_kwargs: if not None, a dict of extra keyword arguments to
838
+ pass to the model. This can be used for conditioning.
839
+ :param noise: if specified, the specific Gaussian noise to try to remove.
840
+ :return: a dict with the key "loss" containing a tensor of shape [N].
841
+ Some mean or variance settings may also have other keys.
842
+ """
843
+ if model_kwargs is None:
844
+ model_kwargs = {}
845
+ if noise is None:
846
+ noise = th.randn_like(x_start)
847
+ x_t = self.q_sample(x_start, t, noise=noise)
848
+
849
+ terms = {}
850
+
851
+ if self.loss_type == LossType.KL or self.loss_type == LossType.RESCALED_KL:
852
+ # TODO: support multiple model outputs for this mode.
853
+ terms["loss"] = self._vb_terms_bpd(
854
+ model=model,
855
+ x_start=x_start,
856
+ x_t=x_t,
857
+ t=t,
858
+ clip_denoised=False,
859
+ model_kwargs=model_kwargs,
860
+ )["output"]
861
+ if self.loss_type == LossType.RESCALED_KL:
862
+ terms["loss"] *= self.num_timesteps
863
+ elif self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE:
864
+ model_outputs = model(x_t, self._scale_timesteps(t), **model_kwargs)
865
+ if isinstance(model_outputs, tuple):
866
+ model_output = model_outputs[0]
867
+ terms['extra_outputs'] = model_outputs[1:]
868
+ else:
869
+ model_output = model_outputs
870
+
871
+ if self.model_var_type in [
872
+ ModelVarType.LEARNED,
873
+ ModelVarType.LEARNED_RANGE,
874
+ ]:
875
+ B, C = x_t.shape[:2]
876
+ assert model_output.shape == (B, C * 2, *x_t.shape[2:])
877
+ model_output, model_var_values = th.split(model_output, C, dim=1)
878
+ # Learn the variance using the variational bound, but don't let
879
+ # it affect our mean prediction.
880
+ frozen_out = th.cat([model_output.detach(), model_var_values], dim=1)
881
+ terms["vb"] = self._vb_terms_bpd(
882
+ model=lambda *args, r=frozen_out: r,
883
+ x_start=x_start,
884
+ x_t=x_t,
885
+ t=t,
886
+ clip_denoised=False,
887
+ )["output"]
888
+ if self.loss_type == LossType.RESCALED_MSE:
889
+ # Divide by 1000 for equivalence with initial implementation.
890
+ # Without a factor of 1/1000, the VB term hurts the MSE term.
891
+ terms["vb"] *= self.num_timesteps / 1000.0
892
+
893
+ if self.model_mean_type == ModelMeanType.PREVIOUS_X:
894
+ target = self.q_posterior_mean_variance(
895
+ x_start=x_start, x_t=x_t, t=t
896
+ )[0]
897
+ x_start_pred = torch.zeros_like(x_start) # Not supported.
898
+ elif self.model_mean_type == ModelMeanType.START_X:
899
+ target = x_start
900
+ x_start_pred = model_output
901
+ elif self.model_mean_type == ModelMeanType.EPSILON:
902
+ target = noise
903
+ x_start_pred = self._predict_xstart_from_eps(x_t, t, model_output)
904
+ else:
905
+ raise NotImplementedError(self.model_mean_type)
906
+ assert model_output.shape == target.shape == x_start.shape
907
+ terms["mse"] = mean_flat((target - model_output) ** 2)
908
+ terms["x_start_predicted"] = x_start_pred
909
+ if "vb" in terms:
910
+ terms["loss"] = terms["mse"] + terms["vb"]
911
+ else:
912
+ terms["loss"] = terms["mse"]
913
+ else:
914
+ raise NotImplementedError(self.loss_type)
915
+
916
+ return terms
917
+
918
+    def autoregressive_training_losses(self, model, x_start, t, model_output_keys, gd_out_key, model_kwargs=None, noise=None):
+        """
+        Compute training losses for a single timestep for an autoregressive model.
+
+        :param model: the model to evaluate loss on.
+        :param x_start: the [N x C x ...] tensor of inputs.
+        :param t: a batch of timestep indices.
+        :param model_output_keys: keys under which each element of the model's output tuple is recorded.
+        :param gd_out_key: the key of the output that the diffusion losses are computed against.
+        :param model_kwargs: if not None, a dict of extra keyword arguments to
+            pass to the model. This can be used for conditioning.
+        :param noise: if specified, the specific Gaussian noise to try to remove.
+        :return: a dict with the key "loss" containing a tensor of shape [N].
+                 Some mean or variance settings may also have other keys.
+        """
+        if model_kwargs is None:
+            model_kwargs = {}
+        if noise is None:
+            noise = th.randn_like(x_start)
+        x_t = self.q_sample(x_start, t, noise=noise)
+        terms = {}
+        if self.loss_type == LossType.KL or self.loss_type == LossType.RESCALED_KL:
+            assert False  # not currently supported for this type of diffusion.
+        elif self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE:
+            model_outputs = model(x_t, x_start, self._scale_timesteps(t), **model_kwargs)
+            terms.update({k: o for k, o in zip(model_output_keys, model_outputs)})
+            model_output = terms[gd_out_key]
+            if self.model_var_type in [
+                ModelVarType.LEARNED,
+                ModelVarType.LEARNED_RANGE,
+            ]:
+                B, C = x_t.shape[:2]
+                assert model_output.shape == (B, C, 2, *x_t.shape[2:])
+                model_output, model_var_values = model_output[:, :, 0], model_output[:, :, 1]
+                # Learn the variance using the variational bound, but don't let
+                # it affect our mean prediction.
+                frozen_out = th.cat([model_output.detach(), model_var_values], dim=1)
+                terms["vb"] = self._vb_terms_bpd(
+                    model=lambda *args, r=frozen_out: r,
+                    x_start=x_start,
+                    x_t=x_t,
+                    t=t,
+                    clip_denoised=False,
+                )["output"]
+                if self.loss_type == LossType.RESCALED_MSE:
+                    # Divide by 1000 for equivalence with initial implementation.
+                    # Without a factor of 1/1000, the VB term hurts the MSE term.
+                    terms["vb"] *= self.num_timesteps / 1000.0
+
+            if self.model_mean_type == ModelMeanType.PREVIOUS_X:
+                target = self.q_posterior_mean_variance(
+                    x_start=x_start, x_t=x_t, t=t
+                )[0]
+                x_start_pred = th.zeros_like(x_start)  # Predicting x_start is not supported for this mean type.
+            elif self.model_mean_type == ModelMeanType.START_X:
+                target = x_start
+                x_start_pred = model_output
+            elif self.model_mean_type == ModelMeanType.EPSILON:
+                target = noise
+                x_start_pred = self._predict_xstart_from_eps(x_t, t, model_output)
+            else:
+                raise NotImplementedError(self.model_mean_type)
+            assert model_output.shape == target.shape == x_start.shape
+            terms["mse"] = mean_flat((target - model_output) ** 2)
+            terms["x_start_predicted"] = x_start_pred
+            if "vb" in terms:
+                terms["loss"] = terms["mse"] + terms["vb"]
+            else:
+                terms["loss"] = terms["mse"]
+        else:
+            raise NotImplementedError(self.loss_type)
+
+        return terms
+
+    def _prior_bpd(self, x_start):
+        """
+        Get the prior KL term for the variational lower-bound, measured in
+        bits-per-dim.
+
+        This term can't be optimized, as it only depends on the encoder.
+
+        :param x_start: the [N x C x ...] tensor of inputs.
+        :return: a batch of [N] KL values (in bits), one per batch element.
+        """
+        batch_size = x_start.shape[0]
+        t = th.tensor([self.num_timesteps - 1] * batch_size, device=x_start.device)
+        qt_mean, _, qt_log_variance = self.q_mean_variance(x_start, t)
+        kl_prior = normal_kl(
+            mean1=qt_mean, logvar1=qt_log_variance, mean2=0.0, logvar2=0.0
+        )
+        return mean_flat(kl_prior) / np.log(2.0)
+
+    def calc_bpd_loop(self, model, x_start, clip_denoised=True, model_kwargs=None):
+        """
+        Compute the entire variational lower-bound, measured in bits-per-dim,
+        as well as other related quantities.
+
+        :param model: the model to evaluate loss on.
+        :param x_start: the [N x C x ...] tensor of inputs.
+        :param clip_denoised: if True, clip denoised samples.
+        :param model_kwargs: if not None, a dict of extra keyword arguments to
+            pass to the model. This can be used for conditioning.
+
+        :return: a dict containing the following keys:
+                 - total_bpd: the total variational lower-bound, per batch element.
+                 - prior_bpd: the prior term in the lower-bound.
+                 - vb: an [N x T] tensor of terms in the lower-bound.
+                 - xstart_mse: an [N x T] tensor of x_0 MSEs for each timestep.
+                 - mse: an [N x T] tensor of epsilon MSEs for each timestep.
+        """
+        device = x_start.device
+        batch_size = x_start.shape[0]
+
+        vb = []
+        xstart_mse = []
+        mse = []
+        for t in list(range(self.num_timesteps))[::-1]:
+            t_batch = th.tensor([t] * batch_size, device=device)
+            noise = th.randn_like(x_start)
+            x_t = self.q_sample(x_start=x_start, t=t_batch, noise=noise)
+            # Calculate VLB term at the current timestep
+            with th.no_grad():
+                out = self._vb_terms_bpd(
+                    model,
+                    x_start=x_start,
+                    x_t=x_t,
+                    t=t_batch,
+                    clip_denoised=clip_denoised,
+                    model_kwargs=model_kwargs,
+                )
+            vb.append(out["output"])
+            xstart_mse.append(mean_flat((out["pred_xstart"] - x_start) ** 2))
+            eps = self._predict_eps_from_xstart(x_t, t_batch, out["pred_xstart"])
+            mse.append(mean_flat((eps - noise) ** 2))
+
+        vb = th.stack(vb, dim=1)
+        xstart_mse = th.stack(xstart_mse, dim=1)
+        mse = th.stack(mse, dim=1)
+
+        prior_bpd = self._prior_bpd(x_start)
+        total_bpd = vb.sum(dim=1) + prior_bpd
+        return {
+            "total_bpd": total_bpd,
+            "prior_bpd": prior_bpd,
+            "vb": vb,
+            "xstart_mse": xstart_mse,
+            "mse": mse,
+        }
+
+
+def get_named_beta_schedule(schedule_name, num_diffusion_timesteps):
+    """
+    Get a pre-defined beta schedule for the given name.
+
+    The beta schedule library consists of beta schedules which remain similar
+    in the limit of num_diffusion_timesteps.
+    Beta schedules may be added, but should not be removed or changed once
+    they are committed, to maintain backwards compatibility.
+    """
+    if schedule_name == "linear":
+        # Linear schedule from Ho et al, extended to work for any number of
+        # diffusion steps.
+        scale = 1000 / num_diffusion_timesteps
+        beta_start = scale * 0.0001
+        beta_end = scale * 0.02
+        return np.linspace(
+            beta_start, beta_end, num_diffusion_timesteps, dtype=np.float64
+        )
+    elif schedule_name == "cosine":
+        return betas_for_alpha_bar(
+            num_diffusion_timesteps,
+            lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
+        )
+    else:
+        raise NotImplementedError(f"unknown beta schedule: {schedule_name}")
+
+
+class SpacedDiffusion(GaussianDiffusion):
+    """
+    A diffusion process which can skip steps in a base diffusion process.
+
+    :param use_timesteps: a collection (sequence or set) of timesteps from the
+                          original diffusion process to retain.
+    :param kwargs: the kwargs to create the base diffusion process.
+    """
+
+    def __init__(self, use_timesteps, **kwargs):
+        self.use_timesteps = set(use_timesteps)
+        self.timestep_map = []
+        self.original_num_steps = len(kwargs["betas"])
+
+        base_diffusion = GaussianDiffusion(**kwargs)  # pylint: disable=missing-kwoa
+        last_alpha_cumprod = 1.0
+        new_betas = []
+        for i, alpha_cumprod in enumerate(base_diffusion.alphas_cumprod):
+            if i in self.use_timesteps:
+                new_betas.append(1 - alpha_cumprod / last_alpha_cumprod)
+                last_alpha_cumprod = alpha_cumprod
+                self.timestep_map.append(i)
+        kwargs["betas"] = np.array(new_betas)
+        super().__init__(**kwargs)
+
+    def p_mean_variance(
+        self, model, *args, **kwargs
+    ):  # pylint: disable=signature-differs
+        return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
+
+    def training_losses(
+        self, model, *args, **kwargs
+    ):  # pylint: disable=signature-differs
+        return super().training_losses(self._wrap_model(model), *args, **kwargs)
+
+    def autoregressive_training_losses(
+        self, model, *args, **kwargs
+    ):  # pylint: disable=signature-differs
+        return super().autoregressive_training_losses(self._wrap_model(model, True), *args, **kwargs)
+
+    def condition_mean(self, cond_fn, *args, **kwargs):
+        return super().condition_mean(self._wrap_model(cond_fn), *args, **kwargs)
+
+    def condition_score(self, cond_fn, *args, **kwargs):
+        return super().condition_score(self._wrap_model(cond_fn), *args, **kwargs)
+
+    def _wrap_model(self, model, autoregressive=False):
+        if isinstance(model, (_WrappedModel, _WrappedAutoregressiveModel)):
+            return model
+        mod = _WrappedAutoregressiveModel if autoregressive else _WrappedModel
+        return mod(
+            model, self.timestep_map, self.rescale_timesteps, self.original_num_steps
+        )
+
+    def _scale_timesteps(self, t):
+        # Scaling is done by the wrapped model.
+        return t
+
+
+def space_timesteps(num_timesteps, section_counts):
+    """
+    Create a list of timesteps to use from an original diffusion process,
+    given the number of timesteps we want to take from equally-sized portions
+    of the original process.
+
+    For example, if there are 300 timesteps and the section counts are [10,15,20]
+    then the first 100 timesteps are strided to be 10 timesteps, the second 100
+    are strided to be 15 timesteps, and the final 100 are strided to be 20.
+
+    If the stride is a string starting with "ddim", then the fixed striding
+    from the DDIM paper is used, and only one section is allowed.
+
+    :param num_timesteps: the number of diffusion steps in the original
+                          process to divide up.
+    :param section_counts: either a list of numbers, or a string containing
+                           comma-separated numbers, indicating the step count
+                           per section. As a special case, use "ddimN" where N
+                           is a number of steps to use the striding from the
+                           DDIM paper.
+    :return: a set of diffusion steps from the original process to use.
+    """
+    if isinstance(section_counts, str):
+        if section_counts.startswith("ddim"):
+            desired_count = int(section_counts[len("ddim"):])
+            for i in range(1, num_timesteps):
+                if len(range(0, num_timesteps, i)) == desired_count:
+                    return set(range(0, num_timesteps, i))
+            raise ValueError(
+                f"cannot create exactly {desired_count} steps with an integer stride"
+            )
+        section_counts = [int(x) for x in section_counts.split(",")]
+    size_per = num_timesteps // len(section_counts)
+    extra = num_timesteps % len(section_counts)
+    start_idx = 0
+    all_steps = []
+    for i, section_count in enumerate(section_counts):
+        size = size_per + (1 if i < extra else 0)
+        if size < section_count:
+            raise ValueError(
+                f"cannot divide section of {size} steps into {section_count}"
+            )
+        if section_count <= 1:
+            frac_stride = 1
+        else:
+            frac_stride = (size - 1) / (section_count - 1)
+        cur_idx = 0.0
+        taken_steps = []
+        for _ in range(section_count):
+            taken_steps.append(start_idx + round(cur_idx))
+            cur_idx += frac_stride
+        all_steps += taken_steps
+        start_idx += size
+    return set(all_steps)
+
+
+class _WrappedModel:
+    def __init__(self, model, timestep_map, rescale_timesteps, original_num_steps):
+        self.model = model
+        self.timestep_map = timestep_map
+        self.rescale_timesteps = rescale_timesteps
+        self.original_num_steps = original_num_steps
+
+    def __call__(self, x, ts, **kwargs):
+        # Map the spaced timestep indices back to the original process's indices.
+        map_tensor = th.tensor(self.timestep_map, device=ts.device, dtype=ts.dtype)
+        new_ts = map_tensor[ts]
+        if self.rescale_timesteps:
+            new_ts = new_ts.float() * (1000.0 / self.original_num_steps)
+        return self.model(x, new_ts, **kwargs)
+
+
+class _WrappedAutoregressiveModel:
+    def __init__(self, model, timestep_map, rescale_timesteps, original_num_steps):
+        self.model = model
+        self.timestep_map = timestep_map
+        self.rescale_timesteps = rescale_timesteps
+        self.original_num_steps = original_num_steps
+
+    def __call__(self, x, x0, ts, **kwargs):
+        map_tensor = th.tensor(self.timestep_map, device=ts.device, dtype=ts.dtype)
+        new_ts = map_tensor[ts]
+        if self.rescale_timesteps:
+            new_ts = new_ts.float() * (1000.0 / self.original_num_steps)
+        return self.model(x, x0, new_ts, **kwargs)
+
+
+def _extract_into_tensor(arr, timesteps, broadcast_shape):
+    """
+    Extract values from a 1-D numpy array for a batch of indices.
+
+    :param arr: the 1-D numpy array.
+    :param timesteps: a tensor of indices into the array to extract.
+    :param broadcast_shape: a larger shape of K dimensions with the batch
+                            dimension equal to the length of timesteps.
+    :return: a tensor of shape [batch_size, 1, ...] where the shape has K dims.
+    """
+    res = th.from_numpy(arr).to(device=timesteps.device)[timesteps].float()
+    while len(res.shape) < len(broadcast_shape):
+        res = res[..., None]
+    return res.expand(broadcast_shape)
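
To make the step-selection logic concrete, here is a small, self-contained usage sketch of `space_timesteps` as defined above (the expected values follow directly from the code; `SpacedDiffusion` would then be constructed with `use_timesteps=steps`):

```python
# 300 original steps, section counts [10, 15, 20]: each 100-step section
# contributes that many roughly evenly strided steps, 45 in total.
steps = space_timesteps(300, [10, 15, 20])
assert len(steps) == 45

# "ddimN" instead searches for a fixed integer stride, as in the DDIM paper:
ddim_steps = space_timesteps(1000, "ddim50")
assert ddim_steps == set(range(0, 1000, 20))  # stride 20 gives exactly 50 steps
```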
tortoise/utils/samples_generator.py ADDED
@@ -0,0 +1,51 @@
+import os
+
+# This script builds the sample webpage.
+
+if __name__ == '__main__':
+    result = "<html><head><title>These words were never spoken.</title></head><body><h1>Handpicked results</h1>"
+    for fv in os.listdir('../../results/favorites'):
+        url = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/favorites/{fv}'
+        result = result + f'<audio controls="" style="width: 600px;"><source src="{url}" type="audio/mp3"></audio><br>\n'
+
+    result = result + "<h1>Handpicked longform result:</h1>"
+    url = 'https://github.com/neonbjb/tortoise-tts/raw/main/results/favorite_riding_hood.mp3'
+    result = result + f'<audio controls="" style="width: 600px;"><source src="{url}" type="audio/mp3"></audio><br>\n'
+
+    result = result + "<h1>Compared to Tacotron2 (with the LJSpeech voice):</h1><table><th>Tacotron2+Waveglow</th><th>TorToiSe</th>"
+    for k in range(2, 5):
+        url1 = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/{k}-tacotron2.mp3'
+        url2 = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/tacotron_comparison/{k}-tortoise.mp3'
+        result = result + f'<tr><td><audio controls="" style="width: 300px;"><source src="{url1}" type="audio/mp3"></audio><br>\n</td>' \
+                          f'<td><audio controls="" style="width: 300px;"><source src="{url2}" type="audio/mp3"></audio><br>\n</td></tr>'
+    result = result + "</table>"
+
+    result = result + "<h1>Various spoken texts for all voices:</h1>"
+    voices = ['angie', 'daniel', 'deniro', 'emma', 'freeman', 'geralt', 'halle', 'jlaw', 'lj', 'myself',
+              'pat', 'snakes', 'tom', 'train_atkins', 'train_dotrice', 'train_kennard', 'weaver', 'william']
+    lines = ['<table><th>text</th>' + ''.join([f'<th>{v}</th>' for v in voices])]
+    line = '<tr><td>reference clip</td>'
+    for v in voices:
+        url = f'https://github.com/neonbjb/tortoise-tts/raw/main/voices/{v}/1.wav'
+        line = line + f'<td><audio controls="" style="width: 150px;"><source src="{url}" type="audio/mp3"></audio></td>'
+    line = line + "</tr>"
+    lines.append(line)
+    for txt in os.listdir('../../results/various/'):
+        if 'desktop' in txt:
+            continue
+        line = f'<tr><td>{txt}</td>'
+        for v in voices:
+            url = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/various/{txt}/{v}.mp3'
+            line = line + f'<td><audio controls="" style="width: 150px;"><source src="{url}" type="audio/mp3"></audio></td>'
+        line = line + "</tr>"
+        lines.append(line)
+    result = result + '\n'.join(lines) + "</table>"
+
+    result = result + "<h1>Longform result for all voices:</h1>"
+    for lf in os.listdir('../../results/riding_hood'):
+        url = f'https://github.com/neonbjb/tortoise-tts/raw/main/results/riding_hood/{lf}'
+        result = result + f'<audio controls="" style="width: 600px;"><source src="{url}" type="audio/mp3"></audio><br>\n'
+
+    result = result + "</body></html>"
+    with open('result.html', 'w', encoding='utf-8') as f:
+        f.write(result)
tortoise/utils/stft.py ADDED
@@ -0,0 +1,193 @@
+"""
+BSD 3-Clause License
+
+Copyright (c) 2017, Prem Seetharaman
+All rights reserved.
+
+* Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice,
+this list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice, this
+list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+* Neither the name of the copyright holder nor the names of its
+contributors may be used to endorse or promote products derived from this
+software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
+ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+
+import torch
+import numpy as np
+import torch.nn.functional as F
+from torch.autograd import Variable
+from scipy.signal import get_window
+from librosa.util import pad_center, tiny
+import librosa.util as librosa_util
+
+
+def window_sumsquare(window, n_frames, hop_length=200, win_length=800,
+                     n_fft=800, dtype=np.float32, norm=None):
+    """
+    # from librosa 0.6
+    Compute the sum-square envelope of a window function at a given hop length.
+
+    This is used to estimate modulation effects induced by windowing
+    observations in short-time fourier transforms.
+
+    Parameters
+    ----------
+    window : string, tuple, number, callable, or list-like
+        Window specification, as in `get_window`
+
+    n_frames : int > 0
+        The number of analysis frames
+
+    hop_length : int > 0
+        The number of samples to advance between frames
+
+    win_length : [optional]
+        The length of the window function. By default, this matches `n_fft`.
+
+    n_fft : int > 0
+        The length of each analysis frame.
+
+    dtype : np.dtype
+        The data type of the output
+
+    Returns
+    -------
+    wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))`
+        The sum-squared envelope of the window function
+    """
+    if win_length is None:
+        win_length = n_fft
+
+    n = n_fft + hop_length * (n_frames - 1)
+    x = np.zeros(n, dtype=dtype)
+
+    # Compute the squared window at the desired length
+    win_sq = get_window(window, win_length, fftbins=True)
+    win_sq = librosa_util.normalize(win_sq, norm=norm)**2
+    win_sq = librosa_util.pad_center(win_sq, n_fft)
+
+    # Fill the envelope
+    for i in range(n_frames):
+        sample = i * hop_length
+        x[sample:min(n, sample + n_fft)] += win_sq[:max(0, min(n_fft, n - sample))]
+    return x
+
+
+class STFT(torch.nn.Module):
+    """adapted from Prem Seetharaman's https://github.com/pseeth/pytorch-stft"""
+    def __init__(self, filter_length=800, hop_length=200, win_length=800,
+                 window='hann'):
+        super(STFT, self).__init__()
+        self.filter_length = filter_length
+        self.hop_length = hop_length
+        self.win_length = win_length
+        self.window = window
+        self.forward_transform = None
+        scale = self.filter_length / self.hop_length
+        fourier_basis = np.fft.fft(np.eye(self.filter_length))
+
+        cutoff = int((self.filter_length / 2 + 1))
+        fourier_basis = np.vstack([np.real(fourier_basis[:cutoff, :]),
+                                   np.imag(fourier_basis[:cutoff, :])])
+
+        forward_basis = torch.FloatTensor(fourier_basis[:, None, :])
+        inverse_basis = torch.FloatTensor(
+            np.linalg.pinv(scale * fourier_basis).T[:, None, :])
+
+        if window is not None:
+            assert filter_length >= win_length
+            # get window and zero center pad it to filter_length
+            fft_window = get_window(window, win_length, fftbins=True)
+            fft_window = pad_center(fft_window, size=filter_length)
+            fft_window = torch.from_numpy(fft_window).float()
+
+            # window the bases
+            forward_basis *= fft_window
+            inverse_basis *= fft_window
+
+        self.register_buffer('forward_basis', forward_basis.float())
+        self.register_buffer('inverse_basis', inverse_basis.float())
+
+    def transform(self, input_data):
+        num_batches = input_data.size(0)
+        num_samples = input_data.size(1)
+
+        self.num_samples = num_samples
+
+        # similar to librosa, reflect-pad the input
+        input_data = input_data.view(num_batches, 1, num_samples)
+        input_data = F.pad(
+            input_data.unsqueeze(1),
+            (int(self.filter_length / 2), int(self.filter_length / 2), 0, 0),
+            mode='reflect')
+        input_data = input_data.squeeze(1)
+
+        forward_transform = F.conv1d(
+            input_data,
+            Variable(self.forward_basis, requires_grad=False),
+            stride=self.hop_length,
+            padding=0)
+
+        cutoff = int((self.filter_length / 2) + 1)
+        real_part = forward_transform[:, :cutoff, :]
+        imag_part = forward_transform[:, cutoff:, :]
+
+        magnitude = torch.sqrt(real_part**2 + imag_part**2)
+        phase = torch.autograd.Variable(
+            torch.atan2(imag_part.data, real_part.data))
+
+        return magnitude, phase
+
+    def inverse(self, magnitude, phase):
+        recombine_magnitude_phase = torch.cat(
+            [magnitude*torch.cos(phase), magnitude*torch.sin(phase)], dim=1)
+
+        inverse_transform = F.conv_transpose1d(
+            recombine_magnitude_phase,
+            Variable(self.inverse_basis, requires_grad=False),
+            stride=self.hop_length,
+            padding=0)
+
+        if self.window is not None:
+            window_sum = window_sumsquare(
+                self.window, magnitude.size(-1), hop_length=self.hop_length,
+                win_length=self.win_length, n_fft=self.filter_length,
+                dtype=np.float32)
+            # remove modulation effects
+            approx_nonzero_indices = torch.from_numpy(
+                np.where(window_sum > tiny(window_sum))[0])
+            window_sum = torch.autograd.Variable(
+                torch.from_numpy(window_sum), requires_grad=False)
+            window_sum = window_sum.cuda() if magnitude.is_cuda else window_sum
+            inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices]
+
+            # scale by hop ratio
+            inverse_transform *= float(self.filter_length) / self.hop_length
+
+        inverse_transform = inverse_transform[:, :, int(self.filter_length/2):]
+        inverse_transform = inverse_transform[:, :, :-int(self.filter_length/2)]
+
+        return inverse_transform
+
+    def forward(self, input_data):
+        self.magnitude, self.phase = self.transform(input_data)
+        reconstruction = self.inverse(self.magnitude, self.phase)
+        return reconstruction
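
For orientation, a minimal round-trip through the `STFT` module above might look like this (a batch of random noise stands in for real audio; shapes are illustrative):

```python
import torch

stft = STFT(filter_length=800, hop_length=200, win_length=800, window='hann')
audio = torch.randn(1, 24000)             # [batch, samples], e.g. one second at 24kHz
magnitude, phase = stft.transform(audio)  # each [1, 401, n_frames]
recon = stft.inverse(magnitude, phase)    # [1, 1, 24000], an approximate reconstruction
```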
tortoise/utils/tokenizer.py ADDED
@@ -0,0 +1,187 @@
+import re
+
+import inflect
+import torch
+from tokenizers import Tokenizer
+from unidecode import unidecode
+
+
+# Regular expression matching whitespace:
+_whitespace_re = re.compile(r'\s+')
+
+
+# List of (regular expression, replacement) pairs for abbreviations:
+_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
+    ('mrs', 'misess'),
+    ('mr', 'mister'),
+    ('dr', 'doctor'),
+    ('st', 'saint'),
+    ('co', 'company'),
+    ('jr', 'junior'),
+    ('maj', 'major'),
+    ('gen', 'general'),
+    ('drs', 'doctors'),
+    ('rev', 'reverend'),
+    ('lt', 'lieutenant'),
+    ('hon', 'honorable'),
+    ('sgt', 'sergeant'),
+    ('capt', 'captain'),
+    ('esq', 'esquire'),
+    ('ltd', 'limited'),
+    ('col', 'colonel'),
+    ('ft', 'fort'),
+]]
+
+
+def expand_abbreviations(text):
+    for regex, replacement in _abbreviations:
+        text = re.sub(regex, replacement, text)
+    return text
+
+
+_inflect = inflect.engine()
+_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
+_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
+_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
+_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
+_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
+_number_re = re.compile(r'[0-9]+')
+
+
+def _remove_commas(m):
+    return m.group(1).replace(',', '')
+
+
+def _expand_decimal_point(m):
+    return m.group(1).replace('.', ' point ')
+
+
+def _expand_dollars(m):
+    match = m.group(1)
+    parts = match.split('.')
+    if len(parts) > 2:
+        return match + ' dollars'  # Unexpected format
+    dollars = int(parts[0]) if parts[0] else 0
+    cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
+    if dollars and cents:
+        dollar_unit = 'dollar' if dollars == 1 else 'dollars'
+        cent_unit = 'cent' if cents == 1 else 'cents'
+        return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
+    elif dollars:
+        dollar_unit = 'dollar' if dollars == 1 else 'dollars'
+        return '%s %s' % (dollars, dollar_unit)
+    elif cents:
+        cent_unit = 'cent' if cents == 1 else 'cents'
+        return '%s %s' % (cents, cent_unit)
+    else:
+        return 'zero dollars'
+
+
+def _expand_ordinal(m):
+    return _inflect.number_to_words(m.group(0))
+
+
+def _expand_number(m):
+    num = int(m.group(0))
+    if 1000 < num < 3000:
+        if num == 2000:
+            return 'two thousand'
+        elif 2000 < num < 2010:
+            return 'two thousand ' + _inflect.number_to_words(num % 100)
+        elif num % 100 == 0:
+            return _inflect.number_to_words(num // 100) + ' hundred'
+        else:
+            return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
+    else:
+        return _inflect.number_to_words(num, andword='')
+
+
+def normalize_numbers(text):
+    text = re.sub(_comma_number_re, _remove_commas, text)
+    text = re.sub(_pounds_re, r'\1 pounds', text)
+    text = re.sub(_dollars_re, _expand_dollars, text)
+    text = re.sub(_decimal_number_re, _expand_decimal_point, text)
+    text = re.sub(_ordinal_re, _expand_ordinal, text)
+    text = re.sub(_number_re, _expand_number, text)
+    return text
+
+
+def expand_numbers(text):
+    return normalize_numbers(text)
+
+
+def lowercase(text):
+    return text.lower()
+
+
+def collapse_whitespace(text):
+    return re.sub(_whitespace_re, ' ', text)
+
+
+def convert_to_ascii(text):
+    return unidecode(text)
+
+
+def basic_cleaners(text):
+    '''Basic pipeline that lowercases and collapses whitespace without transliteration.'''
+    text = lowercase(text)
+    text = collapse_whitespace(text)
+    return text
+
+
+def transliteration_cleaners(text):
+    '''Pipeline for non-English text that transliterates to ASCII.'''
+    text = convert_to_ascii(text)
+    text = lowercase(text)
+    text = collapse_whitespace(text)
+    return text
+
+
+def english_cleaners(text):
+    '''Pipeline for English text, including number and abbreviation expansion.'''
+    text = convert_to_ascii(text)
+    text = lowercase(text)
+    text = expand_numbers(text)
+    text = expand_abbreviations(text)
+    text = collapse_whitespace(text)
+    text = text.replace('"', '')
+    return text
+
+
+def lev_distance(s1, s2):
+    if len(s1) > len(s2):
+        s1, s2 = s2, s1
+
+    distances = range(len(s1) + 1)
+    for i2, c2 in enumerate(s2):
+        distances_ = [i2 + 1]
+        for i1, c1 in enumerate(s1):
+            if c1 == c2:
+                distances_.append(distances[i1])
+            else:
+                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
+        distances = distances_
+    return distances[-1]
+
+
+class VoiceBpeTokenizer:
+    def __init__(self, vocab_file='tortoise/data/tokenizer.json'):
+        if vocab_file is not None:
+            self.tokenizer = Tokenizer.from_file(vocab_file)
+
+    def preprocess_text(self, txt):
+        txt = english_cleaners(txt)
+        return txt
+
+    def encode(self, txt):
+        txt = self.preprocess_text(txt)
+        txt = txt.replace(' ', '[SPACE]')
+        return self.tokenizer.encode(txt).ids
+
+    def decode(self, seq):
+        if isinstance(seq, torch.Tensor):
+            seq = seq.cpu().numpy()
+        txt = self.tokenizer.decode(seq, skip_special_tokens=False).replace(' ', '')
+        txt = txt.replace('[SPACE]', ' ')
+        txt = txt.replace('[STOP]', '')
+        txt = txt.replace('[UNK]', '')
+        return txt
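
As a quick illustration of the cleaning pipeline above (the BPE tokenizer itself needs `tortoise/data/tokenizer.json` on disk, so only the cleaners are exercised here; the exact wording of the expansions comes from `inflect`):

```python
print(english_cleaners('Dr. Smith paid $2.50 on the 3rd.'))
# -> 'doctor smith paid 2 dollars, 50 cents on the third.'
```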
tortoise/utils/typical_sampling.py ADDED
@@ -0,0 +1,33 @@
+import torch
+from transformers import LogitsWarper
+
+
+class TypicalLogitsWarper(LogitsWarper):
+    def __init__(self, mass: float = 0.9, filter_value: float = -float("Inf"), min_tokens_to_keep: int = 1):
+        self.filter_value = filter_value
+        self.mass = mass
+        self.min_tokens_to_keep = min_tokens_to_keep
+
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+        # calculate entropy
+        normalized = torch.nn.functional.log_softmax(scores, dim=-1)
+        p = torch.exp(normalized)
+        ent = -(normalized * p).nansum(-1, keepdim=True)
+
+        # shift and sort
+        shifted_scores = torch.abs((-normalized) - ent)
+        sorted_scores, sorted_indices = torch.sort(shifted_scores, descending=False)
+        sorted_logits = scores.gather(-1, sorted_indices)
+        cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
+
+        # Remove tokens with cumulative mass above the threshold
+        last_ind = (cumulative_probs < self.mass).sum(dim=1)
+        last_ind[last_ind < 0] = 0
+        sorted_indices_to_remove = sorted_scores > sorted_scores.gather(1, last_ind.view(-1, 1))
+        if self.min_tokens_to_keep > 1:
+            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
+            sorted_indices_to_remove[..., : self.min_tokens_to_keep] = 0
+        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
+
+        scores = scores.masked_fill(indices_to_remove, self.filter_value)
+        return scores
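
A sketch of how a warper like this can be plugged into HuggingFace generation. `LogitsProcessorList` and the `logits_processor` argument are standard `transformers` APIs; `gpt2` is only a placeholder model for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList

tok = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tok('The tortoise said', return_tensors='pt')
out = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=20,
    logits_processor=LogitsProcessorList([TypicalLogitsWarper(mass=0.9)]),
)
print(tok.decode(out[0], skip_special_tokens=True))
```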
tortoise/utils/wav2vec_alignment.py ADDED
@@ -0,0 +1,145 @@
+import re
+
+import torch
+import torchaudio
+from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor, Wav2Vec2CTCTokenizer
+
+from tortoise.utils.audio import load_audio
+
+
+def max_alignment(s1, s2, skip_character='~', record=None):
+    """
+    A clever function that aligns s1 to s2 as best it can. Wherever a character from s1 is not found in s2, a '~' is
+    used to replace that character.
+
+    Finally got to use my DP skills!
+    """
+    if record is None:
+        record = {}
+    assert skip_character not in s1, f"Found the skip character {skip_character} in the provided string, {s1}"
+    if len(s1) == 0:
+        return ''
+    if len(s2) == 0:
+        return skip_character * len(s1)
+    if s1 == s2:
+        return s1
+    if s1[0] == s2[0]:
+        return s1[0] + max_alignment(s1[1:], s2[1:], skip_character, record)
+
+    take_s1_key = (len(s1), len(s2) - 1)
+    if take_s1_key in record:
+        take_s1, take_s1_score = record[take_s1_key]
+    else:
+        take_s1 = max_alignment(s1, s2[1:], skip_character, record)
+        take_s1_score = len(take_s1.replace(skip_character, ''))
+        record[take_s1_key] = (take_s1, take_s1_score)
+
+    take_s2_key = (len(s1) - 1, len(s2))
+    if take_s2_key in record:
+        take_s2, take_s2_score = record[take_s2_key]
+    else:
+        take_s2 = max_alignment(s1[1:], s2, skip_character, record)
+        take_s2_score = len(take_s2.replace(skip_character, ''))
+        record[take_s2_key] = (take_s2, take_s2_score)
+
+    return take_s1 if take_s1_score > take_s2_score else skip_character + take_s2
+
+
+class Wav2VecAlignment:
+    """
+    Uses wav2vec2 to perform audio<->text alignment.
+    """
+    def __init__(self):
+        self.model = Wav2Vec2ForCTC.from_pretrained("jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli").cpu()
+        self.feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-960h")
+        self.tokenizer = Wav2Vec2CTCTokenizer.from_pretrained('jbetker/tacotron_symbols')
+
+    def align(self, audio, expected_text, audio_sample_rate=24000):
+        orig_len = audio.shape[-1]
+
+        with torch.no_grad():
+            self.model = self.model.cuda()
+            audio = audio.to('cuda')
+            audio = torchaudio.functional.resample(audio, audio_sample_rate, 16000)
+            clip_norm = (audio - audio.mean()) / torch.sqrt(audio.var() + 1e-7)
+            logits = self.model(clip_norm).logits
+            self.model = self.model.cpu()
+
+        logits = logits[0]
+        pred_string = self.tokenizer.decode(logits.argmax(-1).tolist())
+
+        fixed_expectation = max_alignment(expected_text, pred_string)
+        w2v_compression = orig_len // logits.shape[0]
+        expected_tokens = self.tokenizer.encode(fixed_expectation)
+        expected_chars = list(fixed_expectation)
+        if len(expected_tokens) == 1:
+            return [0]  # The alignment is simple; there is only one token.
+        expected_tokens.pop(0)  # The first token is a given.
+        expected_chars.pop(0)
+
+        alignments = [0]
+        def pop_till_you_win():
+            if len(expected_tokens) == 0:
+                return None
+            popped = expected_tokens.pop(0)
+            popped_char = expected_chars.pop(0)
+            while popped_char == '~':
+                alignments.append(-1)
+                if len(expected_tokens) == 0:
+                    return None
+                popped = expected_tokens.pop(0)
+                popped_char = expected_chars.pop(0)
+            return popped
+
+        next_expected_token = pop_till_you_win()
+        for i, logit in enumerate(logits):
+            top = logit.argmax()
+            if next_expected_token == top:
+                alignments.append(i * w2v_compression)
+                if len(expected_tokens) > 0:
+                    next_expected_token = pop_till_you_win()
+                else:
+                    break
+
+        pop_till_you_win()
+        assert len(expected_tokens) == 0, "This shouldn't happen. My coding sucks."
+
+        # Now fix up alignments. Anything with -1 should be interpolated.
+        alignments.append(orig_len)  # This'll get removed but makes the algorithm below more readable.
+        for i in range(len(alignments)):
+            if alignments[i] == -1:
+                for j in range(i+1, len(alignments)):
+                    if alignments[j] != -1:
+                        next_found_token = j
+                        break
+                for j in range(i, next_found_token):
+                    gap = alignments[next_found_token] - alignments[i-1]
+                    alignments[j] = (j-i+1) * gap // (next_found_token-i+1) + alignments[i-1]
+
+        return alignments[:-1]
+
+    def redact(self, audio, expected_text, audio_sample_rate=24000):
+        if '[' not in expected_text:
+            return audio
+        splitted = expected_text.split('[')
+        fully_split = [splitted[0]]
+        for spl in splitted[1:]:
+            assert ']' in spl, 'Every "[" character must be paired with a "]" with no nesting.'
+            fully_split.extend(spl.split(']'))
+
+        # At this point, fully_split is a list of strings, with every other string being something that should be redacted.
+        non_redacted_intervals = []
+        last_point = 0
+        for i in range(len(fully_split)):
+            if i % 2 == 0:
+                end_interval = max(0, last_point + len(fully_split[i]) - 1)
+                non_redacted_intervals.append((last_point, end_interval))
+            last_point += len(fully_split[i])
+
+        bare_text = ''.join(fully_split)
+        alignments = self.align(audio, bare_text, audio_sample_rate)
+
+        output_audio = []
+        for nri in non_redacted_intervals:
+            start, stop = nri
+            output_audio.append(audio[:, alignments[start]:alignments[stop]])
+        return torch.cat(output_audio, dim=-1)
+
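To see what `max_alignment` produces, a tiny self-contained example: characters of the first string that cannot be matched in order within the second are replaced by the skip character.

```python
print(max_alignment('hello world', 'held the world'))
# -> 'hel~~ world'  ('h', 'e', 'l' and ' world' match in order; the extra 'l' and 'o' do not)
```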
tortoise_tts.ipynb CHANGED
@@ -40,7 +40,8 @@
   "source": [
   "!git clone https://github.com/neonbjb/tortoise-tts.git\n",
   "%cd tortoise-tts\n",
- "!pip install -r requirements.txt"
+ "!pip3 install -r requirements.txt\n",
+ "!python3 setup.py install"
   ]
   },
   {
@@ -54,8 +55,8 @@
   "\n",
   "import IPython\n",
   "\n",
- "from api import TextToSpeech\n",
- "from utils.audio import load_audio, get_voices\n",
+ "from tortoise.api import TextToSpeech\n",
+ "from tortoise.utils.audio import load_audio, load_voice, load_voices\n",
   "\n",
   "# This will download all the models used by Tortoise from the HF hub.\n",
   "tts = TextToSpeech()"
@@ -66,20 +67,6 @@
   "execution_count": null,
   "outputs": []
   },
- {
- "cell_type": "code",
- "source": [
- "# List all the voices available. These are just some random clips I've gathered\n",
- "# from the internet as well as a few voices from the training dataset.\n",
- "# Feel free to add your own clips to the voices/ folder.\n",
- "%ls voices"
- ],
- "metadata": {
- "id": "SSleVnRAiEE2"
- },
- "execution_count": null,
- "outputs": []
- },
   {
   "cell_type": "code",
   "source": [
@@ -94,8 +81,6 @@
   "Though as for that the passing there\n",
   "Had worn them really about the same,\"\"\"\n",
   "\n",
- "# Pick one of the voices from above\n",
- "voice = 'train_dotrice'\n",
   "# Pick a \"preset mode\" to determine quality. Options: {\"ultra_fast\", \"fast\" (default), \"standard\", \"high_quality\"}. See docs in api.py\n",
   "preset = \"fast\""
   ],
@@ -108,15 +93,32 @@
   {
   "cell_type": "code",
   "source": [
- "# Fetch the voice references and forward execute!\n",
- "voices = get_voices()\n",
- "cond_paths = voices[voice]\n",
- "conds = []\n",
- "for cond_path in cond_paths:\n",
- "  c = load_audio(cond_path, 22050)\n",
- "  conds.append(c)\n",
+ "# Tortoise will attempt to mimic voices you provide. It comes pre-packaged\n",
+ "# with some voices you might recognize.\n",
+ "\n",
+ "# Let's list all the voices available. These are just some random clips I've gathered\n",
+ "# from the internet as well as a few voices from the training dataset.\n",
+ "# Feel free to add your own clips to the voices/ folder.\n",
+ "%ls tortoise/voices\n",
+ "\n",
+ "IPython.display.Audio('tortoise/voices/tom/1.wav')"
+ ],
+ "metadata": {
+ "id": "SSleVnRAiEE2"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Pick one of the voices from the output above\n",
+ "voice = 'tom'\n",
   "\n",
- "gen = tts.tts_with_preset(text, conds, preset)\n",
+ "# Load it and send it through Tortoise.\n",
+ "voice_samples, conditioning_latents = load_voice(voice)\n",
+ "gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, \n",
+ "                          preset=preset)\n",
   "torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)\n",
   "IPython.display.Audio('generated.wav')"
   ],
@@ -129,19 +131,29 @@
   {
   "cell_type": "code",
   "source": [
- "# You can add as many conditioning voices as you want together. Combining\n",
- "# clips from multiple voices takes the mean of the latent space for all\n",
- "# voices. This creates a novel voice that is a combination of the two inputs.\n",
+ "# Tortoise can also generate speech using a random voice. The voice changes each time you execute this!\n",
+ "# (Note: random voices can be prone to strange utterances)\n",
+ "gen = tts.tts_with_preset(text, voice_samples=None, conditioning_latents=None, preset=preset)\n",
+ "torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)\n",
+ "IPython.display.Audio('generated.wav')"
+ ],
+ "metadata": {
+ "id": "16Xs2SSC3BXa"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# You can also combine conditioning voices. Combining voices produces a new voice\n",
+ "# with traits from all the parents.\n",
   "#\n",
   "# Lets see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:\n",
- "conds = []\n",
- "for v in ['pat', 'william']:\n",
- "  cond_paths = voices[v]\n",
- "  for cond_path in cond_paths:\n",
- "    c = load_audio(cond_path, 22050)\n",
- "    conds.append(c)\n",
+ "voice_samples, conditioning_latents = load_voices(['pat', 'william'])\n",
   "\n",
- "gen = tts.tts_with_preset(\"They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.\", conds, preset)\n",
+ "gen = tts.tts_with_preset(\"They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.\", \n",
+ "                          voice_samples=None, conditioning_latents=None, preset=preset)\n",
   "torchaudio.save('captain_kirkard.wav', gen.squeeze(0).cpu(), 24000)\n",
   "IPython.display.Audio('captain_kirkard.wav')"
   ],
@@ -150,6 +162,24 @@
   },
   "execution_count": null,
   "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "del tts  # Will break other cells, but necessary to conserve RAM if you want to run this cell.\n",
+ "\n",
+ "# Tortoise comes with some scripts that do a lot of the lifting for you. For example,\n",
+ "# read.py will read a text file for you.\n",
+ "!python3 tortoise/read.py --voice=train_atkins --textfile=tortoise/data/riding_hood.txt --preset=ultra_fast --output_path=.\n",
+ "\n",
+ "IPython.display.Audio('train_atkins/combined.wav')\n",
+ "# This will take a while..."
+ ],
+ "metadata": {
+ "id": "t66yqWgu68KL"
+ },
+ "execution_count": null,
+ "outputs": []
   }
   ]
   }
  }
tortoise_v2_examples.html CHANGED
The diff for this file is too large to render. See raw diff