Welcome to Tortoise! 🐢🐢🐢🐢

Before you begin, I **strongly** recommend you turn on a GPU runtime.

There's a reason this is called "Tortoise" - this model takes up to a minute to perform inference for a single sentence on a GPU. Expect waits on the order of hours on a CPU.
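
If you are unsure whether a GPU runtime is active, a quick check along the following lines may help. This is a minimal sketch, not part of Tortoise: the helper name `gpu_available` is hypothetical, and it assumes PyTorch is the framework in use (as it is in the cells below). It returns `False` rather than failing when PyTorch is not installed yet.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device.

    Hypothetical helper for this notebook; not part of the Tortoise API.
    """
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so no GPU can be used through it.
        return False
    import torch

    return torch.cuda.is_available()


if gpu_available():
    print("GPU runtime detected.")
else:
    print("No GPU detected; expect very slow inference.")
```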

In [None]:
!git clone https://github.com/neonbjb/tortoise-tts.git
%cd tortoise-tts
!pip install -r requirements.txt

In [None]:
# Imports used through the rest of the notebook.
import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F

from api import TextToSpeech
from utils.audio import load_audio, get_voices

# This will download all the models used by Tortoise from the HF hub.
tts = TextToSpeech()

In [None]:
# List all the voices available. These are just some random clips I've gathered
# from the internet as well as a few voices from the training dataset.
# Feel free to add your own clips to the voices/ folder.
%ls voices

In [None]:
# This is the text that will be spoken.
text = "Joining two modalities results in a surprising increase in generalization! What would happen if we combined them all?"

# Here's something for the poetically inclined (assign it to `text` to use it):
"""
Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,"""

# Pick one of the voices from above
voice = 'dotrice'
# Pick a "preset mode" to determine quality. Options: {"ultra_fast", "fast" (default), "standard", "high_quality"}. See docs in api.py
preset = "fast"

In [None]:
# Fetch the voice references and forward execute!
voices = get_voices()
cond_paths = voices[voice]
conds = []
for cond_path in cond_paths:
    c = load_audio(cond_path, 22050)
    conds.append(c)

gen = tts.tts_with_preset(text, conds, preset)
torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)

In [None]:
# You can add as many conditioning voices as you want together. Combining
# clips from multiple voices takes the mean of the latent space for all
# voices. This creates a novel voice that is a combination of the two inputs.
#
# Let's see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:
conds = []
for v in ['pat', 'william']:
    cond_paths = voices[v]
    for cond_path in cond_paths:
        c = load_audio(cond_path, 22050)
        conds.append(c)

gen = tts.tts_with_preset("They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.", conds, preset)
torchaudio.save('captain_kirkard.wav', gen.squeeze(0).cpu(), 24000)
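
The comment in the cell above says that combining clips takes the mean of the latent space across voices. The toy sketch below illustrates only that averaging idea using plain Python lists. It is not Tortoise's actual code: the helper `mean_latent` and the three-element vectors are made up for illustration, and real conditioning latents are much larger tensors computed inside the model.

```python
from typing import List, Sequence


def mean_latent(latents: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors (hypothetical helper)."""
    if not latents:
        raise ValueError("need at least one latent vector")
    dim = len(latents[0])
    if any(len(v) != dim for v in latents):
        raise ValueError("all latent vectors must have the same length")
    # Average each coordinate across all of the input vectors.
    return [sum(v[i] for v in latents) / len(latents) for i in range(dim)]


# Made-up stand-ins for two voices' latents.
voice_a = [0.0, 2.0, 4.0]
voice_b = [2.0, 4.0, 0.0]

blended = mean_latent([voice_a, voice_b])
print(blended)  # [1.0, 3.0, 2.0]
```

Each coordinate of the blended vector lies midway between the two inputs, which is the sense in which the generated voice is described as a combination of the source voices.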