Ratan Tata SpeechT5 Voice Cloning Model
This model is a text-to-speech (TTS) system built on the SpeechT5 architecture and trained on the Ratan Tata TTS Dataset to generate high-quality synthetic speech resembling the voice of Ratan Tata. The dataset and model pay tribute to his legacy by preserving his voice.
Model Information
- Model Architecture: SpeechT5 (Text-to-Speech)
- Training Dataset: Ratan Tata TTS Dataset (English)
- Checkpoint: 60,000 training steps
- Framework: PyTorch
- Model Size: Approximately 1.9 GB
- License: OpenRAIL
Dataset Summary
This model was trained on over 2,800 seconds (~48 minutes) of high-quality speech samples from Ratan Tata, with detailed transcriptions for each audio file. The audio data was pre-processed, converted to a uniform format, and aligned with the corresponding text to ensure optimal training performance.
Model Performance
- Voice Quality: The model replicates the unique tone, cadence, and voice texture of Ratan Tata with high accuracy, making it suitable for various voice cloning applications.
- Sample Rate: 16 kHz (consistent with the training data)
- Audio Channels: Mono
- Bit Depth: 16-bit
- Precision: High-quality synthesis using SpeechT5
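When preparing reference clips or checking generated files, it can help to confirm that a WAV file matches the format listed above. The following is a minimal sketch using only Python's standard `wave` module, which reads PCM WAV files. The helper names `describe_wav` and `matches_model_format` are illustrative assumptions and are not part of the model's tooling.

```python
import wave

# Target format taken from the list above: 16 kHz, mono, 16-bit PCM.
TARGET_SAMPLE_RATE = 16000
TARGET_CHANNELS = 1
TARGET_BIT_DEPTH = 16


def describe_wav(path):
    """Return (sample_rate, channels, bit_depth, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        bit_depth = wav_file.getsampwidth() * 8  # bytes per sample -> bits
        duration = wav_file.getnframes() / sample_rate
    return sample_rate, channels, bit_depth, duration


def matches_model_format(path):
    """Hypothetical helper: True if the file is 16 kHz, mono, 16-bit."""
    sample_rate, channels, bit_depth, _ = describe_wav(path)
    return (
        sample_rate == TARGET_SAMPLE_RATE
        and channels == TARGET_CHANNELS
        and bit_depth == TARGET_BIT_DEPTH
    )
```

A 44.1 kHz stereo recording, for example, would make `matches_model_format` return `False`, which signals that the file needs resampling and downmixing before it is used as a reference clip.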
How to Use the Model
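You can use this model for TTS and voice synthesis tasks, and integrate it into projects that need to generate Ratan Tata’s voice from text. The example below loads the checkpoint with the Hugging Face Transformers SpeechT5 classes, extracts a speaker embedding from a reference recording with SpeechBrain, and synthesizes the input text in short chunks.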
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from speechbrain.pretrained import EncoderClassifier
from IPython.display import Audio
from datasets import load_dataset
import noisereduce as nr
import soundfile as sf
import os, torchaudio
import numpy as np
import torch
# Load the processor and model
processor = SpeechT5Processor.from_pretrained("checkpoint-60000") # Replace with the model folder
processor.tokenizer.split_special_tokens = True
model = SpeechT5ForTextToSpeech.from_pretrained("checkpoint-60000") # Replace with the model folder
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
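# Load a generic speaker embedding (x-vector) from the CMU ARCTIC dataset.
# Note: this is only a placeholder; it is overwritten below by the embedding
# extracted from the Ratan Tata reference recording.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)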
# Load the speaker model
spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
device = "cuda" if torch.cuda.is_available() else "cpu"
speaker_model = EncoderClassifier.from_hparams(
    source=spk_model_name,
    run_opts={"device": device},
    savedir=os.path.join("/tmp", spk_model_name),
)
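# Load the Ratan Tata reference recording and extract its speaker embedding
signal, fs = torchaudio.load('wavs/converted_ratan_tata_tts_200.wav') # Replace with a Ratan Tata voice file
if fs != 16000:  # the x-vector model expects 16 kHz audio
    signal = torchaudio.functional.resample(signal, orig_freq=fs, new_freq=16000)
speaker_embeddings = speaker_model.encode_batch(signal)
# L2-normalize the embedding, then reshape it to (1, 512) for generate_speech
speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2).squeeze().cpu().numpy()
speaker_embeddings = torch.tensor(np.array([speaker_embeddings]))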
# Define input text
input_text = '''
This is Generated Audio.
India, a land of ancient wisdom and boundless potential, stands at the cusp of a new era. Our youth, the vibrant heartbeat of our nation, hold the key to unlocking this potential...
'''
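# Split text into chunks of at most max_length characters, breaking only between words
def split_text_by_length(text, max_length=60):
    words = text.split()
    result = []
    current_line = []
    for word in words:
        # Start a new chunk when adding this word would exceed the limit;
        # the current_line check avoids emitting an empty chunk for an overlong first word
        if current_line and len(' '.join(current_line + [word])) > max_length:
            result.append(' '.join(current_line))
            current_line = [word]
        else:
            current_line.append(word)
    if current_line:
        result.append(' '.join(current_line))
    return result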
splited_text = split_text_by_length(input_text, max_length=80)
# Generate speech for each text chunk and apply noise reduction
all_speech = []
for i in splited_text:
    inputs = processor(text=i, return_tensors="pt")
    speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    if isinstance(speech_chunk, torch.Tensor):
        speech_chunk = speech_chunk.cpu().numpy()
    reduced_noise_chunk = nr.reduce_noise(y=speech_chunk, sr=16000) # assuming 16kHz sample rate
    all_speech.append(reduced_noise_chunk)
# Concatenate all speech chunks
concatenated_speech = np.concatenate(all_speech)
# Play the final audio (renders an audio player in a Jupyter/IPython notebook)
Audio(concatenated_speech, rate=16000)
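Outside a notebook, the synthesized audio usually needs to be written to disk instead of played inline. The sketch below writes floating-point samples in the range [-1, 1] as a 16 kHz, mono, 16-bit PCM WAV file using only the standard library. The helper name `save_wav_16bit` and the output file name are illustrative assumptions. Since `soundfile` is already imported above, `sf.write("output.wav", concatenated_speech, 16000)` is an alternative route.

```python
import array
import sys
import wave


def save_wav_16bit(samples, path, sample_rate=16000):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        # Clip to the valid range so out-of-range values do not overflow
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
```

With the script above, a call such as `save_wav_16bit(concatenated_speech, "output.wav")` would store the result. The per-sample loop keeps the sketch dependency-free but is slow for long recordings, so a vectorized conversion is preferable for large outputs.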