Overview: This project implements a Text-to-Speech (TTS) system using the SpeechT5 model fine-tuned on the LJ Speech dataset. The goal is to generate natural-sounding speech from text input, with particular attention to English technical vocabulary, using state-of-the-art machine learning techniques.
Link to the code: https://github.com/lithish262004/TTS_FINETUNE_ENGLISH
Methodology: This section outlines the detailed steps taken for model selection, dataset preparation, and fine-tuning of the Text-to-Speech (TTS) system using the SpeechT5 model.
1. Model Selection
Model Choice: SpeechT5
SpeechT5 is a transformer-based model designed for TTS tasks. It provides a balance of performance and efficiency, making it suitable for generating high-quality speech. We selected SpeechT5 due to its ability to handle various speech generation tasks, including text-to-speech synthesis, making it versatile for our project.
2. Dataset Preparation
Dataset: LJ Speech
The LJ Speech dataset comprises approximately 13,100 audio clips of a single speaker reading passages from various texts. It includes:
- Audio Files: WAV-format audio samples.
- Transcripts: Text files containing the spoken text for each audio sample.
Preparation Steps:
Data Download: Download the LJ Speech dataset from keithito.com.
Preprocessing:
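The resampling and normalization described in this step can be sketched roughly as follows. This is a minimal illustration: naive linear-interpolation resampling stands in for a proper resampler (such as librosa's or torchaudio's), and the test clip is synthetic.

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average channels down to a single mono channel."""
    return audio if audio.ndim == 1 else audio.mean(axis=0)

def resample(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampling; a real pipeline would
    use librosa or torchaudio, but this shows the idea."""
    n_out = int(round(len(audio) * target_sr / orig_sr))
    old_t = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(new_t, old_t, audio)

def normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Peak-normalize so every clip has a consistent maximum level."""
    max_amp = np.abs(audio).max()
    return audio if max_amp == 0 else audio * (peak / max_amp)

# Synthetic 1-second clip at 22.05 kHz, standing in for an LJ Speech WAV.
clip = np.sin(np.linspace(0, 440 * 2 * np.pi, 22050))
processed = normalize(resample(to_mono(clip), orig_sr=22050))
print(len(processed))  # 16000 samples, i.e. 1 second at the target rate
```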
- Audio Processing: Convert all audio files to a consistent format (16,000 Hz sample rate, mono channel). Normalize audio levels to ensure consistent volume across samples.
- Text Cleaning: Remove any extraneous characters or formatting from the transcripts to ensure clean input for the model. Optionally, apply phonetic transcriptions for improved pronunciation.
Alignment:
Generate a mapping between audio files and their corresponding text transcripts, ensuring that each audio clip can be paired with its correct spoken text.
Train-Validation Split:
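One minimal way to implement such a split (90% training / 10% validation, with a fixed seed for reproducibility) is a deterministic shuffle. The clip IDs below are hypothetical placeholders for LJ Speech entries, not actual dataset contents:

```python
import random

def train_val_split(items, val_fraction=0.1, seed=42):
    """Shuffle deterministically, then carve off the last
    val_fraction of items as the validation set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))
    return items[:-n_val], items[-n_val:]

# Hypothetical (clip_id, transcript) pairs standing in for the ~13,100 clips.
pairs = [(f"LJ-{i:05d}", f"transcript {i}") for i in range(13100)]
train, val = train_val_split(pairs)
print(len(train), len(val))  # 11790 1310
```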
Split the dataset into training and validation sets (e.g., 90% training, 10% validation) to evaluate model performance during and after fine-tuning.
3. Fine-Tuning
Fine-Tuning Process:
Environment Setup:
Ensure that all necessary libraries and dependencies (e.g., PyTorch, Transformers, NumPy) are installed as specified in the requirements.txt file.
Fine-Tuning Script:
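The overall shape of such a script — iterate over the data for several epochs, compute a loss, and update the weights — can be illustrated with a deliberately tiny stand-in model: plain gradient descent on a single parameter. This is only a structural sketch; the actual script loads the SpeechT5 model and uses its loss instead.

```python
# Toy illustration of a training loop's structure: a one-parameter
# "model" fit by gradient descent. A real fine-tuning script would
# load SpeechT5 via the Transformers library and use its loss.
data = [(x, 2.0 * x) for x in range(1, 9)]  # inputs paired with targets (true slope = 2)
weight = 0.0          # the single "model weight"
learning_rate = 0.01
epochs = 10

for epoch in range(epochs):
    epoch_loss = 0.0
    for x, target in data:                # one example per "batch"
        pred = weight * x                 # forward pass
        error = pred - target
        epoch_loss += error ** 2          # squared-error loss
        weight -= learning_rate * 2 * error * x  # gradient step

print(round(weight, 2))  # converges to the true slope, 2.0
```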
- Model loading: Load the pre-trained SpeechT5 model.
- Data loading: Use a data loader to read audio-text pairs for training.
- Training loop: Implement a loop that iterates over the training dataset for a specified number of epochs, updating model weights based on the loss calculated from predictions.
Hyperparameter Configuration:
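The hyperparameters used here might be collected into a single configuration mapping. The values come from this section; the key names are illustrative, not the exact signature of any Trainer API:

```python
# Fine-tuning hyperparameters from this report; key names are
# illustrative, not an exact Trainer argument list.
config = {
    "learning_rate": 1e-5,
    "batch_size": 16,
    "num_epochs": 10,       # the run used roughly 10 to 11 epochs
    "sample_rate": 16000,   # matches the audio preprocessing step
}
print(config["learning_rate"])  # 1e-05
```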
Set hyperparameters such as learning rate, batch size, and the number of epochs. Commonly used values might include:
- Learning Rate: 1e-5
- Batch Size: 16
- Epochs: 10 to 11
Monitoring:
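Early stopping of the kind described here can be reduced to a small patience check over the logged validation losses. This is a sketch; the `patience` knob is an assumed parameter, not a value from the original run:

```python
def should_stop(val_losses, patience=2):
    """Stop once validation loss has failed to improve on its best
    value for `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])
    return all(loss >= best for loss in val_losses[-patience:])

# Validation losses still decreasing, so training continues.
history = [0.3895, 0.3792, 0.3763, 0.3751]
print(should_stop(history))  # False
```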
Monitor training loss and validation metrics during the fine-tuning process to prevent overfitting. Use techniques like early stopping if validation loss starts to increase.
Model Saving:
After fine-tuning, save the trained model and any associated artifacts (e.g., tokenizer) for later use in generating speech.
4. Evaluation
Post-Fine-Tuning Evaluation:
After fine-tuning, evaluate the model using the validation dataset. Metrics for evaluation may include:
- Mean Opinion Score (MOS): A subjective score based on human evaluations of audio quality.
- Alignment and accuracy: Check how well the generated speech aligns with the input text.
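Computing a MOS from listener studies is just averaging the ratings on the 1-5 scale. The ratings below are hypothetical, for illustration only, not results from this project's listener study:

```python
# MOS is the mean of listener ratings on a 1 (poor) to 5 (excellent) scale.
# These ratings are hypothetical, purely for illustration.
ratings = [4, 5, 4, 3, 4, 5, 4]
mos = sum(ratings) / len(ratings)
print(round(mos, 2))  # 4.14
```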
TRAINING PROGRESS:
Step Training Loss Validation Loss
1000 0.437300 0.389510
2000 0.411300 0.379249
3000 0.415100 0.376344
4000 0.416200 0.375081
Environment and Dependencies:
- Transformers: 4.44.2
- PyTorch: 2.4.1+cu121
- Datasets: 3.0.1
- Tokenizers: 0.19.1
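These versions could be pinned in a requirements.txt file for reproducibility (note that a `+cu121` build of torch is typically installed from PyTorch's CUDA wheel index rather than plain PyPI):

```
transformers==4.44.2
torch==2.4.1+cu121
datasets==3.0.1
tokenizers==0.19.1
```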
RESULTS:
In this section, we present the results of our Text-to-Speech (TTS) system, including both objective and subjective evaluations. We tested the model's performance on English technical speech.
Objective Evaluations
Metrics Used:
- Mean Opinion Score (MOS): A numerical measure of perceived audio quality, usually rated on a scale from 1 (poor) to 5 (excellent).
Subjective Evaluations
Subjective evaluations were conducted through listener studies, in which participants rated the audio samples generated by the model. The evaluations focused on clarity, naturalness, and overall satisfaction.
English Technical Speech Feedback:
- Clarity: Most listeners appreciated the clarity of the speech, especially in technical contexts.
- Naturalness: While the speech was generally natural, some participants felt that the intonation could be improved to sound more conversational.
- Comments: "Very clear for technical topics, but sometimes feels robotic."
Challenges and Solutions:
Dataset Challenges: Limited availability of high-quality English speech data containing technical terms.
Conclusion:
Successfully fine-tuned SpeechT5 for the English language
Maintained high quality while optimizing performance
Future Improvements:
Expand dataset with more diverse speakers
Further optimize inference speed
Recommendations:
Regular model retraining with expanded datasets
Integration of automated quality assessment tools