speecht5_finetuned_commonvoice_id
This model is a fine-tuned version of microsoft/speecht5_tts on the mozilla-foundation/common_voice_16_1 dataset. It achieves the following results on the evaluation set:
- Loss: 0.4675
How to use/inference
Follow the example below and adapt it to your needs.
# ft_t5_id_inference.py
import sounddevice as sd
import torch
import torchaudio
from datasets import Audio, load_dataset
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    SpeechT5Processor,
)
# local helper module (utils.py); it must provide create_speaker_embedding
from utils import create_speaker_embedding
# load the dataset and the fine-tuned model
dataset = load_dataset(
    "mozilla-foundation/common_voice_16_1", "id", split="test")
model = SpeechT5ForTextToSpeech.from_pretrained(
    "Bagus/speecht5_finetuned_commonvoice_id")
# load the processor from the base checkpoint
checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)
# resample the audio to the sampling rate expected by the processor
sampling_rate = processor.feature_extractor.sampling_rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
def prepare_dataset(example):
    audio = example["audio"]
    example = processor(
        text=example["sentence"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )
    # strip off the batch dimension
    example["labels"] = example["labels"][0]
    # use SpeechBrain to obtain x-vector
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])
    return example
# prepare the speaker embeddings from the dataset and text
example = prepare_dataset(dataset[30])
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)
# prepare text to be converted to speech
text = "Saya suka baju yang berwarna merah tua."
inputs = processor(text=text, return_tensors="pt")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
speech = model.generate_speech(
    inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
# play the audio; convert the tensor to a NumPy array for sounddevice
sd.play(speech.detach().cpu().numpy(), samplerate=sampling_rate, blocking=True)
# save the audio; the signal needs to be a 2D tensor (channels, samples)
torchaudio.save("output_t5_ft_cv16_id.wav", speech.unsqueeze(0), sampling_rate)
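
The script imports `create_speaker_embedding` from a local `utils` module that is not part of this card. The block below is a minimal sketch of what such a helper could look like, not the author's actual implementation. It assumes the SpeechBrain x-vector encoder `speechbrain/spkrec-xvect-voxceleb` (the card only says that SpeechBrain is used to obtain an x-vector, so the specific encoder is an assumption), and the names `load_speaker_model`, `_speaker_model`, and the optional `speaker_model` argument are hypothetical.

```python
# utils.py (hypothetical sketch)
import torch

# Assumption: this x-vector encoder is used; the card does not name one.
SPK_MODEL_NAME = "speechbrain/spkrec-xvect-voxceleb"

_speaker_model = None  # cached encoder, loaded on first use


def load_speaker_model(device="cpu"):
    """Load the SpeechBrain speaker encoder (downloads weights on first call)."""
    try:
        # import path in SpeechBrain 1.0 and later
        from speechbrain.inference.speaker import EncoderClassifier
    except ImportError:
        # import path in older SpeechBrain releases
        from speechbrain.pretrained import EncoderClassifier
    return EncoderClassifier.from_hparams(
        source=SPK_MODEL_NAME,
        run_opts={"device": device},
    )


def create_speaker_embedding(waveform, speaker_model=None):
    """Return an L2-normalized speaker embedding for a 1D waveform."""
    global _speaker_model
    if speaker_model is None:
        if _speaker_model is None:
            _speaker_model = load_speaker_model()
        speaker_model = _speaker_model
    with torch.no_grad():
        wav = torch.as_tensor(waveform, dtype=torch.float32)
        embedding = speaker_model.encode_batch(wav)
        # normalize along the feature dimension, then drop singleton dimensions
        embedding = torch.nn.functional.normalize(embedding, dim=2)
    return embedding.squeeze().cpu().numpy()
```

With this file saved next to the inference script, `create_speaker_embedding(audio["array"])` should return a one-dimensional NumPy array that the script then wraps with `torch.tensor(...).unsqueeze(0)`.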
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 4000
- mixed_precision_training: Native AMP
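
The relationship between some of these values can be checked with a few lines of arithmetic. The sketch below reproduces the total batch size and traces the learning rate under a linear schedule with warmup. It assumes training on a single device (consistent with 4 × 8 = 32) and a schedule that rises linearly from zero over the warmup steps and then decays linearly to zero at the final step; it illustrates the shape of the schedule and is not the library's own code. The function names are illustrative.

```python
# Hyperparameters copied from the list above.
LEARNING_RATE = 1e-5
TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 8
WARMUP_STEPS = 500
TRAINING_STEPS = 4000
NUM_DEVICES = 1  # assumption: single device, consistent with 4 * 8 = 32


def effective_batch_size(per_device, accumulation, devices=NUM_DEVICES):
    """Number of examples that contribute to one optimizer step."""
    return per_device * accumulation * devices


def linear_schedule_lr(step, base_lr=LEARNING_RATE,
                       warmup=WARMUP_STEPS, total=TRAINING_STEPS):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


if __name__ == "__main__":
    print("total_train_batch_size:",
          effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))
    for step in (0, 250, 500, 2250, 4000):
        print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 1e-05 at step 500 and falls back to half of that value by step 2250.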
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.5394 | 4.28 | 1000 | 0.4908 |
| 0.5062 | 8.56 | 2000 | 0.4730 |
| 0.5074 | 12.83 | 3000 | 0.4700 |
| 0.5023 | 17.11 | 4000 | 0.4675 |
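
The table can also be read numerically. The sketch below, using only the values above, estimates how many optimizer steps make up one epoch and how much the validation loss changes between checkpoints. The implied number of examples per epoch is a rough estimate that assumes each step consumes exactly 32 examples; the card does not state the training set size.

```python
# (step, epoch, training loss, validation loss) rows copied from the table above.
RESULTS = [
    (1000, 4.28, 0.5394, 0.4908),
    (2000, 8.56, 0.5062, 0.4730),
    (3000, 12.83, 0.5074, 0.4700),
    (4000, 17.11, 0.5023, 0.4675),
]
TOTAL_TRAIN_BATCH_SIZE = 32  # from the hyperparameter list


def steps_per_epoch(rows):
    """Average ratio of optimizer steps to epochs across the checkpoints."""
    ratios = [step / epoch for step, epoch, _, _ in rows]
    return sum(ratios) / len(ratios)


def validation_loss_deltas(rows):
    """Change in validation loss from each checkpoint to the next."""
    losses = [row[3] for row in rows]
    return [round(b - a, 4) for a, b in zip(losses, losses[1:])]


if __name__ == "__main__":
    spe = steps_per_epoch(RESULTS)
    print("steps per epoch (approx.):", round(spe))
    print("examples per epoch (rough estimate):", round(spe * TOTAL_TRAIN_BATCH_SIZE))
    print("validation loss deltas:", validation_loss_deltas(RESULTS))
```

The deltas shrink after the first 2000 steps, which suggests that most of the improvement in validation loss happens early in training.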
Framework versions
- Transformers 4.35.2
- Pytorch 2.1.1+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0