---
language:
- id
license: mit
base_model: microsoft/speecht5_tts
tags:
- text-to-speech
datasets:
- mozilla-foundation/common_voice_16_1
model-index:
- name: speecht5_finetuned_commonvoice_id
  results: []
---

# speecht5_finetuned_commonvoice_id

This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the mozilla-foundation/common_voice_16_1 dataset.
It achieves the following results on the evaluation set:
- Loss: 0.4675

## How to use/inference

Follow the example below and adapt it to your own needs.

```python
# ft_t5_id_inference.py
import sounddevice as sd
import torch
import torchaudio
from datasets import Audio, load_dataset
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    SpeechT5Processor,
)

from utils import create_speaker_embedding

# load the dataset and the fine-tuned model
dataset = load_dataset(
    "mozilla-foundation/common_voice_16_1", "id", split="test")
model = SpeechT5ForTextToSpeech.from_pretrained(
    "Bagus/speecht5_finetuned_commonvoice_id")

# load the processor from the base checkpoint
checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)

# resample the audio column to the model's sampling rate (16 kHz)
sampling_rate = processor.feature_extractor.sampling_rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))


def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        text=example["sentence"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )

    # strip off the batch dimension
    example["labels"] = example["labels"][0]

    # use SpeechBrain to obtain an x-vector speaker embedding
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])

    return example


# prepare the speaker embeddings from a dataset example
example = prepare_dataset(dataset[30])
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

# prepare the text to be converted to speech
# ("Saya suka baju yang berwarna merah tua." = "I like dark red clothes.")
text = "Saya suka baju yang berwarna merah tua."
inputs = processor(text=text, return_tensors="pt")

# generate the waveform with the HiFi-GAN vocoder and play it
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
speech = model.generate_speech(
    inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

sd.play(speech, samplerate=sampling_rate, blocking=True)

# save the audio; torchaudio expects a 2D (channels, samples) tensor
torchaudio.save("output_t5_ft_cv16_id.wav", speech.unsqueeze(0), sampling_rate)
```

The script imports `create_speaker_embedding` from a local `utils` module that is not shipped with this repository; a sketch of such a helper is given at the end of this card.

### Training hyperparameters

The following hyperparameters were used during training (a sketch of how they map onto `Seq2SeqTrainingArguments` is also given at the end of this card):
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 4000
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.5394        | 4.28  | 1000 | 0.4908          |
| 0.5062        | 8.56  | 2000 | 0.4730          |
| 0.5074        | 12.83 | 3000 | 0.4700          |
| 0.5023        | 17.11 | 4000 | 0.4675          |

### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.1+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0
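
### Note: `create_speaker_embedding` helper

The inference script above imports `create_speaker_embedding` from a local `utils` module that is not published here. Below is a minimal sketch of such a helper, following the SpeechBrain x-vector recipe used in the Hugging Face SpeechT5 fine-tuning guide; the `speechbrain/spkrec-xvect-voxceleb` encoder, the save directory, and the normalization step are assumptions, since the actual `utils` module is not part of this card.

```python
# utils.py -- hypothetical sketch of the helper imported by the inference
# script; this card does not include the author's real implementation.
import torch
from speechbrain.pretrained import EncoderClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

# pre-trained x-vector speaker encoder; its 512-dim output matches the
# speaker-embedding size SpeechT5 expects
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    run_opts={"device": device},
    savedir="/tmp/spkrec-xvect-voxceleb",
)


def create_speaker_embedding(waveform):
    """Return a 512-dim x-vector (numpy array) for a mono 16 kHz waveform."""
    with torch.no_grad():
        embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        # L2-normalize, then drop the batch dimensions -> shape (512,)
        embeddings = torch.nn.functional.normalize(embeddings, dim=2)
        embeddings = embeddings.squeeze().cpu().numpy()
    return embeddings
```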
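
### Note: hyperparameters as `Seq2SeqTrainingArguments`

For reproduction, the training hyperparameters listed above correspond to a `Seq2SeqTrainingArguments` configuration along the following lines. Only the listed values come from the actual run; `output_dir` is a placeholder, and the Adam settings shown in the card are the Transformers defaults, so they need no explicit arguments.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_commonvoice_id",  # placeholder
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # effective train batch size: 4 * 8 = 32
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=4000,
    seed=42,
    fp16=True,  # "Native AMP" mixed-precision training
)
```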