Thanks for the model and Q on fine-tuning

#3
by RonanMcGovern - opened

The voice on this model is nice! (and nice work creating the range of models you have).

I've tried some different speaker embeddings on the base speecht5 model, and the speech sounds quite flat (example below).

  1. Would you say speecht5 is close to state of the art?
  2. For fine-tuning, do you fully fine-tune, or do you use LoRA (is that even possible)?
  3. The max length seems to be 600 tokens, but would you say it's best to stay well below that?

Thanks


English example, base speecht5:

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf
from datasets import load_dataset
from IPython.display import Audio

# Use the GPU if CUDA is available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the processor, model, and vocoder (the processor runs on CPU; move the models to the device)
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

# Read text from 'raw.txt' file
with open('raw.txt', 'r') as file:
    text = file.read()

# Prepare inputs
inputs = processor(text=text, return_tensors="pt").input_ids.to(device)

# Load speaker embeddings (example, you might need to adjust this)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[2300]["xvector"]).unsqueeze(0).to(device)

# Generate speech
with torch.no_grad():
    speech = model.generate_speech(inputs, speaker_embeddings=speaker_embeddings, vocoder=vocoder)

# Move generated speech to CPU and convert to numpy for saving
speech_audio = speech.cpu().numpy()

# Save the audio
sf.write("speech.wav", speech_audio, samplerate=16000)

# Play the audio
Audio("speech.wav")

Thank you @RonanMcGovern !

For me, the following speaker embedding worked quite well across all the languages I worked with:

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7440]["xvector"]).unsqueeze(0)

I tried some other embeddings, but for some reason the one above worked better across all those languages.

Regarding your questions:

  1. I think speecht5 is a great step towards state of the art, but there is certainly still room for improvement, and there have been developments over the last few months that improve quality, both from Microsoft (https://www.microsoft.com/en-us/research/project/speechx/) and building on the latest progress in LLMs.
  2. I fully fine-tuned the models. The speecht5 models are not that big, so it is not hard to fit them in GPU memory, and training does not take long assuming a normal amount of data (audio).
  3. It depends on the use case. In general I think it is easier for the model to learn and perform well with shorter audio clips (and therefore fewer tokens); for inference you could still break longer inputs into chunks and parallelise generation (see the sketch after this list), but of course it depends on the exact use case.
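A rough, untested sketch of that chunking idea, reusing the processor, model, vocoder, and speaker_embeddings variables from the snippet earlier in the thread; the synthesise_in_chunks helper and the naive split on "." are just for illustration, not part of the speecht5 API:

import numpy as np
import torch

def synthesise_in_chunks(text, processor, model, vocoder, speaker_embeddings, device="cpu"):
    # Naive split on "."; a proper sentence tokenizer (e.g. nltk) would be more robust
    chunks = [s.strip() for s in text.split(".") if s.strip()]
    waveforms = []
    for chunk in chunks:
        input_ids = processor(text=chunk, return_tensors="pt").input_ids.to(device)
        with torch.no_grad():
            speech = model.generate_speech(input_ids, speaker_embeddings, vocoder=vocoder)
        waveforms.append(speech.cpu().numpy())
    # Concatenate the per-chunk waveforms into one audio signal
    return np.concatenate(waveforms)

speech_audio = synthesise_in_chunks(text, processor, model, vocoder, speaker_embeddings, device=device)

The chunks could also be generated in parallel (e.g. in batches or across workers) and stitched together afterwards.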

Hope it helps!

Those answers are much appreciated, thanks Sandiago. I just tried out that embedding and it's not bad.

Do you have any experience with, or a recommendation for, a package to create an embedding for a voice?

One challenge I see in creating embeddings is that many audio files contain two speakers, and it can be hard to separate them out. That is, it can be easy to get voice data for public figures (and then transcribe it with Whisper), but probably harder to extract a single voice.
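For that separation step, a speaker-diarization toolkit such as pyannote.audio might be one option. A rough, untested sketch; the file name and token are placeholders, and the pipeline requires accepting the model terms on the Hub:

from pyannote.audio import Pipeline

# Placeholder token and file name; the gated pipeline needs a valid HF access token
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = pipeline("interview.wav")

# Print the time ranges attributed to each speaker label
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")

From there, the segments for the target speaker could be cut out (e.g. with soundfile) before computing an embedding or running Whisper.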

Thanks, Ronan

You are welcome @RonanMcGovern ! Glad I could help :)

Unfortunately not, as so far I have always used the ones provided by "Matthijs/cmu-arctic-xvectors". Do you want to create embeddings for a specific voice?

Yeah exactly, I'd like to create embeddings for a specific voice, given some audio of that voice (and using Whisper to make the transcript).

Then a potential approach would be to start by creating some initial embeddings from the audio of that voice, and then use those trained embeddings to create augmented data and train further. That is: use the initial audio of that voice to train initial voice embeddings; then use a text-to-speech model, some texts (e.g. random texts from various audio datasets, for diversity), and your voice embeddings to generate additional audio; and then train your voice embeddings further on that. A loop that hopefully gives better fine-tuned voice embeddings.
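For the first step (turning raw audio of the target voice into a speaker embedding), one option is a SpeechBrain x-vector encoder such as speechbrain/spkrec-xvect-voxceleb, which produces the same kind of 512-dimensional embedding that generate_speech expects. A rough sketch, assuming a 16 kHz recording of the target speaker ("target_voice.wav" is a placeholder):

import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained x-vector speaker encoder
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

# Placeholder: a recording of the target speaker
signal, sample_rate = torchaudio.load("target_voice.wav")
if signal.shape[0] > 1:                          # down-mix to mono if needed
    signal = signal.mean(dim=0, keepdim=True)
if sample_rate != 16000:                         # the encoder expects 16 kHz audio
    signal = torchaudio.functional.resample(signal, sample_rate, 16000)

with torch.no_grad():
    embedding = classifier.encode_batch(signal)                  # shape [1, 1, 512]
    embedding = torch.nn.functional.normalize(embedding, dim=2)
    speaker_embeddings = embedding.squeeze(1)                    # shape [1, 512]

That speaker_embeddings tensor can then be passed to generate_speech in the same way as the cmu-arctic-xvectors ones above.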
