
One voice for multiple voiceovers

#7
by koplenov - opened

Hi, I'm trying to make a service for voicing documents.
I'm splitting the text into sentences and voicing each one separately, but here's the problem - the voice is different for every sentence.

Is it possible to set some kind of speaker ID (SID) for voice generation, for more control over streaming?

One thing that will help somewhat is to fix the seed.
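
For what it's worth, here is a minimal sketch of that workaround for the sentence-by-sentence case: re-seeding before every generate call so each sentence starts from the same RNG state. The checkpoint name and voice description below are just placeholders, and some voice drift can still remain:

import numpy as np
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, set_seed

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

# one fixed voice description, reused for every sentence
description = "A female speaker with a calm voice delivers her words in very clear audio."
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)

chunks = []
for sentence in ["First sentence of the document.", "Second sentence of the document."]:
    set_seed(42)  # reset the RNG to the same state before each sentence
    prompt_input_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, min_length=10)
    chunks.append(generation.cpu().numpy().squeeze())

sf.write("document.wav", np.concatenate(chunks), model.config.sampling_rate)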

I have the same question; without being able to pick the voice, it's not a practical TTS model for any serious usage.

Parler TTS org

Hey @koplenov and @juang3d , thanks for opening the issue!
It's a problem we're aware of, and one we'll be trying to solve for V1.
It's still very preliminary, but I've also experimented with fine-tuning to get consistent voices. I've fine-tuned the model on the 30-hour single-speaker high-quality Jenny dataset and got the following checkpoint: ylacombe/parler-tts-mini-jenny-30H.
Usage is more or less the same as for Parler-TTS v0.1; just specify the keyword “Jenny” in the voice description:

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, set_seed
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# load the fine-tuned checkpoint and its tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("ylacombe/parler-tts-mini-jenny-30H").to(device)
tokenizer = AutoTokenizer.from_pretrained("ylacombe/parler-tts-mini-jenny-30H")

prompt = "Hey, how are you doing today? My name is Jenny, and I'm here to help you with any questions you have."
description = "Jenny speaks at an average pace with an animated delivery in a very confined sounding environment with clear audio quality."

# the voice description and the transcript are tokenized separately
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

set_seed(42)
# specify min length to avoid 0-length generations
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, min_length=10)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

Some samples: (audio players embedded in the original post)

Let me know if this helps!

Interesting, but what about other voices besides "Jenny"?

@ylacombe Can you please provide a Colab for your fine-tuning workflow?

@ylacombe I ran the demo above and hit this error:
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Using the model-agnostic default max_length (=2580) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation.
Calling sample directly is deprecated and will be removed in v4.41. Use generate or a custom generation loop instead.
--- Logging error ---
Traceback (most recent call last):
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/logging/init.py", line 1100, in emit
msg = self.format(record)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/logging/init.py", line 943, in format
return fmt.format(record)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/logging/init.py", line 678, in format
record.message = record.getMessage()
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/logging/init.py", line 368, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/home/leyuan/VivaConversion/research/StyleTTS2/parler.py", line 19, in
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, min_length=10)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/parler_tts/modeling_parler_tts.py", line 2608, in generate
outputs = self.sample(
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2584, in sample
return self._sample(*args, **kwargs)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2730, in _sample
logger.warning_once(
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/transformers/utils/logging.py", line 329, in warning_once
self.warning(*args, **kwargs)
Message: 'eos_token_id is deprecated in this function and will be removed in v4.41, use stopping_criteria=StoppingCriteriaList([EosTokenCriteria(eos_token_id=eos_token_id)]) instead. Otherwise make sure to set model.generation_config.eos_token_id'
Arguments: (<class 'FutureWarning'>,)

@Leyuan
Try updating transformers. This worked for me:
pip install --no-cache-dir transformers==4.35.0

Adding to the previous comment: it seems @ylacombe 's config JSONs point at "transformers_version": "4.38.2"
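
To check which version is actually installed locally:

import transformers
print(transformers.__version__)  # compare against the 4.38.2 in the checkpoint's config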

@ylacombe - could you expand on how you fine-tuned Jenny? Did you put "Jenny" into the description?
Do you have a sample fine-tuning script? Highly appreciated :)

Parler TTS org

Hey @shacharm ,

  1. I created descriptions for the Jenny TTS dataset using the guide here.
  2. Then I simply used the Parler TTS training script from the repository!

Note that I'll upload a Colab with detailed steps, but the steps above should get you started!

Thanks @ylacombe , much appreciated.
I generated a dataset of kids' show (fun) voices (a single voice).
Onward to training.

Thanks!

Parler TTS org

Hey @shacharm , thanks for the info, don't hesitate to share voice samples, or even better, the fine-tuned model!

EDIT: I can see the bug below was already fixed here.
Verified on my dataset - text_description is generated correctly.


@ylacombe - at the very end of the dataspeech process, I believe there's a small bug in dataspeech / run_prompt_creation_single_speaker.py / prepare_dataset, which causes the speaker_name not to appear in the final tagged dataset.

It's
sample_prompt.replace(f"[speaker_name]", data_args.speaker_name)

and should be
sample_prompt = sample_prompt.replace(f"[speaker_name]", data_args.speaker_name)

Otherwise "text_description"'s "[speaker_name]" isn't replaced.

A Parler fine-tuning question:
I've set --train_dataset_config_name "default", but I'm unsure what train_dataset_config_name is.
Although it's optional, run_parler_tts_training.py crashes without it.

What's its purpose?
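
My current guess (an assumption, not verified against the script) is that it's simply forwarded to datasets.load_dataset as the configuration name; single-config Hub datasets expose exactly one config called "default":

from datasets import load_dataset

# "your-username/your-tagged-dataset" is a hypothetical placeholder repo
ds = load_dataset("your-username/your-tagged-dataset", name="default", split="train")
print(ds)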

Hi guys,
If this is of any use/interest, I fine-tuned the model on LibriTTS speaker 0 (en_US-libritts-high, p3922):
GrigoriiA/parler-tts-mini-Libretta-v0.1
For the dataset, I took Jenny's texts and had Piper-TTS speak them in the desired voice. I followed the fine-tuning instructions and basically everything just worked.
Except for two things:

  1. There was a problem generating audio for 5 or so texts due to a missing-phoneme issue. I just had to exclude those texts.
  2. During dataset tagging, the noise was always detected as high, noisy, etc. Since it was just pure Piper dictation, I manually set all noise labels to "quite clear" (a sketch of one way to do this follows this list).
    The first run with all the quirks took a whole day, but I guess it would now take me around a couple of hours to do the same (dataset generation + training). The training itself took a bit less than 1 hour on a rented RTX 4090 in RunPod's community cloud, so it would have cost about $0.50 if I hadn't made mistakes along the way (like "not enough free space" - the whole process requires up to 40 GB of disk space).
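
A hedged sketch of forcing that label with datasets.map (the "noise" column name is an assumption about the dataspeech output, and the repo name is a placeholder):

from datasets import load_dataset

ds = load_dataset("your-username/your-tagged-dataset", split="train")
# overwrite the automatically detected noise label on every row
ds = ds.map(lambda example: {"noise": "quite clear"})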

One observation: it seems like whatever I tell the model, the voice pretty much stays the same. Is it overtraining? Or is it a dataset problem?
I would appreciate it if someone competent took a look at GrigoriiA/libretta-tts-21k-tagged. I'm sure other people will run into similar problems too.

Audio samples (players embedded in the original post):

Piper original

Parler-Libretta: "Libretta asks a question in a low voice, almost whispering"

Parler-Libretta 1: "A female speaker with a slightly low-pitched voice delivers her words quite expressively."

Parler-Libretta: "A male speaker with a slightly low-pitched voice delivers his words quite expressively."

Parler-Libretta: "A female speaker with a very high-pitched voice speaks very fast."

Parler-Libretta: "A happy and cheerful female speaker is speaking extremely slowly."

Can it be the same situation as with LLMs? Perhaps we should not fine-tune solely on our own datasets; rather, we should mix the fine-tuning dataset with the original Parler dataset in some ratio (like 50/50, 25/75, etc.). That way the model won't forget its original training data and won't lose the ability to produce other voices.

@ylacombe can you advise on the above?
And an extra small question - in your opinion how much voice data (in minutes/hours) would one need to fully train the model?
Thanks.

Parler TTS org

Hey @GrigoriiA , I believe the model likely overfit. How many hours of training data are you using? Feel free to send your training logs as well!

Note though that the model as it is can't generate whispering or emotions, since they were not labeled in the training dataset, so that won't work anyway.

With regards to your last question: to fully train from scratch, I'd say at least 1k hours (to get somewhat decent results), and more is better. To fine-tune, you would do fine with 6h, maybe even less. I hope that helps!

Parler TTS org

BTW, here is a fine-tuning guide to reproduce fine-tuning on a single speaker dataset, using a free colab GPU: https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/Finetuning_Parler_TTS_on_a_single_speaker_dataset.ipynb

@ylacombe yeah, I actually followed your script and tried different amounts of training data - from 30 hours down to around 6.
What I had missed was supplying "gender" in my dataset. After I fixed that, and also mixed your original training dataset into mine, the new model started to behave as expected.
My final training dataset was 55% my data (4100 records) and 45% old data (1164 + 332 + 1000 + 1000 = ~3500 records of MLS + Libri).
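
A rough sketch of this kind of mixing with the datasets library (both repo names here are placeholders):

from datasets import load_dataset, concatenate_datasets

mine = load_dataset("your-username/my-voice-tagged", split="train")         # ~4100 records of new data
orig = load_dataset("your-username/parler-original-subset", split="train")  # ~3500 records of MLS + Libri
mixed = concatenate_datasets([mine, orig]).shuffle(seed=42)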

I noticed your new "expresso" release, where you also mixed datasets (old + new), so my intuitive guess was right.
I also noticed that Expresso contains the Jenny dataset, but her voice is not working in this set:

prompt = "Hey, how are you doing today? My name is Jenny, and I'm here to help you with any questions you have."
description = "Jenny speaks at an average pace with an animated delivery in a very confined sounding environment with clear audio quality."

I retrained Expresso on your data plus my voice data, with the same effect on my speaker's name. Perhaps names don't get enough attention during training?

I sincerely applaud the results you and your team are getting with your training, and also how you share all the steps and datasets. It opens the road for a lot of people to build and create.

How can I add Indonesian to Parler?
