Finetuned Voices Not Working for Me - Docker
First of all you are a legend and actually bringing a dream of mine to life so thanks for your work!
The default voice works great for me, but I am struggling to get the fine-tuned models you have available to work. I tried pasting the link as well as downloading and adding the files manually. But it either gets stuck at the final stage (never produces an output) or just uses the default voice. Should I be uploading a WAV file to the target voice file section as well as adding the appropriate files to the custom model section?
Thanks!
Docker - Windows 10 - RTX 3090
You should be uploading a WAV file to the target voice file section, as well as adding the model files to the custom model section.
Use the included `ref.wav` as the target voice file:
https://huggingface.co/drewThomasson/Xtts-Finetune-Bryan-Cranston/blob/main/V2_Xtts-Finetune-Bryan-Cranston/ref_audio_for_v2.wav
The `ref.wav` is needed because XTTS is a voice-cloning model; fine-tuning it on a voice just makes it a LOT better at cloning that specific voice.
When using this model you should paste the direct download link to the `Finished_model_files.zip` into the link field in ebook2audiobookxtts. For example, for this model's V2 you would paste --->
https://huggingface.co/drewThomasson/Xtts-Finetune-Bryan-Cranston/resolve/main/V2_Xtts-Finetune-Bryan-Cranston/Finished_model_files.zip?download=true
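If you'd rather fetch the model in a script instead of pasting the link into the GUI, here's a minimal stdlib-only sketch (the URL is the one above; the local paths are just examples, not anything ebook2audiobookxtts requires):

```python
import urllib.request
import zipfile
from pathlib import Path

def download_model_zip(url: str, dest: Path) -> Path:
    """Download the fine-tuned model archive to a local file."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest

def extract_model(zip_path: Path, out_dir: Path) -> list:
    """Unpack the archive and return the names of the extracted files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()

# Example usage (needs network access; paths are illustrative):
# url = ("https://huggingface.co/drewThomasson/Xtts-Finetune-Bryan-Cranston/"
#        "resolve/main/V2_Xtts-Finetune-Bryan-Cranston/"
#        "Finished_model_files.zip?download=true")
# files = extract_model(download_model_zip(url, Path("models/cranston.zip")),
#                       Path("models/cranston"))
```

The extracted folder is what you'd point the custom model section at if you go the manual route instead.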
Thank you!!
Also, what did you mean by "But it either gets stuck at the final stage (never produces an output)"? Like, it gets stuck at the final combining-audiobook stage? Do you not see the output in the web GUI when you click the Download Audiobook Files button?
If you're trying to generate the same book again without re-launching the Docker image to reset it, then you might need to check the terminal to see what it's outputting, because it might be asking whether you want to overwrite the old audiobook with the new one. You'll have to respond with y/n in the terminal if that's the case lol. If you don't tell it what to do, it'll just keep waiting for your response forever lol
That's correct, the "90%" portion, I forget exactly what the terminal was saying at that point but it was consistent. No output in the web gui
lol It wasn't giving the over-write prompt (I got that one before). I will let you know if it happens again.
Hm, well if you get the getting-stuck issue again, send a screenshot of what the web GUI looks like and what the terminal says is going on lol
Well, I can confirm that the model works on my end at least lol. Just used it to make Bryan Cranston read the short story "The Tell-Tale Heart".
Working now!
Any reason why some voices are more consistent than others? Bryan Cranston seems to have fewer defects than David Walliams, while the default female speaker has the fewest hallucinations of all three.
Is there a speaker you know of that has the most reliable output? I am using the same text as input btw (1 paragraph)
HMMMMM, I mostly forget, tbh. But it also depends on the input ref audio? idk, noise and stuff in it, I guess? It's mostly an art of just messing around with them.
Perhaps in the future I'll find some automated way to determine how much each model hallucinates, using Whisper and such, and put those ratings in the model README.
The David Attenborough is pretty good also lol
Example: David reading "The Tell-Tale Heart"
Also:
If you go into the audio_generation_settings section in the Gradio interface, you can turn the temperature all the way down; that should make it hallucinate less.
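For a sense of why lowering temperature helps: the model samples each output token from a softmax distribution, and dividing the logits by a smaller temperature sharpens that distribution toward the most likely token, so the output wanders less. A toy illustration in plain Python (not the actual XTTS sampling code):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max so exp() never overflows
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # probability mass fairly spread out
print(softmax_with_temperature(logits, 0.2))  # mass concentrates on the top token
```

At temperature 0.2 the top token gets nearly all the probability, which is why very low settings sound more stable (though potentially flatter).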
Nice, yeah from the ones I have tested, Attenborough and the original voice seem to generate the best results.
Any correlation between model size and quality I wonder?
Thanks for the tip on adjusting the temperature, my next step is to mess around with all of those settings!
What do you mean by model size? All of these are just fine-tuned XTTS v2 models; the parameter count never changes lol
Oh, my bad, I meant the dataset size. It seemed like there was a relationship between the dataset size and the quality of the output, but that could be completely coincidental. The quality of the data might be more important than its size. But the issues I was running into seemed different from having "bad" audio for training: I would much rather have subpar audio than glitched audio (hallucinations/artifacts), which was the issue I was running into.
Is the base model (the default one) xtts v2 without any fine tuning? Or are you applying a fine tune to that one?
The base model isn't fine-tuned; it's just the normal XTTS v2.
The quality of a fine-tune appears to be directly associated with:
- Dataset size: a max of about 40 minutes is what I have found to be good; once the training dataset for a single voice gets larger than that, the model becomes "overfit", leading to more weird hallucinations.
- Dataset quality: make sure the audio is clean and there aren't any background sounds, and that it isn't noisy, so denoise it.
- Dataset quality, volume-wise: it seems that normalizing the training input also helps, because if the volume fluctuates too much in the voice you're training on, the model gets issues from that.
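The normalization point above can be sketched in a few lines. This is simple peak normalization on raw float samples, just to show the idea; a real dataset-prep pipeline would more likely run something like ffmpeg's loudnorm filter or pydub over the actual WAV files:

```python
def peak_normalize(samples, target_peak=0.95):
    """Rescale audio samples (floats in [-1.0, 1.0]) so the loudest one
    hits target_peak. Quiet clips get boosted and hot clips get pulled
    down, so loudness stays consistent across the training dataset."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

quiet_clip = [0.1, -0.2, 0.15]
print(peak_normalize(quiet_clip))  # loudest sample is now at -0.95
```

Running every clip through the same target level is what keeps the model from seeing wildly different volumes for the same voice.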