Comparison and exchange
Hi, I am just testing finetuning on low-spec hardware, as I have nothing else available right now (RTX 4070 laptop GPU with 8 GB VRAM).
My test case was finetuning on a subset of the German Emilia dataset (about 11k samples, roughly 16 hours).
Due to my low specs, I have to train with a batch size of 1000 and have just reached ~90k updates. As this is similar to your first checkpoint upload, I am providing two samples for comparison.
Sample from your snapshot with 90000 updates: https://voca.ro/1k4E6cDTQuCP
Sample from mine with 90472 updates: https://voca.ro/18we8WKfaxwU
Input text (with same reference audio): wenn das mit der KI nur alles ein bisschen schneller gehen würde.
Wow, your 430k snapshot sounds great. The base model should read uppercase letters one by one, which is why I had "KI" in the inference test. I wonder if this could be reproduced.
Test sample from the 430k checkpoint of this model: https://voca.ro/14Xt81FNggN1
Remaining issues with your 430k checkpoint:
Code-switching test: the model lost the ability to speak English without an accent (expected, not an issue): https://voca.ro/1OOuw3UbYmok
With reference audio that is not part of German Emilia, the model produces mostly garbage (maybe overfitting?): https://voca.ro/1iLtRUnz26H9
Problems with uppercase letters (which should be spoken letter by letter), and it sometimes drops initial words.
Input: Nach dem ABC kommt nicht direkt das XYZ
: https://voca.ro/143tOYSH4Lbc
For pure German with an Emilia reference, this is the best model so far. Thanks for your work!
Update: OOD reference audio is working; it was my fault, as I supplied the wrong transcription for the reference audio "wenn das mit der KI nur alles ein bisschen schneller gehen würde"
: https://voca.ro/1abQVsrWfgQa
Thanks mame82. I have planned about 1M steps in total for finetuning. According to the TensorBoard graphs, I don't think there is any overfitting yet. Here are the finetuning settings:
{
"exp_name": "F5TTS_Base",
"learning_rate": 1e-05,
"batch_size_per_gpu": 3200,
"batch_size_type": "frame",
"max_samples": 64,
"grad_accumulation_steps": 1,
"max_grad_norm": 1,
"epochs": 96,
"num_warmup_updates": 3984,
"save_per_updates": 10000,
"last_per_steps": 2000,
"finetune": true,
"file_checkpoint_train": "",
"tokenizer_type": "pinyin",
"tokenizer_file": "",
"mixed_precision": "bf16",
"logger": "tensorboard"
}
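Since "batch_size_type" is "frame", the 3200 above counts mel-spectrogram frames rather than utterances. A rough back-of-the-envelope check of what that means in audio time, assuming the F5-TTS base defaults of 24 kHz sample rate and a hop length of 256 (my assumption, not verified against this exact config):

```python
# Rough estimate of how much audio one update sees with a frame-based batch.
# Assumed F5-TTS base defaults: 24 kHz sample rate, hop length 256.
sample_rate = 24_000
hop_length = 256
frames_per_second = sample_rate / hop_length  # 93.75 mel frames per second

batch_frames = 3200  # batch_size_per_gpu from the config above
seconds_per_update = batch_frames / frames_per_second
print(f"~{seconds_per_update:.1f} s of audio per update")  # ~34.1 s
```

Under those assumptions, each update covers roughly half a minute of speech, which helps when comparing update counts across configs with different frame budgets.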
My assumption about overfitting to the dataset speakers was wrong. It was an issue on my end, as I applied the wrong transcription to the reference audio in the test (I updated my last post with this info).
I wonder why vocab.txt still has to contain all of the non-Latin entries. Are you planning to keep ZH support in your model? If it is meant to be German-only, do you think the large vocab imposes a performance penalty during inference (compared to a reduced vocab)?
Thank you for your efforts, again
This is actually my third training attempt. The first one, when the F5-TTS model first came out, failed. For the second, I built the vocab.txt content only from the transcripts, which reduced the vocab by about 90%; I did pretraining and failed again. This third attempt seems to be successful so far. I can't be sure about performance until I test it, but I don't expect much change: the trained model sizes are almost the same for both vocabs, so the tensor sizes should also be almost the same, and there shouldn't be a big difference in inference speed.
Hello, I tried to use your model in the SWivid/F5-TTS webui but got this error. Could you maybe help me? Where should I put vocab.txt?
_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CFM:
size mismatch for transformer.text_embed.text_embed.weight: copying a param with shape torch.Size([3125, 512]) from checkpoint, the shape in current model is torch.Size([2546, 512]).
Comparing your checkpoints (430k steps and 1M steps) to my experiment (120k steps on 16 h German Emilia, then 40k steps on 40 h German Emilia):
wenn das mit der KI nur ein bisschen schneller gehen würde
(out of dataset reference audio)
Your 430k:
Your 1m:
My experiment (~160k):
Second test: Ein schwieriger Satz mit Dornröschen und der Abkürzung CDU
Your 430k:
Your 1m:
My experiment:
I think the models need additional training on uppercase letters, which should be spoken letter by letter (it basically works, but still uses the English pronunciation of the letters). I have no idea if special pronunciations of German words could be addressed with a generic approach (like enforcing the correct pronunciation of "Dornröschen" with input like "Dornrös-chen"), but I assume such words with irregular pronunciation have to be put into the training set, too.
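Until the letter-by-letter issue is trained away, one possible workaround is text preprocessing: spell out all-caps acronyms with German letter names before inference, so the model never sees the raw acronym. A minimal sketch; the phonetic spellings below are my own rough approximations, not from the repo:

```python
import re

# Rough German letter-name spellings for preprocessing (my own approximations).
GERMAN_LETTERS = {
    "A": "ah", "B": "beh", "C": "zeh", "D": "deh", "E": "eh", "F": "eff",
    "G": "geh", "H": "hah", "I": "ih", "J": "jot", "K": "kah", "L": "ell",
    "M": "emm", "N": "enn", "O": "oh", "P": "peh", "Q": "kuh", "R": "err",
    "S": "ess", "T": "teh", "U": "uh", "V": "fau", "W": "weh", "X": "iks",
    "Y": "ypsilon", "Z": "zett",
}

def spell_out_acronyms(text: str) -> str:
    """Replace standalone all-caps runs (2+ letters) with spelled-out names."""
    def expand(match: re.Match) -> str:
        return " ".join(GERMAN_LETTERS[c] for c in match.group(0))
    return re.sub(r"\b[A-Z]{2,}\b", expand, text)

print(spell_out_acronyms("die Abkürzung CDU"))  # -> "die Abkürzung zeh deh uh"
```

This would cover "KI", "ABC", "XYZ" and "CDU" from the tests above, while leaving ordinary capitalized words like "Abkürzung" untouched (the pattern only matches runs of two or more uppercase letters).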
Again, thank you for this fantastic model