Comparison and exchange

#1
by mame82 - opened

Hi, I am just testing finetuning on hardware with low specs, as I have nothing else available right now (an RTX 4070 laptop with 8 GB VRAM).

My test case was finetuning with a subset of German Emilia (about 11k samples, roughly 16 hours).
Due to my low specs, I have to train with batch size 1000, and I have just reached ~90k updates. As this is similar to your first checkpoint upload, I am providing two samples for comparison.

Sample from your snapshot with 90000 updates: https://voca.ro/1k4E6cDTQuCP

Sample from mine with 90472 updates: https://voca.ro/18we8WKfaxwU

Input text (with the same reference audio): "wenn das mit der KI nur alles ein bisschen schneller gehen würde."

Wow, your 430k snapshot sounds great. The base model should read uppercase letters one by one; that's why I had "KI" in the inference test. I wonder if this could be reproduced.

Test sample from the 430k checkpoint of this model: https://voca.ro/14Xt81FNggN1

Issues left with your 430k ckpt:

  • Code-switching test: the model lost the ability to speak English without an accent (expected, not an issue): https://voca.ro/1OOuw3UbYmok

  • For reference audio that is not part of German Emilia, the model produces mostly garbage (maybe overfitting?): https://voca.ro/1iLtRUnz26H9

  • Problems with uppercase letters (they should be spoken letter by letter) and sometimes dropping initial words. Test input: "Nach dem ABC kommt nicht direkt das XYZ": https://voca.ro/143tOYSH4Lbc

For pure German with an Emilia reference, this is the best model so far. Thanks for your work.

Update: Out-of-dataset (OOD) reference audio is working; it was my fault, as I had given the wrong transcription for the reference audio. "wenn das mit der KI nur alles ein bisschen schneller gehen würde.": https://voca.ro/1abQVsrWfgQa

Thanks mame82. I have planned about 1M steps in total for finetuning. Judging by these TensorBoard graphs, I don't think there is any overfitting yet. Here are the finetuning settings:

{
    "exp_name": "F5TTS_Base",
    "learning_rate": 1e-05,
    "batch_size_per_gpu": 3200,
    "batch_size_type": "frame",
    "max_samples": 64,
    "grad_accumulation_steps": 1,
    "max_grad_norm": 1,
    "epochs": 96,
    "num_warmup_updates": 3984,
    "save_per_updates": 10000,
    "last_per_steps": 2000,
    "finetune": true,
    "file_checkpoint_train": "",
    "tokenizer_type": "pinyin",
    "tokenizer_file": "",
    "mixed_precision": "bf16",
    "logger": "tensorboard"
}
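
Since "batch_size_type" is "frame", the batch size of 3200 counts mel-spectrogram frames rather than utterances (presumably the same applies to the batch size of 1000 mentioned above). A minimal sketch of how that translates into updates per epoch, assuming F5-TTS mels at 24 kHz with hop length 256 (~93.75 frames per second); the dataset duration below is only illustrative:

import math

# Rough sketch: relate a frame-based batch size to optimizer updates per epoch.
# Assumptions (mine, not from the thread): mel frames at 24 kHz with hop
# length 256, i.e. ~93.75 frames per second of audio.
SAMPLE_RATE = 24_000
HOP_LENGTH = 256
FRAMES_PER_SECOND = SAMPLE_RATE / HOP_LENGTH  # 93.75

def updates_per_epoch(dataset_hours, batch_frames, grad_accum=1, num_gpus=1):
    total_frames = dataset_hours * 3600 * FRAMES_PER_SECOND
    return math.floor(total_frames / (batch_frames * grad_accum * num_gpus))

# Illustrative: a 16 h dataset with the settings above vs. a 1000-frame batch
print(updates_per_epoch(16, 3200))  # ~1687 updates per epoch
print(updates_per_epoch(16, 1000))  # ~5400 updates per epoch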

Screenshot from 2024-11-10 18-57-25.png

Screenshot from 2024-11-10 18-57-36.png


My assumption about overfitting to dataset speakers was wrong. It was an issue on my end, as I applied the wrong transcription to the reference audio in the test (I updated my last post with this info).

I wonder why the vocab.txt still has to contain all of the non-Latin entries. Are you planning to keep support for ZH in your model? If it is meant to be German-only, do you think the large vocab will cause a performance penalty during inference (compared to a reduced vocab)?

Thank you for your efforts, again

This is actually my third attempt at training. I did the first one when the F5-TTS model first came out, and it failed. For the second one, I built the vocab.txt content only from the transcript content, so the vocab was reduced by about 90%. I did pretraining and failed again. This third attempt seems to be successful so far. I can't be sure about the performance until I test it, but in my opinion there won't be much of a performance change, because the trained model sizes are almost the same for both models. That means the tensor sizes should also be almost the same, so there shouldn't be a big change in performance.
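
For a rough sense of scale, a back-of-the-envelope sketch (the 512-dim text embedding, the ~2.5k-entry vocab, and the ~300M total parameter count are all my assumptions for illustration, not measured values):

# Back-of-the-envelope sketch of why vocab size barely changes the model size.
# Assumed numbers: 512-dim text embedding, ~2.5k-entry vocab, and a base model
# of roughly 300M parameters.
dim = 512
full_vocab = 2546
reduced_vocab = full_vocab // 10                   # "reduced by 90%"
saved = (full_vocab - reduced_vocab) * dim
print(saved)                                       # ~1.2M parameters
print(f"{saved / 300e6:.2%} of ~300M parameters")  # well under 1%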

Hello, I tried to use your model in the SWivid/F5-TTS web UI, but I got this error. Could you maybe help me? Where should I put the vocab.txt?
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CFM:
size mismatch for transformer.text_embed.text_embed.weight: copying a param with shape torch.Size([3125, 512]) from checkpoint, the shape in current model is torch.Size([2546, 512]).

Screenshot from 2024-11-12 00-09-06.png
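
The size mismatch above almost certainly means the vocab.txt the web UI loads has a different number of entries than the one this checkpoint was trained with. A minimal sketch (the checkpoint filename is a placeholder) to check which embedding size, and therefore which vocab, a checkpoint expects:

import torch

# Hedged sketch: inspect a checkpoint to see how many text-embedding rows
# (and therefore how many vocab entries) it was trained with.
# "model_430000.pt" is a placeholder path, not the actual filename.
ckpt = torch.load("model_430000.pt", map_location="cpu")
state = ckpt.get("ema_model_state_dict") or ckpt.get("model_state_dict") or ckpt
key = next(k for k in state if k.endswith("text_embed.text_embed.weight"))
rows, dim = state[key].shape
print(f"text embedding: {rows} rows x {dim} dims")
# The vocab.txt used at inference must yield the same embedding size, so it has
# to be the one shipped with this finetuned model, not the repo default.

The F5-TTS inference CLI and Gradio app should let you point to a custom vocab file alongside a custom checkpoint (the exact option name depends on the version), rather than overwriting the default vocab.txt.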

Comparing your checkpoints (430k steps and 1M steps) to my experiment (120k steps on 16 h of German Emilia, then 40k more steps on 40 h of German Emilia):

First test: "wenn das mit der KI nur ein bisschen schneller gehen würde" (out-of-dataset reference audio)

Your 430k:

Your 1M:

My experiment (~160k):

Second test: "Ein schwieriger Satz mit Dornröschen und der Abkürzung CDU"

Your 430k:

Your 1M:

My experiment:

I think the models need additional training on uppercase letters, which should be spoken letter by letter (it basically works, but still uses the English pronunciation of the letters). I have no idea whether special pronunciation of German words could be addressed with a generic approach (like enforcing the correct pronunciation of "Dornröschen" with input like "Dornrös-chen"), but I assume such words with irregular pronunciation have to be put into the training set, too.
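
As an illustration of that "generic input" idea, here is a small preprocessing sketch (my own, not part of F5-TTS) that rewrites all-caps abbreviations as approximate German letter names before inference:

import re

# Sketch of a pre-inference text workaround: spell out uppercase abbreviations
# with (approximate) German letter names, so the model is less likely to fall
# back to English letter pronunciation. Spellings are illustrative only.
LETTER_NAMES = {
    "A": "Ah", "B": "Beh", "C": "Zeh", "D": "Deh", "I": "Ih", "K": "Kah",
    "U": "Uh", "X": "Ix", "Y": "Ypsilon", "Z": "Zett",
}

def spell_out(text):
    return re.sub(
        r"\b[A-ZÄÖÜ]{2,}\b",
        lambda m: " ".join(LETTER_NAMES.get(c, c) for c in m.group(0)),
        text,
    )

print(spell_out("Nach dem ABC kommt nicht direkt das XYZ"))
# -> "Nach dem Ah Beh Zeh kommt nicht direkt das Ix Ypsilon Zett"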

Again, thank you for this fantastic model

