Training Code

#1
by yukiarimo - opened

Hello!

Can you please share the exact training code you used for this model? I am very happy that LJSpeech worked!

I can try training on a more professional dataset we have recorded, maybe add some language/speaker embeddings and Vocos (or something faster?) for decoding, and see how it goes.

Because if the architecture is working for you, it means the world!

Note: I am assuming you initialized from random weights and this is pure training, with no SSL model used.

Hey, thanks for checking the model out.

I put the training code here https://github.com/Banaxi-Tech/bananamind-tts-v1-training-code

The model was trained on LJSpeech. It is a small custom TTS model, not something based on a big pretrained SSL model.

I think the results would get much better with a cleaner dataset. Speaker or language embeddings would also be interesting to try next. I also want to experiment more with the decoder/vocoder side, maybe Vocos.

If you test it or train your own version, I would be interested to see the results.

Hello again! I have successfully modified the architecture (added speaker & language embeddings), switched to API, and it indeed produces more emotional results than VITS!

I used my 48 kHz two-speaker English dataset (but kept the language embedding open to more languages for fine-tuning) to train both this architecture and Vocos (identical datasets & hyperparameters).

Skipping postnet improves quality at inference. However, one issue remains:

Vocos is trained on REAL audio (and encode-decode is audibly lossless, I’ve tested), but synthesized audio sounds a bit muffled. Do you have any ideas how this is solvable?

Hey, nice, that is really interesting.

One thing I would check first: did you change all the audio settings from my original 22.05 kHz setup to 48 kHz? My training code was made around 22.05 kHz, so if sample rate / n_fft / hop length / win length / mel f_max are still partly using the old values, that could cause exactly this kind of muffled result.
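A quick way to catch a half-migrated setup is to compare the new config against the old one field by field. This is only a minimal sketch: the `AudioConfig` class, the `stale_fields` helper, and the 22.05 kHz values below are assumptions (common LJSpeech-style defaults), not values read from the training repo.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AudioConfig:
    sample_rate: int
    n_fft: int
    hop_length: int
    win_length: int
    f_max: float

    @property
    def hop_ms(self) -> float:
        # Time covered by one hop, in milliseconds.
        return 1000.0 * self.hop_length / self.sample_rate

    @property
    def nyquist(self) -> float:
        # Highest frequency the sample rate can represent.
        return self.sample_rate / 2


# Assumed old setup (typical 22.05 kHz values, NOT taken from the repo).
OLD = AudioConfig(22050, 1024, 256, 1024, 8000.0)


def stale_fields(new: AudioConfig, old: AudioConfig) -> list[str]:
    """Names of fields left at the old value although the sample rate changed."""
    if new.sample_rate == old.sample_rate:
        return []
    return [
        f.name
        for f in fields(new)
        if f.name != "sample_rate" and getattr(new, f.name) == getattr(old, f.name)
    ]


# Hypothetical 48 kHz config where f_max was forgotten.
half_migrated = AudioConfig(48000, 2048, 512, 2048, 8000.0)
print(stale_fields(half_migrated, OLD))                # ['f_max']
print(half_migrated.f_max, "<", half_migrated.nyquist)  # mel range stops far below Nyquist
```

A flagged field is not automatically wrong, but an `f_max` left at 8 kHz on 48 kHz audio means the mel never sees anything above 8 kHz, which would sound muffled no matter which vocoder is used.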

The muffled sound may also already come from the original model/training. You could try our Space to see if you notice the same muffled audio: https://huggingface.co/spaces/Banaxi-Tech/BananaMind-TTS-Demo
If yes, then it may be because my model is using Griffin-Lim.

One thing I would check first: did you change all the audio settings from my original.

Yes, I changed everything and optimized all parameters specifically for 48 kHz.

That may also be already from the original model/training you could try our space to see if you notice that same muffled audio.

Yes, but mine will sound the same with Griffin-Lim. However, I think it may be Vocos’s issue.

I asked an AI, and it said that the reduction factor may be the cause, but no, because even without it, the generated mels would still be a bit wobbly to Vocos, because it is not used to them.

Have you seen how previous mel-based models solved this? Because if I remember correctly, they also just used an external vocoder!

Yeah, that makes sense.

I don’t think reduction factor is the main cause either. If the predicted mels are already a bit wobbly/smooth, then Vocos is still getting something different from the real mels it was trained on.

Older mel-based TTS models did use external vocoders, but I think the important part is that the vocoder has to survive predicted mels, not only ground-truth mels. Real-mel reconstruction being good only proves analysis/synthesis works. It does not prove the vocoder is robust to the acoustic model’s errors.

So I would probably try one of these:

- train/finetune Vocos with generated/predicted mels too, not only real mels
- mix real mels and predicted mels during vocoder training
- add noise/jitter/smoothing augmentation to real mels when training Vocos, so it becomes less sensitive

VITS avoids some of this because it is more end-to-end, so the decoder/vocoder side is trained around the model’s own latent/audio distribution. With a separate mel model + Vocos, the acoustic model can make mels that look close by L1/MSE but are still bad inputs for the vocoder.
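A toy illustration of that gap, with made-up numbers rather than real model output: one log-mel frame with a fine ripple standing in for harmonic detail, and a "prediction" that is just a smoothed copy. The `detail` measure (mean difference between neighbouring bins) is an ad hoc stand-in for what the vocoder needs, not an established metric.

```python
def l1(a: list[float], b: list[float]) -> float:
    """Mean absolute error between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detail(frame: list[float]) -> float:
    """Mean absolute difference between neighbouring mel bins."""
    steps = [abs(frame[k + 1] - frame[k]) for k in range(len(frame) - 1)]
    return sum(steps) / len(steps)


def smooth(frame: list[float], kernel: int = 3) -> list[float]:
    """Moving average across mel bins."""
    half = kernel // 2
    out = []
    for k in range(len(frame)):
        window = frame[max(0, k - half): k + half + 1]
        out.append(sum(window) / len(window))
    return out


# Toy frame: falling spectral tilt plus a small alternating ripple.
real = [-2.0 - 0.05 * k + (0.2 if k % 2 == 0 else -0.2) for k in range(80)]
predicted = smooth(real)

print("L1 error:        ", round(l1(real, predicted), 3))
print("detail real:     ", round(detail(real), 3))
print("detail predicted:", round(detail(predicted), 3))
```

The L1 error stays small compared to the several-unit range of the frame, while the fine structure drops to about a third. A vocoder trained only on frames like `real` has never seen inputs like `predicted`.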

So yeah, I would not blame Vocos completely, but I also would not expect a Vocos trained only on real mels to handle predicted mels perfectly.

I'm already training one using HiFi-GAN right now. It will probably be done in 3 days. I already tested an early checkpoint, and it already sounds a lot better than Griffin-Lim!

Why HiFi-GAN instead of Vocos? I heard the main benefits of Vocos are:

  1. It is faster.
  2. Because of modeling in frequency domain, there will be no metallic buzzing, which often occurs in voices with high f0.
  3. Fewer steps are needed for convergence :)

Yeah, Vocos has those benefits, especially speed and fewer metallic artifacts.

I am not really saying HiFi-GAN is better than Vocos in general. I am trying HiFi-GAN first mostly because it is a very common baseline for mel-based TTS, and I want to check if the muffled sound is really from the acoustic model or from the vocoder setup.

I'm also trying it because it's a simpler option and easier for me to experiment with right now.

If HiFi-GAN sounds much clearer on the same predicted mels, then that tells me Vocos may be more sensitive to my predicted mel distribution. If HiFi-GAN is also muffled, then the problem is probably more in the mel prediction/model side.
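To make that comparison a bit less subjective, one rough proxy for "muffled" is how much spectral energy sits above some cutoff. This is only a sketch: the `high_band_ratio` and `diagnose` helpers, the cutoff, and the 0.5 tolerance are arbitrary assumptions, the naive DFT is only practical for short clips, and listening is still the real test.

```python
import cmath
import math


def high_band_ratio(samples: list[float], sample_rate: int, cutoff_hz: float) -> float:
    """Fraction of spectral energy at or above cutoff_hz (naive DFT, DC bin skipped)."""
    n = len(samples)
    total = high = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples))
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total else 0.0


def diagnose(vocos_ratio: float, hifigan_ratio: float, real_ratio: float,
             tolerance: float = 0.5) -> str:
    """Apply the reasoning above to ratios measured on the same utterance.

    vocos_ratio / hifigan_ratio: vocoder output from the same predicted mel.
    real_ratio: the original recording.
    """
    vocos_dull = vocos_ratio < tolerance * real_ratio
    hifigan_dull = hifigan_ratio < tolerance * real_ratio
    if vocos_dull and hifigan_dull:
        return "both dull: look at the mel prediction side"
    if vocos_dull:
        return "only Vocos dull: Vocos may be sensitive to the predicted mels"
    if hifigan_dull:
        return "only HiFi-GAN dull: look at the HiFi-GAN setup"
    return "neither dull by this measure"
```

Both vocoders should be fed the exact same predicted mels, otherwise the comparison says nothing about where the problem is.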

So for me it is more of a debugging step than a final decision. Vocos is still interesting, but HiFi-GAN is useful because many older mel-TTS models used GAN vocoders successfully, so it gives me a good comparison point.

Epoch 20 is done training, and the results already look very good: https://drive.google.com/file/d/1DAtQKa-GJibECBdaH9Av7et6FND-EzKy/view?usp=sharing

If you want, you can try this now. It's the new one with HiFi-GAN, and the training code is also updated!
https://huggingface.co/Banaxi-Tech/BananaMind-TTS-V2

Maybe you could try training with HiFi-GAN too; it seems extremely good right now.
