Training Code
Hello!
Can you please share the exact training code you used for this model? I am very happy that LJSpeech worked!
I can try training on a more professional dataset we have recorded, maybe add some language/speaker embeddings and Vocos (or something faster?) for decoding, and see how it goes.
Because if the architecture is working for you, it means the whole world!
Note: assuming you initialized from random weights and this is pure training, no SSL model used
Hey, thanks for checking the model out.
I put the training code here https://github.com/Banaxi-Tech/bananamind-tts-v1-training-code
The model was trained on LJSpeech. It is a small custom TTS model, not something based on a big pretrained SSL model.
I think the results would get much better with a cleaner dataset. Speaker or language embeddings would also be interesting to try next. I also want to experiment more with the decoder/vocoder side, maybe Vocos.
If you test it or train your own version, I would be interested to see the results.
Hello again! I have successfully modified the architecture (added speaker & language embeddings), switched to API, and it indeed produces more emotional results than VITS!
I used my 48 kHz two-speaker English dataset (but kept the language embedding open to more languages for fine-tuning) to train both this architecture and Vocos (identical datasets & hyperparameters).
Skipping the postnet improves quality at inference. However, one issue remains:
Vocos is trained on REAL audio (and encode-decode is audibly lossless, I’ve tested), but synthesized audio sounds a bit muffled. Do you have any idea how to solve this?
Hey, nice, that is really interesting.
One thing I would check first: did you change all the audio settings from my original 22.05 kHz setup to 48 kHz? My training code was made around 22.05 kHz, so if sample rate / n_fft / hop length / win length / mel f_max are still partly using the old values, that could cause exactly this kind of muffled result.
That may also already come from the original model/training. You could try our Space to see if you notice the same muffled audio: https://huggingface.co/spaces/Banaxi-Tech/BananaMind-TTS-Demo
If yes, then it may be because my model is using Griffin-Lim.
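A minimal sketch of the settings check described above, assuming the acoustic model and the vocoder each expose their audio settings as a simple config object. The `AudioConfig` class, its field names, the 0.75 threshold, and the example values are all illustrative assumptions, not taken from the training repo.

```python
from dataclasses import dataclass, fields


@dataclass
class AudioConfig:
    # Hypothetical container; field names are illustrative, not from the repo.
    sample_rate: int
    n_fft: int
    hop_length: int
    win_length: int
    f_max: float


def check_audio_config(cfg: AudioConfig) -> list[str]:
    """Return warnings for settings that look inconsistent with the sample rate."""
    warnings = []
    nyquist = cfg.sample_rate / 2
    if cfg.f_max > nyquist:
        warnings.append(f"f_max {cfg.f_max} exceeds Nyquist {nyquist}")
    if cfg.f_max < 0.75 * nyquist:
        # Heuristic threshold (assumption): a low f_max discards high frequencies.
        warnings.append(f"f_max {cfg.f_max} is well below Nyquist {nyquist}")
    if cfg.win_length > cfg.n_fft:
        warnings.append("win_length is larger than n_fft")
    if cfg.hop_length > cfg.win_length:
        warnings.append("hop_length is larger than win_length, frames do not overlap")
    return warnings


def compare_configs(acoustic: AudioConfig, vocoder: AudioConfig) -> list[str]:
    """List every field where the acoustic model and the vocoder disagree."""
    return [
        f"{f.name}: acoustic={getattr(acoustic, f.name)} vocoder={getattr(vocoder, f.name)}"
        for f in fields(acoustic)
        if getattr(acoustic, f.name) != getattr(vocoder, f.name)
    ]


# Example values only: a 48 kHz setup that still carries f_max = 8000 Hz.
stale = AudioConfig(sample_rate=48000, n_fft=2048, hop_length=512, win_length=2048, f_max=8000.0)
fixed = AudioConfig(sample_rate=48000, n_fft=2048, hop_length=512, win_length=2048, f_max=24000.0)

for line in check_audio_config(stale):
    print("warning:", line)
for line in compare_configs(stale, fixed):
    print("mismatch:", line)
```

Running the same comparison between the acoustic model's config and the vocoder's config is a cheap way to rule out a leftover 22.05 kHz value before looking at the models themselves.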
One thing I would check first: did you change all the audio settings from my original.
Yes, I changed everything and optimized all parameters specifically for 48 kHz.
That may also be already from the original model/training you could try our space to see if you notice that same muffled audio.
Yes, but mine will sound the same with Griffin-Lim. However, I think it may be Vocos’s issue.
I asked an AI, and it said the reduction factor may be the cause, but no: even without it, the generated mels would still be a bit wobbly for Vocos, because it is not used to them.
Have you seen how previous mel-based models solved this? If I remember correctly, they also just used an external vocoder!
Yeah, that makes sense.
I don’t think reduction factor is the main cause either. If the predicted mels are already a bit wobbly/smooth, then Vocos is still getting something different from the real mels it was trained on.
Older mel-based TTS models did use external vocoders, but I think the important part is that the vocoder has to survive predicted mels, not only ground-truth mels. Real-mel reconstruction being good only proves analysis/synthesis works. It does not prove the vocoder is robust to the acoustic model’s errors.
So I would probably try one of these:
- train/finetune Vocos with generated/predicted mels too, not only real mels
- mix real mels and predicted mels during vocoder training
- add noise/jitter/smoothing augmentation to real mels when training Vocos, so it becomes less sensitive
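The mixing and augmentation options above can be sketched as follows. This is only an illustration under assumptions: mels are log-mel arrays of shape `(n_mels, frames)`, the function names and default values are hypothetical, and any predicted mel is assumed to be time-aligned with the target audio (for example, generated with teacher forcing), since the vocoder loss compares against the real waveform.

```python
import numpy as np


def augment_mel(mel: np.ndarray, rng: np.random.Generator,
                noise_std: float = 0.1, smooth_prob: float = 0.5) -> np.ndarray:
    """Perturb a real log-mel so it looks more like an imperfect predicted mel."""
    out = mel + rng.normal(0.0, noise_std, size=mel.shape)  # additive jitter
    if rng.random() < smooth_prob:
        # 3-tap moving average along time mimics over-smoothed predictions.
        padded = np.pad(out, ((0, 0), (1, 1)), mode="edge")
        out = (padded[:, :-2] + padded[:, 1:-1] + padded[:, 2:]) / 3.0
    return out


def pick_vocoder_input(real_mel, predicted_mel, rng, predicted_prob=0.5):
    """Mix sources: sometimes use the predicted mel, otherwise an augmented real mel."""
    if predicted_mel is not None and rng.random() < predicted_prob:
        return predicted_mel
    return augment_mel(real_mel, rng)


rng = np.random.default_rng(0)
real = rng.normal(size=(80, 50))  # random stand-in for a real log-mel
print(pick_vocoder_input(real, None, rng).shape)
```

The noise level and smoothing probability would need tuning by ear; too much perturbation can make the vocoder itself sound dull.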
VITS avoids some of this because it is more end-to-end, so the decoder/vocoder side is trained around the model’s own latent/audio distribution. With a separated mel model + Vocos, the acoustic model can make mels that look close by L1/MSE but still are bad inputs for the vocoder.
So yeah, I would not blame Vocos completely, but I also would not expect a Vocos trained only on real mels to handle predicted mels perfectly.
I'm already training one with HiFi-GAN right now; it will probably be done in 3 days. I already tested the early one, and it already sounds a lot better than Griffin-Lim!
Why HiFi-GAN instead of Vocos? I heard the main benefits of Vocos are:
- It is faster.
- Because it models in the frequency domain, it avoids the metallic buzzing that often occurs in voices with high f0.
- Fewer steps are needed for convergence :)
Yeah, Vocos has those benefits, especially speed and fewer metallic artifacts.
I am not really saying HiFi-GAN is better than Vocos in general. I am trying HiFi-GAN first mostly because it is a very common baseline for mel-based TTS, and I want to check if the muffled sound is really from the acoustic model or from the vocoder setup.
I'm also trying it because it's a simpler option and easier for me to experiment with right now.
If HiFi-GAN sounds much clearer on the same predicted mels, then that tells me Vocos may be more sensitive to my predicted mel distribution. If HiFi-GAN is also muffled, then the problem is probably more in the mel prediction/model side.
So for me it is more of a debugging step than a final decision. Vocos is still interesting, but HiFi-GAN is useful because many older mel-TTS models used GAN vocoders successfully, so it gives me a good comparison point.
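One rough way to put a number on "muffled" when comparing two vocoders on the same predicted mels is the share of spectral energy above some cutoff. The sketch below is an assumption-laden illustration: the `high_band_ratio` helper and the 4 kHz cutoff are hypothetical choices, and the two signals are synthetic stand-ins, not real vocoder outputs.

```python
import numpy as np


def high_band_ratio(audio: np.ndarray, sample_rate: int, cutoff_hz: float = 4000.0) -> float:
    """Fraction of spectral energy at or above cutoff_hz (lower suggests a duller sound)."""
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    total = power.sum()
    return float(power[freqs >= cutoff_hz].sum() / total) if total > 0 else 0.0


# Synthetic stand-ins for two vocoder outputs rendered from the same predicted mel.
sr = 48000
t = np.arange(sr) / sr
bright = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 8000 * t)
dull = np.sin(2 * np.pi * 220 * t) + 0.05 * np.sin(2 * np.pi * 8000 * t)

print("bright:", round(high_band_ratio(bright, sr), 3))
print("dull:  ", round(high_band_ratio(dull, sr), 3))
```

If both vocoders give a similarly low ratio on predicted mels but a normal one on real mels, that points toward the mel prediction side; listening tests should still have the final word.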
Epoch 20 has finished training, and the results already look very good: https://drive.google.com/file/d/1DAtQKa-GJibECBdaH9Av7et6FND-EzKy/view?usp=sharing
If you want, you can try this now. It's the new one with HiFi-GAN, and the training code is also updated!
https://huggingface.co/Banaxi-Tech/BananaMind-TTS-V2
Maybe you could try training with HiFi-GAN too; it seems extremely good right now.