Getting NaN values in stage 2 training
Hi, I trained stage 1 for about 130 epochs. When I moved on to stage 2 training, it started giving NaN loss values right from the beginning. Interestingly, when I took the checkpoint that they have provided and loaded and unloaded the model using the first-stage code, it still gave NaN values, even though it works fine if I load it directly into stage 2 training.
I have experienced this before in a few situations:
- actual model parameters are not being loaded from the checkpoint (there is a naming mismatch involving a "module" prefix between stages 1 and 2 and between distributed vs. non-distributed training; try changing strict loading to true and see what happens with the keys — see the sketch after this list)
- multispeaker is set incorrectly
- certain batch sizes with mixed precision (try changing batch sizes)
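For the first point, here is a minimal sketch for checking the key names (assuming a PyTorch checkpoint laid out the way the StyleTTS2 training scripts usually save it, with per-sub-model state dicts nested under a "net" key; the filename is a placeholder):

```python
import torch

# Inspect a first-stage checkpoint and report which sub-model state dicts carry
# the "module." prefix that DataParallel/DistributedDataParallel adds to
# parameter names. Mismatched prefixes are silently skipped under strict=False,
# so nothing actually loads and training produces NaNs.
ckpt = torch.load("epoch_1st_00130.pth", map_location="cpu")  # placeholder filename
state_dicts = ckpt.get("net", ckpt)  # fall back to the top level if there is no "net" key

for name, sd in state_dicts.items():
    first_key = next(iter(sd))
    prefix = "has 'module.' prefix" if first_key.startswith("module.") else "no 'module.' prefix"
    print(f"{name}: {first_key} ({prefix})")
```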
Can you please share the config file? I'll try to replicate with your parameters.
https://huggingface.co/therealvul/StyleTTS2/blob/main/Multi0/config_40_1c872.yml
This is the config file produced during 2nd-stage SLM adversarial training. However, the strict loading change needs to be made in the code.
Thanks a lot. Do you also have the config file for the first stage?
Unfortunately no
Yeah, I believe that "module" prefix is the problem. I trained over the stage 1 checkpoint you provided and it gave the same error, and as you mentioned, the "module" prefix was missing.
Shouldn't strict loading give an error, since it tries to strictly match the keys while loading? Does this solve the problem, or do I need to add "module" to the keys by brute force?
Because the keys don't match, none of the weights will actually be loaded if you disable strict loading, which results in the NaN calculations. I brute-force added "module" to the keys.
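For reference, the rename can be done roughly like this (a sketch under the same assumptions about checkpoint layout as above; depending on which side expects the prefix, you may need to strip "module." instead of adding it):

```python
from collections import OrderedDict
import torch

ckpt = torch.load("epoch_1st_00130.pth", map_location="cpu")  # placeholder filename
net = ckpt["net"]

# Prepend "module." to every parameter name so the keys match what the
# (distributed) second-stage loading code expects.
ckpt["net"] = {
    name: OrderedDict(
        (k if k.startswith("module.") else "module." + k, v) for k, v in sd.items()
    )
    for name, sd in net.items()
}

torch.save(ckpt, "epoch_1st_00130_module.pth")
```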
Hi, I tried your model trained till the 1st epoch with the inference script in the Jupyter notebook, but it doesn't seem to generate anything, just blank sound. A similar thing is happening with a model I trained too. Any idea?
- What model checkpoint are you referring to specifically?
- From my testing, StyleTTS2 models are very sensitive to the particular config values used during training. max_len must match or it will only generate silence. During second-stage diffusion training, the training will also output a value for sigma_data into the generated config in the log directory, which should be the value used for inference; see the sketch below.
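For example, a minimal sketch of pulling those values from the config that training wrote into the log directory (the path and the exact key layout, e.g. model_params.diffusion.dist.sigma_data, are assumptions based on the config linked above):

```python
import yaml

# Reuse the training-time values at inference instead of hard-coding them.
with open("Models/Multi0/config_40_1c872.yml") as f:  # placeholder path
    config = yaml.safe_load(f)

max_len = config["max_len"]  # must match the value used in training
sigma_data = config["model_params"]["diffusion"]["dist"]["sigma_data"]  # written during 2nd-stage training
print("max_len:", max_len, "| sigma_data:", sigma_data)
```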
I tried epoch_1st_00067.pth
Did you infer using a checkpoint trained only through the first stage? The second-stage checkpoints work quite well. I guess there are probably a few models which are not yet trained in the first-stage checkpoint that are needed for end-to-end generation using just text as input.
epoch_1st is not a 1st-epoch checkpoint; it is the 67th epoch of the 1st stage (decoder and text aligner training only). It would not be able to generate TTS.