Hanging after finishing training on T4 medium in HuggingFace Spaces

#3
by Ep0ch - opened

I am attempting to run multi-concept training (2 persons, specifically) and was finally able to get training to work (CompVis v1.4, 512 pixels, on T4 medium in a private, duplicated Space on HuggingFace) after doing all the things that lower memory usage. Previously I was running out of memory. After 1500 training steps, this is the final output of the container logs. I guess I was expecting a "Training done" confirmation output?

Steps: 100%|██████████| 1500/1500 [58:16<00:00, 2.34s/it, loss=0.669, lr=1e-5]

Fetching 16 files: 0%| | 0/16 [00:00<?, ?it/s]
Fetching 16 files: 100%|██████████| 16/16 [00:00<00:00, 22236.20it/s]

Steps: 100%|██████████| 1500/1500 [58:25<00:00, 2.34s/it, loss=0.366, lr=1e-5]

Fetching 16 files: 0%| | 0/16 [00:00<?, ?it/s]
Fetching 16 files: 100%|██████████| 16/16 [00:00<00:00, 23045.63it/s]

Steps: 100%|██████████| 1500/1500 [58:46<00:00, 2.35s/it, loss=0.366, lr=1e-5]

Nothing showed up under "Trained Weight Files" on the "Train" tab either. The newly trained model also did not show up in the Test section, even after ~15 minutes of waiting, and hitting the "Reload Weight List" button did not make it appear.

I'm going to retry with the exact same settings on an A10 small. Hopefully it doesn't hang right at the end again.

Edit: Training went 3x faster on the A10 small, but it still hung right at the end, with the same exact output that never proceeds. I'm going to try training again without hitting the "Check Training Status" button, which seemingly does nothing.

Edit2: Nope, I didn't touch anything and just watched the Logs the entire time. It always hangs at that last print statement and doesn't do anything after that. I'm going to try single-concept training.

Edit3: I tried single-concept training, using the suggested token. It trained for 1000 steps (in 10 minutes), and then I waited more than 20 minutes afterwards, and nothing happened. It always hangs at the print statement that I posted above. This Space does not seem to be in a functioning state at the moment.

Hi,
Thanks a lot for trying our demo. I have attached below a fast-forward video of a training run in another private duplicated Space with an A10G.
I agree that there is an issue with "Trained Weight Files" on the "Train" tab, and I am looking into it. Thanks for bringing it up.
Apart from that, I am able to train, run inference, and upload the weight files without any issue.

Please let me know if you are able to train successfully now.

@nupurkmr9 mine were running over 17k seconds and 0 messages etc

The problem must be happening in the "Reload Weight List" button.

I re-duplicated the Custom-Diffusion Space to a private Space and loaded my multi-concept setup. It got to the end of training (1500 steps). After hitting the "Reload Weight List" button, there is only ever one option under "Custom Diffusion Weight File": custom-diffusion-models/cat.bin. I see that the repo contains around 10 default ones, but even those aren't loading or available to select. I'm actually unable to run any sort of inference from the Test tab; there's always some error. (Maybe you must do some training before doing any testing?)

However, I can go to the Upload tab and type in a name for the model (and my HuggingFace key), and it does upload something to my HuggingFace profile, including a delta.bin file. This must be the fully trained model, because the upload contains the training data for each concept as well as the checkpoints at 500, 1000, and 1500.

If I'm using AUTOMATIC1111's WebUI locally, is there a tutorial/guide somewhere that explains how to download the model from HuggingFace and then use it in WebUI? That's very
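
For the download half of that question, here is a minimal sketch that pulls a single file from a Hub model repo with `hf_hub_download` from the `huggingface_hub` library. It assumes the uploaded `delta.bin` sits at the top level of the repo (the thread does not say where it lands), and `download_weights` and its `fetch` parameter are hypothetical names used only for illustration. It does not make the weights loadable in AUTOMATIC1111's WebUI; it only gets the file onto disk.

```python
from typing import Callable, Optional


def download_weights(
    repo_id: str,
    filename: str = "delta.bin",  # assumption: file is at the repo root
    token: Optional[str] = None,
    fetch: Optional[Callable[..., str]] = None,
) -> str:
    """Download one file from a Hub model repo and return its local path.

    `fetch` exists only so the function can be exercised without network
    access; by default it falls back to huggingface_hub.hf_hub_download.
    """
    if fetch is None:
        # Imported lazily so this module loads even if the library is absent.
        from huggingface_hub import hf_hub_download

        fetch = hf_hub_download
    # A token is needed for private repos; public repos work with token=None.
    return fetch(repo_id=repo_id, filename=filename, token=token)
```

A call would look like `download_weights("your-username/your-model", token="hf_...")`, where the repo id is whatever name was typed into the Upload tab.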

Hi @Ep0ch and @MonsterMMORPG ,
I changed some of the default settings, e.g. batch size from 2 to 1 and a reduced number of steps, which might have been causing issues during inference. I am able to run inference with the default models without doing any training as well, as shown in the video below.
Also, the input concept prompt is expected to be a full sentence during training, e.g. "photo of a <new1> cat".
Regarding using custom-diffusion models with AUTOMATIC1111's WebUI, I will soon update the demo and our GitHub code with an option to convert checkpoints so that they can be used there.

Thanks for testing out the demo!!
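
As a rough illustration of the prompt format mentioned above, here is a small pre-flight check that could be run before starting a long training job. This is only a sketch and not part of the demo: the function name `check_concept_prompt` and its rules are hypothetical heuristics, and the `<newN>` pattern generalizes from the single `<new1>` example in this thread.

```python
import re
from typing import List

# Assumption: modifier tokens look like <new1>, <new2>, ...
# Only <new1> is actually shown in the thread.
TOKEN_PATTERN = re.compile(r"<new\d+>")


def check_concept_prompt(prompt: str) -> List[str]:
    """Return a list of problems; an empty list means the prompt looks fine."""
    problems = []
    tokens = TOKEN_PATTERN.findall(prompt)
    words = prompt.split()
    if not tokens:
        problems.append("no modifier token such as <new1> found")
    if len(words) < 3:
        # A bare token or single word is not a full sentence-style prompt.
        problems.append("prompt should be a full phrase, e.g. 'photo of a <new1> cat'")
    elif tokens and words[-1] in tokens:
        # The token should modify a class noun that comes after it.
        problems.append("modifier token should be followed by a class noun")
    return problems


print(check_concept_prompt("photo of a <new1> cat"))
print(check_concept_prompt("<new1>"))
```

The first call reports no problems, while the bare token is flagged as not being a full phrase.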

awesome

If you can make the demo work with AUTOMATIC1111, I would like to cover it in a tutorial video on my channel, hopefully: https://www.youtube.com/@SECourses
