Need help to debug my training process

#37

by arthurcornelio88 - opened Jul 8, 2024

Discussion

arthurcornelio88

Jul 8, 2024

•

edited Jul 9, 2024

Hello fellows,

My friend and me, we're fine-tuning the model with our dataset. The task is very heavy for our PCs, then we passed to SageMaker. Then, we have some questions :

Firstly, I would like to know if it's normal to take 5h to train it in a ml.g5.24xlarge instance, mainly because, for testing, we're using a very small dataset (ten audio files).
Is it necessary to have all the demo files ? How could we understand better the params from demo_cfg ?
Is there any step that we did that is not necessary - and, maybe, is causing the heavy computations ? Batch sizes, gpus, cuda stuff, etc.

We're attaching all the process of our training, to help the collective debugging.

a) the model archi

In jupter notebooks:

b) first imports

c) model loading

d) Cuda import and training prompt

OUR TRAINING PROMPT :

!python stable-audio-tools-sam/train.py --model-config stable_open_model_files/model_config.json --dataset-config stable_open_model_files/dataset_config.json --name rayan-training --save-dir checkpoints --pretrained-ckpt-path stable_open_model_files/model.safetensors --batch-size 16 --num-gpus 4 --strategy deepspeed

Outputs:

e) Models loaded

f) Some warnings and cuda loading

g) Training in action

h) After 5h without conclusion, our keyboard interruption...

We can, eventually, put the logs from sagemaker here too.

Thanks in advance !

PsiPi

Oct 7, 2024

•

edited Oct 7, 2024

hours? hmm with 4 gpus... I heard they put something like 16000 gpu hours just into the vae.

https://github.com/yukara-ikemiya/friendly-stable-audio-tools
give that an eyball maybe.

for what its worth your code seemed okay.

after 10k steps it should drop out a ckpt and keep the best 2 of those models (IIRC) and do that every 10k

PsiPi

Oct 7, 2024

After 5h without conclusion, our keyboard interruption...

it will NEVER conclude.

the code says maximum epochs 100000 (or something insane). You are supposed to stop it AFTER a 10k moment (like 10k or 100k or 200k however many steps you want)

when it has JUST spat out a ckpt is my preferred time.

If you exit early

Define the path where you want to save the model

model_save_path = os.path.join(OUTPUT_DIR, 'final_model_checkpoint.ckpt')

Save the model using the trainer

trainer.save_checkpoint(model_save_path)

print("Model saved successfully at:", model_save_path)

I believe that can - in a pinch - rip out the traned model so far.
But better off doing what I said before. (since this is just some last ditch code I made up one time)

arthurcornelio88

Oct 9, 2024

Hello,

Effectively, I put a lower number for max_epochs and I'm getting the model after a keyboard interruption, your solution seems like mine's, thanks for confirming me my insight !

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment