Need help to debug my training process

#37
by arthurcornelio88 - opened

Hello everyone,

My friend and I are fine-tuning the model on our dataset. The task is too heavy for our PCs, so we moved to SageMaker. We now have a few questions:

  1. First, is it normal for training to take 5 hours on an ml.g5.24xlarge instance? For testing we're using a very small dataset (ten audio files), so this feels long.
  2. Is it necessary to have all the demo files? How can we better understand the parameters in demo_cfg?
  3. Is there any step we did that isn't necessary and that might be causing the heavy computation? Batch sizes, GPUs, CUDA settings, etc. (see the batch-size sketch just after this list).
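
On point 3, here is a quick back-of-the-envelope check we did (assuming --batch-size in train.py is per GPU, which we are not sure about):

# 10 audio files in the test dataset
# our training command uses --batch-size 16 and --num-gpus 4
per_gpu_batch = 16
num_gpus = 4
effective_batch = per_gpu_batch * num_gpus  # 64 samples per optimizer step
# 64 > 10 files, so either the dataloader repeats/crops the files into many
# samples, or most of the batch is duplication - we are not sure which.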

We're attaching our whole training process below, to help with the collective debugging.

a) the model architecture

archi-sagemaker.png

In Jupyter notebooks:

b) first imports

jupyter1-imports.png

c) model loading

jupyter2-modelloading.png
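
In short, the model-loading cell boils down to something like this (a minimal sketch; the actual cell is in the screenshot above, and the file paths are the ones from our training command):

import json
from safetensors.torch import load_file

# Model config shipped with the Stable Audio Open files
with open("stable_open_model_files/model_config.json") as f:
    model_config = json.load(f)

# Pretrained weights as a plain state dict
state_dict = load_file("stable_open_model_files/model.safetensors")

# train.py builds the model itself from --model-config and
# --pretrained-ckpt-path, so here we only sanity-check that both load.
print(list(model_config.keys()))
print(len(state_dict), "tensors in the checkpoint")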

d) CUDA import and training command

jupyter3-cuda et prompt.png
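
The CUDA cell is essentially the standard availability check (a minimal sketch; the exact cell is in the screenshot):

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))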

OUR TRAINING COMMAND:

!python stable-audio-tools-sam/train.py --model-config stable_open_model_files/model_config.json --dataset-config stable_open_model_files/dataset_config.json --name rayan-training --save-dir checkpoints --pretrained-ckpt-path stable_open_model_files/model.safetensors --batch-size 16 --num-gpus 4 --strategy deepspeed
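
For the ten-file test we are also considering a reduced variant of the same command, with a smaller batch size, a single GPU, and without the DeepSpeed strategy (we don't know yet whether this is the right fix):

!python stable-audio-tools-sam/train.py --model-config stable_open_model_files/model_config.json --dataset-config stable_open_model_files/dataset_config.json --name rayan-training --save-dir checkpoints --pretrained-ckpt-path stable_open_model_files/model.safetensors --batch-size 2 --num-gpus 1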

Outputs:

e) Models loaded

jupyter4-output1.png

f) Some warnings and CUDA loading

jupyter4-output2.png

g) Training in action

jupyter4-output3.png

jupyter4-output4.png

jupyter4-output5.png

h) After 5 hours with no sign of finishing, we interrupted the run (KeyboardInterrupt)...

jupyter4-output6.png

We can also post the SageMaker logs here if that would help.

Thanks in advance!
