OOM During Finetuning on A100 GPU

#47
by jashton1315 - opened

Hi Noelia,

Thank you very much for sharing this tool! I'm looking forward to applying it to my own projects.

I'm trying to finetune your model on a set of ~2000 sequences. However, even on a 40 GB A100 GPU I get an OOM error with a batch size of 1.

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file asyn_train.txt --validation_file asyn_val.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir finetune --learning_rate 1e-06 --per_device_train_batch_size 1 --low_cpu_mem_usage True

RuntimeError: CUDA error: an illegal memory access was encountered

Any help you can provide would be greatly appreciated. Thanks again!
Jonathan

Hi Noelia,

I was actually able to fix this issue. For those using an HPC to run finetuning, increasing --ntasks resolved the OOM error.
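
In case it's useful to others hitting the same thing: below is a minimal sketch of the kind of Slurm submission script I'm referring to, not my exact setup. The job name, GPU request, memory request, and the particular --ntasks value are placeholders to adapt to your own cluster.

#!/bin/bash
#SBATCH --job-name=protgpt2-finetune   # placeholder job name
#SBATCH --gres=gpu:1                   # request one GPU (a 40 GB A100 in my case)
#SBATCH --ntasks=4                     # increased from the default of 1; pick a value suited to your node
#SBATCH --mem=64G                      # placeholder host-memory request

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file asyn_train.txt --validation_file asyn_val.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir finetune --learning_rate 1e-06 --per_device_train_batch_size 1 --low_cpu_mem_usage True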

Thanks!
Jonathan

jashton1315 changed discussion status to closed
Owner

Fantastic, happy to hear!
