Resources required for fine-tuning ProtGPT2

#36
by ChetnaA

I'm in the process of fine-tuning the ProtGPT2 model using a set of training examples I've collected. I have about 500k sequences and am thinking of starting with a small subset of these, say 2k sequences. I have access to two NVIDIA A100 GPUs, each with 40 GB of memory. Will this be enough for training? Will I need any memory-saving tricks? Also, can someone give me an idea of how long the fine-tuning will take for different numbers of sequences (between 2k and 500k)?
Thanks!

Hi,

Your resources sound like more than enough. With 2k sequences you wouldn't need more than one hour. With 500k it will take longer, but using 2 A100s I don't imagine it would take you more than a day to run several epochs (though I'm doing very simple maths here). If you want to save some VRAM you could use DeepSpeed. In fact, I recommend it; it's super easy to use!
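For reference, a minimal sketch of what that could look like with the Hugging Face run_clm.py script. The ds_config.json name, the placeholder --train_file/--validation_file paths, and the ZeRO stage 2 settings below are illustrative starting points, not a tested recipe:

pip install deepspeed

# Minimal DeepSpeed ZeRO stage 2 config; the "auto" values are filled in
# from the Trainer arguments by the Hugging Face integration
cat > ds_config.json << EOF
{
  "zero_optimization": { "stage": 2 },
  "fp16": { "enabled": "auto" },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF

# Launch run_clm.py through the DeepSpeed launcher (here: 2 GPUs)
deepspeed --num_gpus 2 run_clm.py \
  --deepspeed ds_config.json \
  --model_name_or_path nferruz/ProtGPT2 \
  --tokenizer_name nferruz/ProtGPT2 \
  --train_file train_sequences.txt \
  --validation_file val_sequences.txt \
  --do_train --do_eval \
  --output_dir output \
  --fp16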

Hi,
Thanks for the quick response. I'll definitely look into the DeepSpeed library! One more question: will V100 GPUs also work for this fine-tuning task?

I haven't tried them myself, but they should work, absolutely.

Great, thanks! I'll try them out!

Hi @nferruz ,

Are there any back-of-the-envelope calculations for estimating how much memory a job will require?

I'm asking because fine-tuning the model has required a lot more memory than we expected.

We are fine-tuning with around 2250 training sequences (~300 AA long) and ran into OOM errors on the following machines:

  • 4 V100s (16 GB VRAM each)
  • A100 (40 GB VRAM)

The job ran (in about 5 minutes) on these machines:

  • A100 (80 GB VRAM)
  • H100 (80 GB VRAM)

For the 80 GB VRAM GPUs, utilization was close to 100%.

Details:

We followed the instructions provided on the Hugging Face model page. In essence, this is our environment setup:

NAME=protgpt
conda deactivate
conda deactivate
conda deactivate
conda deactivate
conda env remove --name $NAME
conda create -y -n $NAME python=3.10
conda activate $NAME

pip install git+https://github.com/huggingface/transformers

# Reqs for running `run_clm.py`
pip install -r <(cat << EOF
accelerate >= 0.12.0
torch >= 1.3
datasets >= 2.14.0
sentencepiece != 0.1.92
protobuf
evaluate
scikit-learn
EOF
)

conda install pytorch cudatoolkit -c pytorch -c nvidia -y

# Note: use the raw URL; the GitHub blob URL returns an HTML page instead of the script
wget https://raw.githubusercontent.com/huggingface/transformers/main/examples/pytorch/language-modeling/run_clm.py

Our run command is:

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file dck_sequences_train.txt --validation_file dck_sequences_val.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06

We haven't yet experimented with DeepSpeed, as you recommend above. If you have a working example, that would be much appreciated!

And if you would like to experiment with the training/validation data yourself, I've sent them in an email.

Thanks again for your time.

Cheers,
Evan

What batch size does the script use by default? I'd decrease it considerably; you should be able to fine-tune the model on all of these cards (or most of them).

We've experimented with the batch size. With the setup above, the default batch size is 8. On 24 GB of VRAM, a batch size of 2 is possible, but 3 is not. Is this in line with your expectations?

The workaround we've opted for is a batch size of 1 (--per_device_train_batch_size 1). Since a batch size of 1 leads to noisy updates that can hurt generalization, we've simulated a batch size of 16 with --gradient_accumulation_steps 16.
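For completeness, the full command with that workaround looks roughly like this (same script and data files as above):

python run_clm.py \
  --model_name_or_path nferruz/ProtGPT2 \
  --tokenizer_name nferruz/ProtGPT2 \
  --train_file dck_sequences_train.txt \
  --validation_file dck_sequences_val.txt \
  --do_train --do_eval \
  --output_dir output \
  --learning_rate 1e-06 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16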

Hi, yes, that sounds reasonable depending on your GPU. Gradient accumulation also sounds fine if you have no other option; you pay the price of lower performance to overcome the memory problem: https://discuss.huggingface.co/t/batch-size-vs-gradient-accumulation/5260/5.
But in all honesty, we use it quite frequently as well.
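As a very rough back-of-the-envelope answer to the earlier memory question: ProtGPT2 uses a GPT2-large-sized architecture (roughly 740M parameters), and plain fp32 fine-tuning with Adam needs about 16 bytes per parameter for the model states alone, before any activations:

weights (fp32):          ~740M x 4 bytes  ≈  3.0 GB
gradients (fp32):        ~740M x 4 bytes  ≈  3.0 GB
Adam states (m and v):   ~740M x 8 bytes  ≈  5.9 GB
---------------------------------------------------
model states total                        ≈ 11.9 GB
+ activations (grow with batch size and sequence length), CUDA context, fragmentation

That is why a 16 GB card is already tight before the first batch, and why smaller batch sizes, gradient accumulation, mixed precision, or ZeRO sharding with DeepSpeed make the difference.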
