Questions regarding fine-tuning ProtGPT2

#35
by xenidisv - opened

Hi, I am a beginner in the field, so I apologize in advance for my naive questions.

I want to fine-tune ProtGPT2 and have adapted the instructions as follows. In a Jupyter notebook, I run these commands:

pip install evaluate
pip install transformers[torch]
pip install accelerate -U
pip install datasets --upgrade
pip install --upgrade protobuf
pip install --upgrade wandb
pip install git+https://github.com/huggingface/transformers.git

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, pipeline
import os
os.environ["WANDB_DISABLED"] = "true"
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

I have created the filtered_training_list.txt and filtered_validation_list.txt files, and then I run:

!python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file filtered_training_list.txt --validation_file filtered_validation_list.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output_dir_test --learning_rate 1e-06 --low_cpu_mem_usage --per_device_train_batch_size 2 --per_device_eval_batch_size 2

When I run this in a Jupyter notebook, my terminal crashes (probably due to CPU/memory usage?). When I ran it in Google Colab, though, it was successful. Searching in Google Drive, I found output_dir_test, but it's an extremely small file (88 bytes). I opened it with a text editor and it said this:

brain.Event:2R.
,tensorboard.summary.writer.event_file_writer (plus 12 other binary characters that cannot be pasted here)

My questions are:

  1. Do you have any suggestions as to what I am doing wrong?
  2. How can I run the fine-tuned model once I have generated it (to generate new sequences as in example 1)?
  3. While searching for how to fine-tune the model, I came across the Train button for Amazon SageMaker. Is there a tutorial I could follow in order to fine-tune ProtGPT2?

Thank you in advance

Kind regards,
Bill

Screenshot_20231110_211950.png

Hi,

  1. I’ve unfortunately never seen this error, but I also never fine-tuned on Google Colab (and your PC most likely crashes due to a lack of memory; do you get any out-of-memory errors?).

  2. Once you fine-tune the model, replace nferruz/ProtGPT2 in your generation command with the path to the new model and run it again (see the sketch after this list).

  3. About Amazon SageMaker: no, I haven’t written such a tutorial, but if you find any tutorial on how to fine-tune GPT-2 or any other autoregressive LLM, that should work for ProtGPT2 too (making sure you input your formatted training and validation txt files).
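As a minimal sketch (assuming the fine-tuned checkpoint ended up in output_dir_test, i.e. the --output_dir used above), generation then looks just like example 1 on the model card, only with the local path instead of nferruz/ProtGPT2:

from transformers import pipeline

# Load the fine-tuned checkpoint from the local output directory
protgpt2 = pipeline("text-generation", model="output_dir_test")

# Sampling settings as in example 1 of the model card
sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=10,
    eos_token_id=0,
)

for seq in sequences:
    print(seq["generated_text"])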

Hi, thank you for your reply.

  1. No, the terminal on my laptop just crashes and closes; it doesn't print anything. Jupyter does the same thing: it freezes.

The Hugging Face learning tutorials mention preprocessing my data and applying the tokenizer to the preprocessed data.

a) Is this required with your model? Are the steps (code pipeline) that I follow in my previous message correct, or am I missing a basic step? I also used --low_cpu_mem_usage; is that correct? Should I edit the run_clm.py file?

b) Also, regarding the training and validation files: I have downloaded peptide sequences (in FASTA format) from a database and, because there are almost 3000 peptides, I have put 80% of them in filtered_training_list.txt and 20% in filtered_validation_list.txt. Moreover, I have replaced the headers, and the generated txt files look like this (a rough conversion sketch follows after these examples). Is this correct?
<|endoftext|>
LR..... etc

<|endoftext|>
RS..... etc

Each peptide sequence is 59 amino acids long and then it breaks onto a new line, e.g.:

<|endoftext|>
RS..... etc
LR.....
SR....

<|endoftext|>
LRSFJT..... etc
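To make b) clearer, here is a rough sketch of the kind of conversion I describe above (peptides.fasta is just a placeholder name for the database download; the 80/20 split point matches what I described):

import random

# Sketch: replace each FASTA header with <|endoftext|>, keep the sequence
# lines as they are, and split the records 80/20 into train/validation.
records = []
current = []
with open("peptides.fasta") as fasta:          # placeholder input file name
    for line in fasta:
        line = line.strip()
        if line.startswith(">"):               # FASTA header starts a new record
            if current:
                records.append(current)
            current = ["<|endoftext|>"]        # header replaced by the special token
        elif line:
            current.append(line)               # sequence line, kept as-is
    if current:
        records.append(current)

random.shuffle(records)
split = int(0.8 * len(records))

def write_records(path, recs):
    with open(path, "w") as out:
        for rec in recs:
            out.write("\n".join(rec) + "\n\n") # blank line between records, as shown above

write_records("filtered_training_list.txt", records[:split])
write_records("filtered_validation_list.txt", records[split:])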

c) Last but not least, inside the generated folder (output_dir) there is a folder called runs, and inside this directory there is only the previously mentioned events.out... file. For a correctly fine-tuned model, what are the expected directories and files (what should they look like, and how big are they)?

Thank you in advance
Kind regards
Bill
