Unable to finetune ProtGPT2 model

#11
by Sirisha - opened

Hello,

Thanks for the amazing work on ProtGPT2. I am relatively new to transformers and I am trying to get this package working to generate new Cas protein sequences. I followed the instructions on the Hugging Face page and got the model working up to the step that generates "de novo proteins in a zero-shot fashion", but after this step I am unable to follow the remaining instructions. My aim is to finetune ProtGPT2 on my own dataset of 886 sequences and generate new Cas protein sequences.

As per the instructions, I substituted the beginning of each FASTA entry with "<|endoftext|>" and generated the train, test and validation sets. But I am unable to get this command working:
"python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06"

What would be the tokenizer name here?
What do I put in place of --do_train and --do_eval? Sorry for these naive questions.

Sirisha

Hi Sirisha,

Thanks a lot for writing!
What is the error you are getting when you use that command? The flags --do_train and --do_eval do not take any parameters after them; they simply tell the script to run training and evaluation. If you paste your error here, I'll be able to help you. The command as you wrote it should work fine.

best & thanks
Noelia

Hi Noelia,

Thanks for your reply. Here is what I did:

In my Google Drive I created a directory into which I cloned the ProtGPT2 GitHub repository linked on the Hugging Face page. My own dataset and "run_clm.py" are in the same directory. Then, in Colab, I ran the following commands:

  1. pip install git+https://github.com/huggingface/transformers.git
  2. from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  3. from transformers import pipeline
    protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
    sequences = protgpt2("<|endoftext|>", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)
    for seq in sequences:
        print(seq)
    The 3rd command gave me new sequences starting with "M", so this is good.
    Then I substituted the beginning of each FASTA entry of my dataset (a CSV file with three columns, the FASTA sequence being the second column) with "<|endoftext|>".
    Then I ran:
  4. from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
    model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

Then I loaded my dataset, generated the train, test and validation sets, and ran:
5. python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06
I got:
File "", line 1
python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2
^
SyntaxError: invalid syntax
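[Editor's note: this SyntaxError typically means the command was entered into a Python cell, so the Python interpreter tried to parse a shell command. In Colab, shell commands need a leading "!" (or a %%bash cell) to be passed to the shell. A sketch, assuming run_clm.py and the data files sit in the current working directory:]

```shell
# Colab cell: the leading "!" sends the line to the shell,
# not to the Python interpreter.
!python run_clm.py --model_name_or_path nferruz/ProtGPT2 \
    --train_file training.txt --validation_file validation.txt \
    --tokenizer_name nferruz/ProtGPT2 \
    --do_train --do_eval --output_dir output --learning_rate 1e-06
```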

Then I changed the command as follows:
    python /content/drive/MyDrive/Prot-GPT2_master/ProtGPT2/run_clm.py --nferruz/ProtGPT2 --X_train --X_val-- /content/drive/MyDrive/Prot-GPT2_master/ProtGPT2/tokenizer.json nferruz/ProtGPT2 --do_train --do_eval --/content/drive/MyDrive/Prot-GPT2_master/ProtGPT2/output --learning_rate 1e-06

But the error persists.

Sirisha changed discussion status to closed
Sirisha changed discussion status to open

Hi Sirisha,

In the second command that you posted, try giving --tokenizer_name the path of a folder, not a file: '/content/drive/MyDrive/Prot-GPT2_master/ProtGPT2/' instead. Also note that the flag names themselves (--model_name_or_path, --train_file, --validation_file, --tokenizer_name, --output_dir) must stay in the command; only the values after them change.

Hope this helps.
