Unable to finetune ProtGPT2 model

#11
by Sirisha - opened

Hello,

Thanks for the amazing work on ProtGPT2. I am relatively new to transformers and I am trying to get this package working to generate new Cas protein sequences. I followed the instructions on the Hugging Face page and got the model working up to the step that generates "de novo proteins in a zero-shot fashion", but after this step I am unable to follow the remaining instructions. My aim is to finetune ProtGPT2 on my own dataset of 886 sequences and generate new Cas protein sequences.

As per the instructions, I substituted the beginning of each FASTA entry with "<|endoftext|>" and generated the train, test and validation sets. But I am unable to get this command working:
"python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06"

What would be the tokenizer name here?
What do I put in place of --do_train and --do_eval? Sorry for these naive questions.

Sirisha

Hi Sirisha,

Thanks a lot for writing!
What is the error you are getting when you use that command? The flags --do_train and --do_eval do not take any parameters after them; they simply tell the script to run training and evaluation. If you paste your error here, I'll be able to help you. The command as you wrote it should work fine.

best & thanks
Noelia

Hi Noelia,

Thanks for your reply. Here is what I did:

In my Google Drive I created a directory into which I cloned the ProtGPT2 GitHub repository linked on the Hugging Face page. My own dataset and "run_clm.py" are in the same directory. Then, in Colab, I ran the following commands:

  1. pip install git+https://github.com/huggingface/transformers.git
  2. from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  3. from transformers import pipeline
    protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
    sequences = protgpt2("<|endoftext|>", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)
    for seq in sequences:
        print(seq)
    The 3rd command gave me new sequences starting with "M", so this is good.
    Then I substituted the beginning of each FASTA entry of my dataset (a CSV file with three columns, the FASTA sequence being the second column) with "<|endoftext|>".
    Then I ran:
  4. from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
    model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

Then I loaded my dataset, generated the train, test and validation sets, and ran:
5. python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06
I got:
File "", line 1
python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2
^
SyntaxError: invalid syntax
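[Editor's note: this SyntaxError typically means the command was entered into a Python cell, so the Python interpreter tried to parse a shell command. In Colab, shell commands need a leading "!" (or a %%bash cell) to be passed to the shell. A sketch, assuming run_clm.py and the data files sit in the current working directory:]

```shell
# Colab cell: the leading "!" sends the line to the shell,
# not to the Python interpreter.
!python run_clm.py --model_name_or_path nferruz/ProtGPT2 \
    --train_file training.txt --validation_file validation.txt \
    --tokenizer_name nferruz/ProtGPT2 \
    --do_train --do_eval --output_dir output --learning_rate 1e-06
```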

Then I changed the command as follows:
    python /content/drive/MyDrive/Prot-GPT2_master/ProtGPT2/run_clm.py --nferruz/ProtGPT2 --X_train --X_val-- /content/drive/MyDrive/Prot-GPT2_master/ProtGPT2/tokenizer.json nferruz/ProtGPT2 --do_train --do_eval --/content/drive/MyDrive/Prot-GPT2_master/ProtGPT2/output --learning_rate 1e-06

But the error persists.

Sirisha changed discussion status to closed
Sirisha changed discussion status to open

Hi Sirisha,

In the second command that you posted, try giving --tokenizer_name the path of a folder, not a file: '/content/drive/MyDrive/Prot-GPT2_master/ProtGPT2/' instead. Also note that the flag names themselves (--model_name_or_path, --train_file, --validation_file, --tokenizer_name, --output_dir) must stay in the command; only the values after them change.

Hope this helps.
