Example of `training.txt` and `validation.txt` for fine tuning ProtGPT2


@nferruz Hi Noelia,

Thank you for your great work. I sincerely believe it will make a great contribution to protein biology.

I would like to try fine-tuning your ProtGPT2.
In particular, this command is taken from your model card:

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06

Can you give an example of what the content of `training.txt` and `validation.txt` looks like?

Let's say I have this FASTA file that I want to turn into training.txt.

>myseq1
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
>myseq2
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG

Should I format it this way:

<|endoftext|>
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG

Or this way?

<|endoftext|>
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
<|endoftext|>
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG

Thanks, and I hope to hear from you again.

Sincerely,
Littleworth.


Hi Littleworth,

It should be like in your second example. Please bear in mind that it must follow the FASTA format exactly, with a newline character every 60 amino acids:

Like this:

<|endoftext|>
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGEL
FVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHYGCGGVQAAVENPELGLINNWL
LHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTIMQSAWKRGQKVTIHGWA
YGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK
<|endoftext|>
#ANOTHER SEQUENCE

But not like this:

<|endoftext|>
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGELFVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHYGCGGVQAAVENPELGLINNWLLHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTIMQSAWKRGQKVTIHGWAYGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK
<|endoftext|>
#ANOTHER SEQUENCE
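
If it helps, here is a minimal Python sketch that converts a FASTA file into this layout (untested, standard library only; the filenames are just placeholders):

```python
def wrap(seq, width=60):
    """Re-wrap a sequence FASTA-style, `width` residues per line."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))

def fasta_to_protgpt2(fasta_path, out_path):
    """Write each FASTA record as an <|endoftext|> line followed by
    the sequence wrapped at 60 amino acids per line."""
    sequences, current = [], []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):   # header line starts a new record
                if current:
                    sequences.append("".join(current))
                    current = []
            elif line:                 # accumulate sequence lines
                current.append(line)
    if current:                        # flush the last record
        sequences.append("".join(current))

    with open(out_path, "w") as out:
        for seq in sequences:
            out.write("<|endoftext|>\n")
            out.write(wrap(seq) + "\n")

fasta_to_protgpt2("myseqs.fasta", "training.txt")
```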

Thank you for using ProtGPT2 and posting!
Noelia

@nferruz Thank you so much!

Dear @nferruz ,

Even though my train_file is in the format below, I get train_samples = 1 in the output. It should be 2 instead of 1 for this example.

<|endoftext|>
ETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAG
QEQLGRRIHYSQNDLVEYSPVTEKHLTDG
<|endoftext|>
ETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAG
QEQLGRRIHYSQNDLVEYSPVTEKHLTAG

Why is the input file not being read correctly? Do you have any idea how to resolve this?

Hello!

I believe you mean during training? In that case, the number of samples is the number of 512-token groups that are passed in batches to the model. With those two sequences you are below 512 tokens, hence you don't get more than one sample.
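
You can verify the token count yourself with a quick sketch like this (it assumes the transformers library is installed; the file name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")

with open("training.txt") as fh:
    n_tokens = len(tokenizer(fh.read())["input_ids"])

# run_clm.py concatenates all tokenized text and splits it into
# groups of 512 tokens, so two short sequences end up in one sample.
print(n_tokens)
```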

Hope this helps
Noelia
