Minimal number of sequences for fine-tuning

#2
by ShanGao - opened

Dear Authors,

Thanks for your excellent work! I plan to fine-tune the model. Could you please advise on the minimum number of protein sequences needed for fine-tuning? Thanks a lot!

Hi Yuanji Zhang,

Thanks for your interest! I am afraid I do not have a rule of thumb yet. I tried to fine-tune a model with 500 sequences, and it did not work very well. However, I know someone who fine-tuned on 900 sequences: the training curves looked fine and the model gave the expected results. So I guess you will have to try :). I am happy to assist if you need any help!

Noelia

Hi Noelia,

I tested the example code from Zenodo as follows:

from transformers import pipeline

protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
sequences = protgpt2("M", max_length=100, min_length=80, ...)

The generated protein sequences are actually between 239 and 298 residues long. Do "max_length" and "min_length" refer to the number of tokens rather than the number of amino acids?

Yes, min_length and max_length correspond to the number of tokens.
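For anyone who wants to target a residue length rather than a token count, here is a rough sketch of the arithmetic. It only uses the numbers reported above (80 to 100 tokens giving 239 to 298 residues), which work out to roughly 3 residues per token. The helper names `residue_count` and `token_bounds` are hypothetical and not part of ProtGPT2 or transformers. The ratio is an estimate from this one run and may well differ for other prompts or sampling settings.

```python
import math
from typing import Tuple


def residue_count(generated_text: str) -> int:
    """Count characters after removing all whitespace (e.g. line breaks).

    Assumes the text contains only residue letters and whitespace.
    """
    return len("".join(generated_text.split()))


def token_bounds(min_residues: int, max_residues: int,
                 residues_per_token: float) -> Tuple[int, int]:
    """Convert a desired residue range into approximate
    (min_length, max_length) values expressed in tokens."""
    if residues_per_token <= 0:
        raise ValueError("residues_per_token must be positive")
    return (math.ceil(min_residues / residues_per_token),
            math.floor(max_residues / residues_per_token))


# Ratio implied by the numbers in this thread (an estimate, not a constant):
# 239 residues at 80 tokens, 298 residues at 100 tokens.
ratio = (239 / 80 + 298 / 100) / 2

# Token limits to request for sequences of roughly 240 to 300 residues,
# assuming about 3 residues per token.
bounds = token_bounds(240, 300, 3.0)
print(round(ratio, 2), bounds)
```

The resulting pair could then be passed as min_length and max_length in the generation call, checking the real lengths afterwards with `residue_count`, since individual tokens cover a varying number of residues.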
