Fine-tuning ProLLaMA on custom dataset

#1
by mrshah24 - opened

Hi, thank you for this wonderful work on protein language modeling! I have a CSV dataset consisting only of non-hemolytic protein sequences, as shown in the picture, and I want to use it to fine-tune ProLLaMA so that it generates only non-hemolytic proteins. Could you describe the steps I should take to do that? How should I add the conditions? Thank you!

image.png

Thanks for your interest! Suppose 'xxx' denotes one of your sequences. You could convert each record of your raw data into:

[Generate non-hemolytic protein] Seq=<xxx>

You can then train ProLLaMA on the processed dataset using an existing code base (such as the HF Trainer).
I am also planning to open-source the training code on my GitHub.
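
The conversion described above can be sketched in a few lines of Python. This is only a minimal sketch, not the authors' preprocessing script: the column name `sequence` and the helper names `format_record` and `convert` are assumptions made for illustration, so adjust them to match your CSV.

```python
import csv
import io

# Instruction prefix taken from the template in the reply above.
INSTRUCTION = "[Generate non-hemolytic protein]"


def format_record(sequence):
    """Wrap one raw sequence in the instruction template."""
    return f"{INSTRUCTION} Seq=<{sequence.strip()}>"


def convert(csv_file, column="sequence"):
    """Read sequences from an open CSV file and return formatted lines.

    `column` is a hypothetical header name; rows with an empty value are skipped.
    """
    reader = csv.DictReader(csv_file)
    records = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            records.append(format_record(value))
    return records


# Small in-memory example standing in for a real CSV file.
example_csv = io.StringIO("sequence\nMKTAY\nGLFDI\n")
for line in convert(example_csv):
    print(line)
```

For a real file, pass `open("your_file.csv", newline="")` instead of the in-memory example and write the returned lines to a text file, one record per line.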

Thank you for answering my query. I will try this approach, and I am looking forward to the training code as well! Could you also tell me how long it took to train ProLLaMA, for both continual training and instruction fine-tuning, and how many GPUs were used? That would help my research as well.

Sure. Stage 1 (continual training) took about 6 days on 8 A6000 GPUs. Stage 2 (instruction fine-tuning) took 5 days on 8 A6000 GPUs. FlashAttention-2 and DeepSpeed can speed up the training.
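
For budgeting on different hardware, the figures above can be converted to GPU-hours. This is a rough back-of-the-envelope sketch: it assumes the GPUs ran around the clock for the stated number of days, and `gpu_hours` is a helper name introduced here for illustration.

```python
def gpu_hours(days, num_gpus):
    """Convert wall-clock days on `num_gpus` GPUs into GPU-hours."""
    return days * 24 * num_gpus


stage1 = gpu_hours(6, 8)  # about 6 days on 8 A6000 GPUs
stage2 = gpu_hours(5, 8)  # 5 days on 8 A6000 GPUs

print(stage1, stage2, stage1 + stage2)  # 1152 960 2112
```

Fine-tuning on a small custom dataset should need far less than this, since these numbers refer to the original training runs.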

Got it, thank you!

Hello, we have released our training code here.
