Need Help on Dataset preparation for Continual Learning

by surajk02

I'm trying an approach similar to ProLLaMA, where you implemented a two-step pipeline for training and tuning an LLM:

step 1 - continual learning on top of a pretrained base model, using your dataset of protein sequences
step 2 - fine-tuning with instructions for domain-specific adaptation

How can we prepare the X/Y data for continual learning in the case of sequences?
Is it that two parts of the same sequence are treated as X and Y? If so, a sequence could be split in two at many different positions.

Could you please explain how you prepared the data for continual learning on those sequences?

Hello, you can refer to "next token prediction" or "causal language modeling" for answers.
Briefly, one sequence, which consists of several tokens, serves as both X and Y: for each token, the model predicts the next token. So we don't have to care about how to split a sequence.
The relevant code has already been integrated into various open-source libraries, such as huggingface.transformers.
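To make this concrete, here is a minimal sketch of what "one sequence serves as both X and Y" looks like with huggingface.transformers. The model name ("gpt2") and the example sequence are placeholders, not the actual ProLLaMA setup; substitute your own base model and protein data.

```python
# Minimal causal-language-modeling sketch (assumptions: "gpt2" as a
# placeholder base model; one made-up protein sequence as the data).
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One raw sequence; no manual X/Y split is needed.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
batch = tokenizer(sequence, return_tensors="pt")

# For causal LM, the labels are simply a copy of the input_ids.
# The model shifts them internally so each position predicts the next token.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)  # cross-entropy over next-token predictions

# When training with the Trainer API, DataCollatorForLanguageModeling
# with mlm=False performs the same label copy for you.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```

In other words, every position in the sequence is a training example at once, which is why no explicit split point has to be chosen.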

We have released our training code here. Hope it is helpful.
