Can I continue pretraining this model for domain adaptation?

Opened by sadahila

Hi,

Can I continue pretraining this model for text generation, for domain adaptation purposes? Do you know if this would lose its instruction-tuned abilities?

Thanks in advance

Together org

Hi @sadahila , that's an interesting question!

I think it mostly depends on how long / how many tokens you continue training the model for. It's best, of course, if you have instruction data for your domain -- it would also be interesting to see whether it is possible to generate such a dataset using instruction backtranslation (https://arxiv.org/pdf/2308.06259.pdf). Let me know if you try that!
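
For concreteness, here is a minimal sketch of what continued causal-LM pretraining on domain text could look like with the Hugging Face Trainer. The model ID, corpus path, and hyperparameters are placeholder assumptions for illustration, not a recipe from this thread:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Assumed starting checkpoint; swap in the base Llama-2-7B-32K if you want to
# avoid touching the instruction-tuned behavior.
model_name = "togethercomputer/Llama-2-7B-32K-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumes one domain document per line in a plain-text file (hypothetical path).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-domain-adapted",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=train_data,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The longer this stage runs relative to the original instruction tuning, the more likely the instruct behavior is to fade, which is why the number of tokens matters.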

Thanks for the paper suggestion! I can also generate instruction data for our domain. However, I'm wondering if there is some benefit to first doing causal language modeling training on unsupervised data for domain adaptation, and then doing instruction finetuning with our domain-specific data. Or would you advise starting with Llama-2-7B-32K instead of the Instruct version?

Thanks.

Together org

I see, I think it's definitely worth a try to build on Llama-2-7B-32K if you have enough instruction-tuning data for your domain (for comparison, the llama-instruct dataset has 19k samples). That way you would end up with a base model fine-tuned to your domain and then also instruction-tuned. Out of curiosity, what is your technique for generating instructions?

Thanks for the advice. I actually just add a header to our data, similar to the dolly-instruct dataset: "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction:"
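
In code, that header-based formatting might look roughly like the sketch below. The "instruction" / "response" field names and the "### Response:" marker are assumptions based on the usual dolly/alpaca-style layout, not something confirmed in this thread:

```python
# Dolly-style header quoted above.
PROMPT_HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)

def format_example(example):
    # Build one training string: header, instruction, then the expected response.
    return {
        "text": (
            PROMPT_HEADER
            + "### Instruction:\n" + example["instruction"] + "\n\n"
            + "### Response:\n" + example["response"]
        )
    }

# Usage with a Hugging Face dataset: dataset = dataset.map(format_example)
```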

sadahila changed discussion status to closed
