Recommendations for additional pretraining?

#8 opened by ZQ-Dev

First off, thanks for the amazing contribution!

Based on internal knowledge of the model and its training process, do you have any recommendations for users seeking to perform further domain-specific pretraining? For example, as seen here for the medical domain?

https://arxiv.org/abs/2304.14454

Hey @ZQ-Dev, thanks for reaching out. We are definitely interested in learning more about what people plan to build on top of it. Are you interested in continued pre-training for the medical domain?

Hi @daria-soboleva, thanks for the quick response! I'm targeting a different domain (not medical, feel free to DM for more context), but it's essentially the same problem as the medical domain, i.e. heavy use of domain-specific jargon, acronyms, vocabulary, etc. that I would like to incorporate into a model's pretraining before instruction tuning for downstream tasks.

@ZQ-Dev gotcha, yeah that makes sense. I would recommend splitting your continued pre-training dataset into a train set and a holdout set, so that you can assess quality without instruction-based fine-tuning. Make sure there is no overlap between the training set and your new domain-specific holdout set; feel free to use our scripts for document-level decontamination: https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/slimpajama.

Ideally you also don't want many examples repeated between SlimPajama and your in-domain training set, but checking for that is probably more hassle than it is worth, so I would not worry about those duplicates.

On extending the vocab: this is still not a very well-researched area, but my recommendation would be to check how many tokens you would need to add, and if it is a small number, just extend the existing vocab. Another tip is to check the fertility score on your tokenized dataset, to see whether you need to re-train the tokenizer from scratch for your vocabulary. If you do re-train it, make sure the old tokens keep their original ids in the new vocab. Hope that helps.
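As a rough illustration of the split-and-check step described above (this is not the linked Cerebras decontamination pipeline, just a minimal sketch using the Hugging Face `datasets` library; the corpus path is a placeholder and the overlap check here is exact-match only, whereas the SlimPajama scripts do fuzzy, MinHash-based deduplication):

```python
import hashlib
from datasets import load_dataset

# Placeholder path: replace with your own domain-specific corpus.
ds = load_dataset("json", data_files="my_domain_corpus.jsonl", split="train")

# Carve out a holdout set so quality can be assessed without instruction tuning.
splits = ds.train_test_split(test_size=0.01, seed=42)
train_ds, holdout_ds = splits["train"], splits["test"]

def doc_hash(example):
    # Exact document-level hash; the linked SlimPajama scripts additionally
    # catch near-duplicates via fuzzy (MinHash-LSH) deduplication.
    return {"hash": hashlib.sha256(example["text"].encode("utf-8")).hexdigest()}

train_hashes = set(train_ds.map(doc_hash)["hash"])
holdout_hashes = set(holdout_ds.map(doc_hash)["hash"])

overlap = train_hashes & holdout_hashes
print(f"Exact-duplicate documents shared between train and holdout: {len(overlap)}")
```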
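And a sketch for the tokenizer side: measuring fertility (tokens produced per whitespace-separated word) on a sample of in-domain text, then extending the existing vocab if only a small number of tokens are missing. The checkpoint name, sample documents, and new tokens below are placeholders; `add_tokens` and `resize_token_embeddings` are the standard Hugging Face calls for appending vocab entries and growing the embedding matrix.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the model you are continuing to pre-train.
checkpoint = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Fertility = tokens per whitespace-separated word; a high value on in-domain
# text suggests the current vocab splits your jargon into many small pieces.
sample_docs = ["replace with a sample of your in-domain documents"]  # placeholder
n_tokens = sum(len(tokenizer(doc)["input_ids"]) for doc in sample_docs)
n_words = sum(len(doc.split()) for doc in sample_docs)
print(f"Fertility: {n_tokens / n_words:.2f} tokens per word")

# If only a handful of domain terms are missing, extend the vocab in place;
# existing tokens keep their original ids, new ones are appended at the end.
new_tokens = ["<placeholder_acronym>", "<placeholder_term>"]  # hypothetical
num_added = tokenizer.add_tokens(new_tokens)

model = AutoModelForCausalLM.from_pretrained(checkpoint)
if num_added > 0:
    # Grow the input (and tied output) embedding matrix to cover the new ids.
    model.resize_token_embeddings(len(tokenizer))
```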

@daria-soboleva super helpful, thank you!
