Data for Continued Pre-Training

#8
by pszemraj - opened

Hi, firstly, awesome work! I just wanted to check in and ask about the data used for continued pre-training:

finally, continued pre-training for the entire model.

I understand that direct sharing may not be possible, but I wanted to ask if any of the continued pre-training data was synthetically generated via OpenAI models (or any other source with similarly restrictive terms of use)?

I'm also curious: how many tokens were used for the continued pre-training?

Curious too. I read their paper and didn't find the details: https://browse.arxiv.org/html/2312.15166v1

upstage org

Details of Data Sets and Training Techniques: Thank you for your interest! Unfortunately, due to the high level of competition in this field, we are unable to share detailed information about the training techniques and datasets used. We appreciate your understanding. However, we have released a list of the fine-tuning datasets at https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0.

hunkim changed discussion status to closed

How much data was used in the continued pre-training?

@hunkim thanks! Understood. I'm primarily interested in this checkpoint, upstage/SOLAR-10.7B-v1.0, since it is Apache-2.0, and based on your response it seems like you all have done your homework. I assume there is no issue with using upstage/SOLAR-10.7B-v1.0 to the fullest extent of its Apache-2.0 license, including synthetic data generation, commercial use, etc. Please advise if my interpretation is incorrect, and thanks again.
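For context, here is a minimal sketch of the kind of usage I have in mind: loading the Apache-2.0 base checkpoint with the Hugging Face transformers library and sampling from it, e.g. as the starting point for a synthetic data generation pipeline. The prompt and generation settings below are placeholders I made up, not anything from Upstage's own setup.

```python
# Minimal sketch: load the Apache-2.0 base checkpoint and sample text from it,
# e.g. as a starting point for synthetic data generation.
# Assumes the standard transformers AutoModelForCausalLM / AutoTokenizer API;
# the prompt and generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "upstage/SOLAR-10.7B-v1.0"  # Apache-2.0 base model discussed in this thread

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 10.7B model fits on a single GPU
    device_map="auto",
)

# Hypothetical prompt; in practice this would come from whatever seed data you use.
prompt = "Write a short explanation of depth up-scaling:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```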

upstage org

@pszemraj You are right! Enjoy!
