Training datasets?

#11
by danielpark - opened

I appreciate Databricks for impressive models like Dolly and Dolly-RLHF, and I recognize the company's strength in data and in deploying great models built on it.

However, the latest DBRX release did not specify the datasets used, noting only that it involved a better dataset mix totaling 12T tokens.

Is Databricks no longer considering data disclosure, or is it published somewhere I'm unaware of?

Thank you in advance.

@abhi-db @hanlintang @srowen If much of the dataset is proprietary, that's fine; keep it behind closed doors. But could you release at least one piece of the open-source training data used for the base model? The reason: I'm a contributor to the AutoGPTQ library working on DBRX support, and we need this for optimal quantization calibration. Thank you.
