Train datasets?

#11

by danielpark - opened Mar 29

Mar 29

I appreciate Databricks for impressive models like Dolly and Dolly-RLHF, recognizing their strengths in data and their deployment of great models based on them.

However, the latest DBRX release did not specify the datasets used, noting only that it involved a better dataset mix with 12T tokens of data.

Is Databricks no longer considering data disclosure, or is it published somewhere I'm unaware of?

Thank you in advance.

Qubitium

Apr 3

edited Apr 3

@abhi-db @hanlintang @srowen If much of the dataset is proprietary, that's fine. Keep it behind closed doors. But can you guys expose at least one piece of the open-source training data used for the base model? The reason: I am a contributor to the AutoGPTQ library working on DBRX support, and we need this for optimal quantization calibration. Thank you.
