Dataset for reproduction

#7
by ahans1 - opened

Do you plan to release the dataset needed to reproduce this training run? I know you have released the Cosmopedia dataset, which accounts for 25B of the 30B tokens, but could you also release the exact split of the remaining 5B non-synthetic tokens used for this model?
