Disclosure of training data needed

by markding - opened

What exactly did the "4.5T tokens of high-quality training data during the training phase" consist of? Knowing this is important for interpreting the evaluation results and for understanding the potential legal aspects of deployment.

Names of specific datasets and identification of languages would be very useful.

Thank you for your work!

ZekeWang changed discussion status to closed

Closed without even a comment?
