Please document pretraining datasets

#49
opened by markding

It is surprisingly hard to find out which datasets were used in pretraining. Could you provide more details? The dataset statement on the Cohere site appears to apply to GPT models that are now several years old, and it offers no details either: https://docs.cohere.com/docs/data-statement

At https://opening-up-chatgpt.github.io/ we track degrees of openness for instruction-tuned LLMs that are released openly in some form (in the case of Command R+, the model weights). FWIW, Command R+ entered the bottom 5 of the >30 models currently tracked, just below Llama 2.


Providing such details will soon be required under the EU AI Act.