Why was no Slavic language included in the training dataset?

#56
by brabecjan91 - opened

It would seem natural to me to include at least a few languages from the Slavic family, given that the languages in the training dataset were apparently selected to be diverse.

brabecjan91 changed discussion title from Why was no slavic language included in training dataset? to Why was no slavic language included in the training dataset?
BigScience Workshop org

A good pointer that may answer your question is @yjernite's tweet answering "why wasn't language X included in the @BigScienceLLM training data" 🤗

BigScience Workshop org

Closing as this seems to have been resolved. Thank you!

TimeRobber changed discussion status to closed
