BLOOM training languages inconsistencies

#47
by Muennighoff - opened
BigScience Workshop org
  1. Lingala is mentioned in the language codes (ln), but not in the "Distribution of Niger Congo and Indic languages" part.
  2. What is our definition of a training language? The training data has 341B tokens, so at 0.00002%, there are ~68,200 tokens of Chi Tumbuka in there. In English, 1 token ~ 0.75 words (a) and 1 sentence ~ 20 words (b). If the ratios are roughly the same in Chi Tumbuka, the model has seen ~2,500 sentences of that language, as we did ~1 epoch (a quick sketch of this arithmetic is below the footnotes). Do we have statistics on how much German or Japanese accidentally landed in the training data? I'd guess it's more than Chi Tumbuka, so where do we draw the line for a "training language"? We don't have any evaluation datasets for Chi Tumbuka afaik, so if we don't get hold of one of its ~3M speakers we'll never know whether it knows the language at all. 👻

a) https://openai.com/api/pricing/#faq-token
b) https://www.quora.com/What-is-the-average-number-of-words-per-sentence-in-common-writings-such-as-news-articles-and-college-essays
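For reference, here is the back-of-the-envelope estimate above as a tiny sketch. The 0.75 words/token and 20 words/sentence ratios are the English approximations from (a) and (b), so the result is only an order-of-magnitude guess:

```python
# Rough estimate of how much Chi Tumbuka the model saw during training.
# Assumes the English ratios from (a) and (b) carry over: ~0.75 words per
# token and ~20 words per sentence.

total_tokens = 341e9             # total training tokens (~1 epoch)
language_share = 0.00002 / 100   # 0.00002% of the corpus

tokens = total_tokens * language_share   # ~68,200 tokens
words = tokens * 0.75                    # ~51,150 words
sentences = words / 20                   # ~2,557 sentences

print(f"{tokens:,.0f} tokens ~ {words:,.0f} words ~ {sentences:,.0f} sentences")
```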

cc @lintang @cakiki

BigScience Workshop org

Sorry for the late reply, Niklas! Somehow I was not notified that I was mentioned.

  1. We have 1,650,804 bytes of Lingala data; not sure why it's not listed. Good catch! I will have a look. For reference:

[attached screenshot]

  2. Estimating how many languages accidentally made their way into the corpus is a very interesting question (this probably mostly affects the catalogue). I think we have to quantify this explicitly; I will bring it up in tomorrow's Viz and Analysis meeting (a rough sketch of one way to measure it is below). As for whether crossing a certain threshold should qualify a language as a "training language", that question is both fascinating and difficult; I don't have an answer right now.
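To make the quantification concrete, here is a minimal sketch of the kind of measurement I have in mind, assuming an off-the-shelf fastText language-ID model (`lid.176.bin`, downloaded separately) and a sample of raw documents; this is purely illustrative, not how the ROOTS pipeline works:

```python
# Sketch: quantify "language leakage" by running a language identifier over a
# sample of corpus documents and counting bytes per predicted language.
import collections
import fasttext

lid = fasttext.load_model("lid.176.bin")  # fastText language-ID model

def leakage_stats(documents, min_confidence=0.8):
    """Return bytes per predicted language over an iterable of text documents."""
    bytes_per_lang = collections.Counter()
    for doc in documents:
        text = doc.replace("\n", " ").strip()  # fastText predicts one line at a time
        if not text:
            continue
        labels, probs = lid.predict(text, k=1)
        lang = labels[0].replace("__label__", "")
        if probs[0] >= min_confidence:
            bytes_per_lang[lang] += len(text.encode("utf-8"))
    return bytes_per_lang

# Usage: stats = leakage_stats(sample_of_documents); stats.most_common(20)
```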

cc @yjernite @aymm

BigScience Workshop org

Hi Niklas, thanks for catching the oversight!

That's an interesting question, and there are indeed several good definitions of what should be considered a training language (or what constitutes a language really!).

In our case, much of the philosophy of how we built the training corpus was constructive: identify languages we wanted to represent a priori, find participants who had expertise in those languages, select some sources of data in the language of interest, and make choices about how each of these languages was pre-processed. We are taking a process-driven perspective here, and identifying training languages as the ones for which we followed this approach. Conversely, German and Japanese did not receive the same level of attention or intentionality, so we are not listing them.

This brings us to the question of evaluation and of reporting the model's performance (and whether its abilities come from the training data or from cross-lingual transfer). We do have evaluations that include German at least (I'm not sure about Japanese), so we will have to qualify those results by pointing to the German text in the training data. We are also missing evaluations for some of the languages we intentionally included. We do want to have evaluations for all of the Niger-Congo languages eventually, beyond the 9 that are currently in FLORES: signaling that there is an LLM out there trained on as much data as was available in 2022 for each of these languages (we really looked!) will hopefully help make that happen.

TL;DR: in the context of describing the training corpus and our collaborative approach, listing the intentionally selected languages makes more sense, even if it can conflict with how we think about training languages for evaluation. We'll have to keep analyzing our corpus and documenting the "language leakage" to make sure evaluation results are properly interpreted.

BigScience Workshop org

Great thoughts, thanks a lot @yjernite @cakiki ! Indeed, FLORES-200 has all of our training languages except for code, so we can evaluate them all on it 👍
I opened a PR here for Lingala: https://huggingface.co/bigscience/bloom/discussions/60 😇

Muennighoff changed discussion status to closed
