jsaizant committed on
Commit ed8ec15 · verified · 1 Parent(s): 541280f

Update README.md

Files changed (1)
  1. README.md +20 -15
README.md CHANGED
@@ -194,18 +194,19 @@ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the

  ### Pretraining Data

- The training corpus consists of 2.4 trillion tokens, including 35 European languages and 92 programming languages. It amounts to a total of 33TB of pre-processed text.
- Languages were sampled manually by giving x2 oversampling to Spain's co-official languages (Spanish, Catalan, Galician and Basque), code was undersampled by half,
- and the rest of the languages were kept as is, resulting in the following distribution:

  ![lang distrib](./images/corpus_languages.png)

- This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
- which contributes a significant 66.06% of the total tokens.
- Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
- The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
- Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
- These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
  The remaining 10% comes from smaller sources in various languages.

  Feel free to click the expand button below to see the full list of sources.
@@ -344,8 +345,9 @@ To consult the data summary document with the respective licences, please send a

  </details>

- The model was trained for 3 epochs, with two final rounds of 0.3B higher-quality tokens each,
- meaning that the total number of tokens seen during pre-training amounts to roughly 7.8 trillion tokens.

  We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
@@ -379,6 +381,9 @@ and public institutions, which can be found in detail in the acknowledgements.

  This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).

  #### Composition

  **What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.**
@@ -402,10 +407,10 @@ We provide a complete list of dataset sources at the end of this section.

  **How many instances are there in total (of each type, if appropriate)?**

  The dataset contains a diverse range of instances across multiple languages, with notable adjustments for certain languages. English
- represents the largest portion, accounting for 39.08% of the total data. Spanish was upsampled by a factor of 2, bringing its share to 16.59%,
- while Catalan (1.84%), Basque (0.26%), and Galician (0.36%) were also upsampled by 2. On the other hand, code-related data was downsampled
- by half, making up 6.42% of the total. Other prominent languages include French (6.59%), Russian (5.39%), German (4.25%), and Hungarian
- (3.93%), with several additional languages contributing between 1% and 2%, and smaller portions represented by a variety of others.

  **Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).**
 
  ### Pretraining Data

+ The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
+ The initial three training epochs used a 2.4-trillion-token corpus, obtained by manually adjusting the data proportions to balance representation
+ and give more weight to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To this end, code and English data were downsampled to half,
+ Spain's co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions (a toy sketch of these factors is given after the source breakdown below).
+ Subsequently, we trained two additional epochs during which the English portion of the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
+ This adjustment resulted in a total of 2.08 trillion tokens, distributed as outlined below:

  ![lang distrib](./images/corpus_languages.png)

+ The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+ Following this, Starcoder provides 13.67%, and FineWebEdu (a 350B-token subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
+ Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
+ These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
  The remaining 10% comes from smaller sources in various languages.
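
As a rough illustration of how the 2x oversampling and 0.5x downsampling factors reshape the corpus mix, here is a minimal Python sketch. All group names and token counts in it are made-up placeholders, not the actual corpus statistics or the real sampling pipeline; the true figures are those shown in the figure and percentages above.

```python
# Minimal sketch (not the actual pipeline): applying per-group sampling factors.
# Token counts below are illustrative placeholders only.
raw_tokens_billions = {
    "english": 1100.0,         # hypothetical English web data
    "co_official": 250.0,      # Spanish, Catalan, Galician, Basque combined
    "code": 330.0,             # programming languages
    "other_languages": 900.0,  # remaining European languages
}

sampling_factor = {
    "english": 0.5,          # downsampled to half
    "co_official": 2.0,      # oversampled by 2x
    "code": 0.5,             # downsampled to half
    "other_languages": 1.0,  # kept in original proportion
}

# Effective contribution of each group after applying its sampling factor.
effective = {k: v * sampling_factor[k] for k, v in raw_tokens_billions.items()}
total = sum(effective.values())

for group, tokens in effective.items():
    print(f"{group:16s} {tokens:7.1f}B tokens ({100 * tokens / total:5.2f}%)")
print(f"{'total':16s} {total:7.1f}B tokens per epoch")
```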

  Feel free to click the expand button below to see the full list of sources.
 
  </details>

+ The model was trained for 3 pre-training epochs of 2.4T tokens each; 2 additional pre-training epochs in which the English part
+ of the Colossal OSCAR dataset was replaced with FineWebEdu (a 350B-token subset), resulting in 2.08T tokens per epoch;
+ and 1 final round of 0.315T higher-quality tokens. In total, the model therefore saw approximately 11.675 trillion tokens during pre-training.
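
The 11.675T figure is simply the sum of the three stages described above; here is a quick arithmetic check using only the numbers quoted in that paragraph:

```python
# Token budget check (values in trillions of tokens), per the paragraph above.
initial_epochs = 3 * 2.4    # three epochs over the original 2.4T-token corpus
fineweb_epochs = 2 * 2.08   # two epochs with English Colossal OSCAR swapped for FineWebEdu
final_round = 0.315         # one final round of higher-quality tokens

total = initial_epochs + fineweb_epochs + final_round
print(f"{total:.3f} trillion tokens")  # 11.675 trillion tokens
```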

  We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
 
 
  This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).

+ This work is also funded by the _Ministerio para la Transformación Digital y de la Función Pública_ and by the EU through NextGenerationEU,
+ within the framework of the [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
+

  #### Composition

  **What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.**
 
  **How many instances are there in total (of each type, if appropriate)?**

  The dataset contains a diverse range of instances across multiple languages, with notable adjustments for certain languages. English
+ represents the largest portion, accounting for 39.31% of the total data. Spanish was upsampled by a factor of 2, bringing its share to 16.12%,
+ while Catalan (1.97%), Basque (0.24%), and Galician (0.31%) were also upsampled by 2. On the other hand, code-related data was downsampled
+ by half, making up 5.78% of the total. Other prominent languages include French (6.6%), Russian (5.56%), German (4.79%), and Hungarian
+ (4.59%), with several additional languages contributing between 1% and 2%, and smaller portions represented by a variety of others.

  **Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).**