Update README.md
README.md
@@ -194,18 +194,19 @@ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the
### Pretraining Data

The pre-training corpus comprises data from 35 European languages and 92 programming languages, with the data sources detailed below.
The first three training epochs used 2.4 trillion tokens, obtained by manually adjusting the data proportions to balance the representation
and give more importance to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To do so, code and English data were downsampled to half,
the co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
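
To make the resampling concrete, the sketch below shows what such a reweighting does to per-source token shares. It is an illustration only, not the actual data pipeline: the raw token counts are hypothetical placeholders, and only the factors (0.5x for English and code, 2x for the co-official languages, 1x for everything else) come from the description above.

```python
# Illustrative reweighting of per-source token counts.
# The raw counts below are made-up placeholders; only the factors
# (0.5x English/code, 2x co-official languages, 1x otherwise) follow the text.

raw_tokens_b = {  # raw tokens per source, in billions (hypothetical)
    "en": 900, "code": 260, "es": 180, "ca": 22, "gl": 4, "eu": 3,
    "fr": 150, "de": 110, "other": 400,
}

factors = {src: 1.0 for src in raw_tokens_b}
factors["en"] = 0.5     # downsample English to half
factors["code"] = 0.5   # downsample code to half
for src in ("es", "ca", "gl", "eu"):
    factors[src] = 2.0  # oversample Spain's co-official languages by 2x

weighted = {src: n * factors[src] for src, n in raw_tokens_b.items()}
total = sum(weighted.values())

for src, n in sorted(weighted.items(), key=lambda kv: -kv[1]):
    print(f"{src:>6}: {n:7.1f}B tokens ({100 * n / total:5.2f}% of the mixture)")
```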
We then trained two additional epochs, during which the English portion of the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
This adjustment resulted in a total of 2.08 trillion tokens per epoch, distributed as outlined below:

![lang distrib](./images/corpus_languages.png)

The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes 53.05% of the total tokens.
Starcoder follows with 13.67%, and FineWebEdu (a 350B-token subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
The remaining 10% comes from smaller sources in various languages.

Feel free to click the expand button below to see the full list of sources.

@@ -344,8 +345,9 @@ To consult the data summary document with the respective licences, please send a

</details>

The model was trained for 3 pre-training epochs with 2.4T tokens per epoch, followed by 2 additional pre-training epochs in which the English part
of the Colossal OSCAR dataset was replaced with FineWebEdu (the 350B-token subset), resulting in 2.08T tokens per epoch,
and 1 final round of 0.315T higher-quality tokens, meaning that the total number of tokens seen during pre-training is approximately 11.675 trillion.
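
That total follows directly from the three phases; as a quick back-of-the-envelope check (plain arithmetic, not training code):

```python
# Back-of-the-envelope check of the total pre-training token count,
# using the per-phase figures quoted above (in trillions of tokens).
phases = [
    (3, 2.4),    # 3 epochs of 2.4T tokens (original mixture)
    (2, 2.08),   # 2 epochs of 2.08T tokens (Colossal OSCAR English -> FineWebEdu)
    (1, 0.315),  # 1 final round of 0.315T higher-quality tokens
]

total = sum(epochs * tokens for epochs, tokens in phases)
print(f"{total:.3f}T tokens seen during pre-training")  # 11.675T
```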

We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).

@@ -379,6 +381,9 @@ and public institutions, which can be found in detail in the acknowledgements.

This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).

This work is also funded by the _Ministerio para la Transformación Digital y de la Función Pública_ (Funded by EU – NextGenerationEU)
within the framework of the [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

#### Composition

**What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.**

@@ -402,10 +407,10 @@ We provide a complete list of dataset sources at the end of this section.

**How many instances are there in total (of each type, if appropriate)?**

The dataset contains a diverse range of instances across multiple languages, with notable adjustments for certain languages. English
represents the largest portion, accounting for 39.31% of the total data. Spanish was upsampled by a factor of 2, bringing its share to 16.12%,
while Catalan (1.97%), Basque (0.24%), and Galician (0.31%) were also upsampled by 2. On the other hand, code-related data was downsampled
by half, making up 5.78% of the total. Other prominent languages include French (6.6%), Russian (5.56%), German (4.79%), and Hungarian
(4.59%), with several additional languages contributing between 1% and 2%, and smaller portions represented by a variety of others.
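
As a rough consistency check, summing the shares quoted above shows how much of the corpus the explicitly listed languages account for; the remainder is spread across the long tail of other languages (figures taken verbatim from this paragraph):

```python
# Rough consistency check on the language shares quoted above (percent of total tokens).
listed_shares = {
    "English": 39.31, "Spanish": 16.12, "Catalan": 1.97, "Basque": 0.24,
    "Galician": 0.31, "Code": 5.78, "French": 6.6, "Russian": 5.56,
    "German": 4.79, "Hungarian": 4.59,
}

listed_total = sum(listed_shares.values())
print(f"Explicitly listed: {listed_total:.2f}%")          # ~85.3%
print(f"Remaining long tail: {100 - listed_total:.2f}%")  # ~14.7%, spread over other languages
```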
**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).**