pablo-rf committed on
Commit
f4d7257
1 Parent(s): 76126c8

[ADD] Corpus Information

Files changed (1)
  1. README.md +24 -2
README.md CHANGED
@@ -101,7 +101,7 @@ widget:
101
  ## Model description
102
 
103
  **FLOR-1.3B-GL** is a 1.3B-parameter transformer-based causal language model for Galician.
104
- It is the result of a continual pretraining of [FLOR-1.3B](https://huggingface.co/projecte-aina/FLOR-1.3B) (developed by [AINA Project](https://projecteaina.cat/) and based in [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7)) with the galician corpus [CorpusNos]().
105
 
106
  ## Intended uses and limitations
107
 
@@ -153,7 +153,29 @@ The language adaptation technique used to train FLOR-1.3B-GL is based in the use
153
 
154
  ### Training data
155
 
156
- TO-DO
157
 
158
 
159
  ### Training hyperparameters
 
101
  ## Model description
102
 
103
  **FLOR-1.3B-GL** is a 1.3B-parameter transformer-based causal language model for Galician.
104
+ It is the result of continual pretraining of [FLOR-1.3B](https://huggingface.co/projecte-aina/FLOR-1.3B) (developed by the [AINA Project](https://projecteaina.cat/) and based on [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7)) on the Galician corpus [CorpusNÓS](https://zenodo.org/records/10687642).
105
 
106
  ## Intended uses and limitations
107
 
 
153
 
154
  ### Training data
155
 
156
+ [CorpusNÓS](https://zenodo.org/records/10687642) is a massive Galician corpus of 2.1B words, devised primarily for training large language models. Its sources are varied and cover a relatively wide range of genres.
157
+
158
+ The corpus is structured as follows:
159
+
160
+ | Subcorpus | Genre | Nº tokens | Nº documents |
161
+ |---------------------------------------|---------------------|----------------|--------------|
162
+ | Data obtained via transfer agreement | Books | 7,255,784 | 104 |
163
+ | | Research articles | 2,665,351 | 664 |
164
+ | | Press | 124,253,084 | 224,419 |
165
+ | | Governmental | 245,897,880 | 654,505 |
166
+ | | Web contents | 15,946,686 | 44,165 |
167
+ | | Encyclopedic | 4,799,214 | 47,396 |
168
+ | | Subtotal | 400,817,999 | 971,253 |
169
+
170
+ | Subcorpus | Genre | Nº tokens | Nº documents |
171
+ |---------------------------------------|---------------------|----------------|--------------|
172
+ | Public data | Press and blogs | 153,497,883 | 665,265 |
173
+ | | Encyclopedic | 57,164,848 | 184,628 |
174
+ | | Web crawls | 1,384,015,664 | 3,366,449 |
175
+ | | Translation corpora | 133,726,004 | 4,745,799 |
176
+ | | Subtotal | 1,728,404,399 | 8,777,514 |
177
+ | | Total | 2,129,222,398 | 9,748,767 |
178
+ | Download (Zenodo) | https://zenodo.org/records/10687642 |
179
 
180
 
181
  ### Training hyperparameters