[ADD] Corpus Information
README.md
CHANGED
@@ -101,7 +101,7 @@ widget:
 ## Model description

 **FLOR-1.3B-GL** is a 1.3B-parameter transformer-based causal language model for Galician.
-It is the result of a continual pretraining of [FLOR-1.3B](https://huggingface.co/projecte-aina/FLOR-1.3B) (developed by [AINA Project](https://projecteaina.cat/) and based in [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7)) with the galician corpus [CorpusNos]().

 ## Intended uses and limitations

@@ -153,7 +153,29 @@ The language adaptation technique used to train FLOR-1.3B-GL is based in the use
 ### Training data

-

 ### Training hyperparameters
@@ -101,7 +101,7 @@ widget:
 ## Model description

 **FLOR-1.3B-GL** is a 1.3B-parameter transformer-based causal language model for Galician.
+It is the result of a continual pretraining of [FLOR-1.3B](https://huggingface.co/projecte-aina/FLOR-1.3B) (developed by [AINA Project](https://projecteaina.cat/) and based in [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7)) with the galician corpus [CorpusNos](https://zenodo.org/records/10687642).

 ## Intended uses and limitations

@@ -153,7 +153,29 @@ The language adaptation technique used to train FLOR-1.3B-GL is based in the use
 ### Training data

+[CorpusNÓS](https://zenodo.org/records/10687642 ) is a massive Galician corpus made up of 2.1B words primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres.
+
+The corpus is structured as follows:
+
+| Subcorpus                             | Genre               | No. of tokens  | No. of documents |
+|---------------------------------------|---------------------|----------------|------------------|
+| Data obtained via transfer agreement  | Books               | 7,255,784      | 104              |
+|                                       | Research articles   | 2,665,351      | 664              |
+|                                       | Press               | 124,253,084    | 224,419          |
+|                                       | Governmental        | 245,897,880    | 654,505          |
+|                                       | Web contents        | 15,946,686     | 44,165           |
+|                                       | Encyclopedic        | 4,799,214      | 47,396           |
+|                                       | Subtotal            | 400,817,999    | 971,253          |
+
+| Subcorpus                             | Genre               | No. of tokens  | No. of documents |
+|---------------------------------------|---------------------|----------------|------------------|
+| Public data                           | Press and blogs     | 153,497,883    | 665,265          |
+|                                       | Encyclopedic        | 57,164,848     | 184,628          |
+|                                       | Web crawls          | 1,384,015,664  | 3,366,449        |
+|                                       | Translation corpora | 133,726,004    | 4,745,799        |
+|                                       | Subtotal            | 1,728,404,399  | 8,777,514        |
+|                                       | Total               | 2,129,222,398  | 9,748,767        |
+| Download (Zenodo)                     | https://zenodo.org/records/10687642 | | |
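
Review note: the subtotals and totals in the added table can be checked with a few lines of arithmetic. The sketch below is only a consistency check on the figures copied from the table; the variable and function names (`transfer`, `public`, `stated`, `column_sums`) are illustrative and do not come from the README or the corpus tooling.

```python
# Figures copied from the corpus table: (tokens, documents) per genre.
transfer = {
    "Books": (7_255_784, 104),
    "Research articles": (2_665_351, 664),
    "Press": (124_253_084, 224_419),
    "Governmental": (245_897_880, 654_505),
    "Web contents": (15_946_686, 44_165),
    "Encyclopedic": (4_799_214, 47_396),
}
public = {
    "Press and blogs": (153_497_883, 665_265),
    "Encyclopedic": (57_164_848, 184_628),
    "Web crawls": (1_384_015_664, 3_366_449),
    "Translation corpora": (133_726_004, 4_745_799),
}
# Subtotal and total rows exactly as stated in the table.
stated = {
    "transfer": (400_817_999, 971_253),
    "public": (1_728_404_399, 8_777_514),
    "total": (2_129_222_398, 9_748_767),
}


def column_sums(rows):
    """Return (token sum, document sum) over the genre rows."""
    tokens = sum(t for t, _ in rows.values())
    docs = sum(d for _, d in rows.values())
    return tokens, docs


transfer_sum = column_sums(transfer)
public_sum = column_sums(public)

# Compare each computed column sum with the stated subtotal.
for name, computed in (("transfer", transfer_sum), ("public", public_sum)):
    for label, got, want in zip(("tokens", "documents"), computed, stated[name]):
        status = "matches" if got == want else "differs from"
        print(f"{name} {label}: computed {got:,} {status} stated {want:,}")

# Share of the whole corpus contributed by web crawls, by token count.
web_share = public["Web crawls"][0] / stated["total"][0]
print(f"Web crawls share of all tokens: {web_share:.1%}")
```

Run as written, the token columns add up to the stated subtotals and total. The four public-data document counts sum to 8,962,141, while the table states a subtotal of 8,777,514, so one of those document figures may be worth re-checking against the Zenodo record.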


 ### Training hyperparameters