proxectonos/Carballo-bloom-1.3B


pablo-rf committed on Mar 5, 2024

Commit f4d7257 • 1 Parent(s): 76126c8

[ADD] Corpus Information

Files changed (1)

README.md +24 -2


@@ -101,7 +101,7 @@ widget:
 ## Model description
 **FLOR-1.3B-GL** is a 1.3B-parameter transformer-based causal language model for Galician.
-It is the result of a continual pretraining of [FLOR-1.3B](https://huggingface.co/projecte-aina/FLOR-1.3B) (developed by [AINA Project](https://projecteaina.cat/) and based in [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7)) with the galician corpus [CorpusNos]().
 ## Intended uses and limitations
@@ -153,7 +153,29 @@ The language adaptation technique used to train FLOR-1.3B-GL is based in the use
 ### Training data
-TO-DO
 ### Training hyperparameters

 ## Model description
 **FLOR-1.3B-GL** is a 1.3B-parameter transformer-based causal language model for Galician.
+It is the result of continual pretraining of [FLOR-1.3B](https://huggingface.co/projecte-aina/FLOR-1.3B) (developed by the [AINA Project](https://projecteaina.cat/) and based on [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7)) with the Galician corpus [CorpusNÓS](https://zenodo.org/records/10687642).
 ## Intended uses and limitations
 ### Training data
+[CorpusNÓS](https://zenodo.org/records/10687642) is a massive Galician corpus of 2.1B words, designed primarily for training large language models. Its sources are varied and cover a relatively wide range of genres.
+The corpus is structured as follows:
+| Subcorpus                             | Genre               | Nº tokens      | Nº documents |
+|---------------------------------------|---------------------|----------------|--------------|
+| Data obtained via transfer agreement  | Books               | 7,255,784      | 104          |
+|                                       | Research articles   | 2,665,351      | 664          |
+|                                       | Press               | 124,253,084    | 224,419      |
+|                                       | Governmental        | 245,897,880    | 654,505      |
+|                                       | Web contents        | 15,946,686     | 44,165       |
+|                                       | Encyclopedic        | 4,799,214      | 47,396       |
+|                                       | Subtotal            | 400,817,999    | 971,253      |
+
+| Subcorpus                             | Genre               | Nº tokens      | Nº documents |
+|---------------------------------------|---------------------|----------------|--------------|
+| Public data                           | Press and blogs     | 153,497,883    | 665,265      |
+|                                       | Encyclopedic        | 57,164,848     | 184,628      |
+|                                       | Web crawls          | 1,384,015,664  | 3,366,449    |
+|                                       | Translation corpora | 133,726,004    | 4,745,799    |
+|                                       | Subtotal            | 1,728,404,399  | 8,777,514    |
+|                                       | Total               | 2,129,222,398  | 9,748,767    |
+| Download (Zenodo)                     | https://zenodo.org/records/10687642                 |
 ### Training hyperparameters