ahb committed on
Commit
06d9cdd
1 Parent(s): 450aa46

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -82,7 +82,7 @@ This model is distributed free of charge under the [MIT](https://choosealicense.
 
 # Training Data
 
-**Albertina PT-PT** was trained over a data set that resulted from gathering some openly available corpora of European Portuguese from the following sources:
+**Albertina PT-PT** was trained over a 2.2 billion token data set that resulted from gathering some openly available corpora of European Portuguese from the following sources:
 
 - [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301): the OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature. It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters. Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose meta-data indicate the Internet country code top-level domain of Portugal. We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
 - [DCEP](https://joint-research-centre.ec.europa.eu/language-technology-resources/dcep-digital-corpus-european-parliament_en): the Digital Corpus of the European Parliament is a multilingual corpus including documents in all official EU languages published on the European Parliament's official website. We retained its European Portuguese portion.
@@ -90,7 +90,7 @@ This model is distributed free of charge under the [MIT](https://choosealicense.
 - [ParlamentoPT](https://www.parlamento.pt/): the ParlamentoPT is a data set we obtained by gathering the publicly available documents with the transcription of the debates in the Portuguese Parliament.
 
 
-[**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr), in turn, was trained over the [BrWac](https://huggingface.co/datasets/brwac) data set.
+[**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr), in turn, was trained over the 2.7 billion token [BrWac](https://huggingface.co/datasets/brwac) data set.
 
 
 ## Preprocessing
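
For reference, the extra OSCAR filtering step described in the diff above (keeping only documents whose metadata point to Portugal's `.pt` country-code top-level domain) could be sketched roughly as follows. This is a minimal illustration, not the authors' published pipeline: the `language="pt"` argument and the `meta["warc_headers"]["warc-target-uri"]` field follow the OSCAR-2301 dataset card, while the hostname-suffix check is an assumed way of implementing the ccTLD test.

```python
from urllib.parse import urlparse

from datasets import load_dataset

# Stream the Portuguese portion of OSCAR 23.01 from the Hub
# (the dataset is gated: accept its terms and authenticate first).
oscar_pt = load_dataset(
    "oscar-corpus/OSCAR-2301",
    language="pt",
    split="train",
    streaming=True,
    token=True,
)

def from_portugal_tld(doc):
    # Each OSCAR 23.01 record carries the crawled page's URL in its
    # WARC headers; keep only hosts under the .pt ccTLD. (Assumption:
    # this mirrors the "country code top-level domain" filter in the card.)
    url = doc["meta"]["warc_headers"]["warc-target-uri"]
    host = urlparse(url).hostname or ""
    return host.endswith(".pt")

# Lazily filtered stream approximating the European Portuguese subset.
oscar_ptpt = oscar_pt.filter(from_portugal_tld)
```

Because the dataset is streamed, the filter is applied lazily: nothing is downloaded until `oscar_ptpt` is actually iterated.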