jarodrigues committed
Commit e80ad2f • 1 Parent(s): 34d1dcf
Update README.md
README.md CHANGED
@@ -41,15 +41,16 @@ It has different versions that were trained for different variants of Portuguese
 namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
 and it is distributed free of charge and under a most permissive license.
 
-**Albertina PT-BR** is
-
-
+**Albertina PT-BR No-brWaC** is a version for American **Portuguese** from **Brazil** trained on
+data sets other than brWaC, and thus with a most permissive license.
+
+You may be interested also in [**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr), trained on brWaC.
+To the best of our knowledge, this is an encoder specifically for this language and variant
 that sets a new state of the art for it, and is made publicly available
 and distributed for reuse.
 
 
-
-It is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
+**Albertina PT-BR No-brWaC** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
 For further details, check the respective [publication](https://arxiv.org/abs/2305.06721):
 
 ``` latex
@@ -101,7 +102,7 @@ We skipped the default filtering of stopwords since it would disrupt the syntact
 
 As codebase, we resorted to the [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge), for English.
 
-To train [**Albertina PT-PT No-brWac**](https://huggingface.co/PORTULAN/albertina-
+To train [**Albertina PT-BR No-brWaC**](https://huggingface.co/PORTULAN/albertina-ptbr-nobrwac), the data set was tokenized with the original DeBERTa tokenizer with a 128 token sequence truncation and dynamic padding.
 The model was trained using the maximum available memory capacity, resulting in a batch size of 896 samples (56 samples per GPU).
 We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps.
 In total, around 200k training steps were taken across 50 epochs.
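The added line above, together with the unchanged lines that follow it, spells out the pre-training recipe: 128-token truncation with dynamic padding, 56 samples per GPU (896 in total), a 1e-5 learning rate with linear decay and 10k warm-up steps, and roughly 200k steps over 50 epochs. As a rough, non-authoritative sketch of how those hyperparameters map onto the Hugging Face `transformers` Trainer API (the corpus file, output directory, and masked-language-modeling setup below are illustrative assumptions, not details taken from this commit):

```python
# Illustrative sketch only: hyperparameters mirror the README text
# (128-token truncation, dynamic padding, 56 samples per GPU, lr 1e-5
# with linear decay, 10k warm-up steps, 50 epochs); the corpus file,
# output directory, and MLM objective are placeholder assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# DeBERTa V2 XLarge is the codebase named in the README.
checkpoint = "microsoft/deberta-v2-xlarge"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Placeholder corpus; the actual PT-BR data sets are described elsewhere in the README.
raw = load_dataset("text", data_files={"train": "ptbr_corpus.txt"})

def tokenize(batch):
    # Truncate to 128 tokens; padding is deferred to the collator (dynamic padding).
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic padding plus masking for a DeBERTa-style encoder.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="albertina-ptbr-nobrwac-sketch",
    per_device_train_batch_size=56,   # 56 samples per GPU, 896 in total
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    num_train_epochs=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The 896-sample global batch implied by the README corresponds to 56 samples on each of 16 devices (896 / 56 = 16); the sketch leaves the actual multi-GPU launch mechanism unspecified.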