jarodrigues committed
Commit e80ad2f
1 Parent(s): 34d1dcf

Update README.md

Files changed (1)
  1. README.md +7 -6
README.md CHANGED
@@ -41,15 +41,16 @@ It has different versions that were trained for different variants of Portuguese
  namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
  and it is distributed free of charge and under a most permissible license.
 
- **Albertina PT-BR** is the version for American **Portuguese** from **Brazil**,
- and to the best of our knowledge, at the time of its initial distribution,
- it is an encoder specifically for this language and variant
+ **Albertina PT-BR No-brWaC** is a version for American **Portuguese** from **Brazil** trained on
+ data sets other than brWaC, and thus with a most permissive license.
+
+ You may be interested also in [**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr), trained on brWaC.
+ To the best of our knowledge, these are encoders specifically for this language and variant
  that sets a new state of the art for it, and is made publicly available
  and distributed for reuse.
 
 
-
- It is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
+ **Albertina PT-BR No-brWaC** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
 
  For further details, check the respective [publication](https://arxiv.org/abs/2305.06721):
 
  ``` latex
@@ -101,7 +102,7 @@ We skipped the default filtering of stopwords since it would disrupt the syntact
 
  As codebase, we resorted to the [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge), for English.
 
- To train [**Albertina PT-PT No-brWac**](https://huggingface.co/PORTULAN/albertina-ptpt-nobrwac), the data set was tokenized with the original DeBERTa tokenizer with a 128 token sequence truncation and dynamic padding.
+ To train [**Albertina PT-PT No-brWac**](https://huggingface.co/PORTULAN/albertina-ptbr-nobrwac), the data set was tokenized with the original DeBERTa tokenizer with a 128 token sequence truncation and dynamic padding.
  The model was trained using the maximum available memory capacity resulting in a batch size of 896 samples (56 samples per GPU).
  We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps.
  In total, around 200k training steps were taken across 50 epochs.
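
The Python sketch below (not part of the committed README) illustrates how the tokenization and optimisation settings quoted in the diff could be expressed with the Hugging Face `transformers` and `datasets` libraries. The corpus file name, output directory, and the use of `Trainer` with `DataCollatorForLanguageModeling` are illustrative assumptions; only the DeBERTa V2 XLarge codebase and tokenizer, the 128-token truncation with dynamic padding, the 1e-5 learning rate with linear decay and 10k warm-up steps, the 50 epochs, and the 56-samples-per-GPU batch size come from the README text.

``` python
# Minimal sketch of the described setup, not the authors' training script.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Original DeBERTa V2 XLarge codebase and tokenizer, as stated in the README.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-v2-xlarge")

# Placeholder corpus file: the actual PT-BR data sets are not reproduced here.
raw = load_dataset("text", data_files={"train": "ptbr_corpus.txt"})

def tokenize(batch):
    # 128-token sequence truncation; padding is deferred to the collator,
    # which pads dynamically to the longest sequence in each batch.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Masked-language-modelling objective with dynamic padding.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

# 1e-5 learning rate with linear decay and 10k warm-up steps; 56 samples per
# GPU corresponds to the 896-sample global batch over 16 GPUs implied above.
args = TrainingArguments(
    output_dir="albertina-ptbr-nobrwac-sketch",  # illustrative path
    per_device_train_batch_size=56,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    num_train_epochs=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

With a corpus of the size described, 50 epochs at this global batch size works out to roughly the 200k optimisation steps mentioned in the README; the exact step count depends on the tokenized data set length.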