jarodrigues committed
Commit e80ad2f • 1 Parent(s): 34d1dcf
Update README.md
README.md CHANGED
@@ -41,15 +41,16 @@ It has different versions that were trained for different variants of Portuguese
 namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
 and it is distributed free of charge and under a most permissive license.
 
-**Albertina PT-BR** is
-
-
+**Albertina PT-BR No-brWaC** is a version for American **Portuguese** from **Brazil** trained on
+data sets other than brWaC, and thus with a most permissive license.
+
+You may be interested also in [**Albertina PT-BR**](https://huggingface.co/PORTULAN/albertina-ptbr), trained on brWaC.
+To the best of our knowledge, this is an encoder specifically for this language and variant
 that sets a new state of the art for it, and is made publicly available
 and distributed for reuse.
 
 
-
-It is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
+**Albertina PT-BR No-brWaC** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
 For further details, check the respective [publication](https://arxiv.org/abs/2305.06721):
 
 ``` latex
@@ -101,7 +102,7 @@ We skipped the default filtering of stopwords since it would disrupt the syntact
 
 As codebase, we resorted to the [DeBERTa V2 XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge), for English.
 
-To train [**Albertina PT-PT No-brWac**](https://huggingface.co/PORTULAN/albertina-
+To train [**Albertina PT-BR No-brWaC**](https://huggingface.co/PORTULAN/albertina-ptbr-nobrwac), the data set was tokenized with the original DeBERTa tokenizer with a 128 token sequence truncation and dynamic padding.
 The model was trained using the maximum available memory capacity, resulting in a batch size of 896 samples (56 samples per GPU).
 We chose a learning rate of 1e-5 with linear decay and 10k warm-up steps.
 In total, around 200k training steps were taken across 50 epochs.
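The added line above, together with the unchanged lines that follow it, spells out the pre-training recipe: 128-token truncation with dynamic padding, 56 samples per GPU (896 in total), a 1e-5 learning rate with linear decay and 10k warm-up steps, and roughly 200k steps over 50 epochs. As a rough, non-authoritative sketch of how those hyperparameters map onto the Hugging Face `transformers` Trainer API (the corpus file, output directory, and masked-language-modeling setup below are illustrative assumptions, not details taken from this commit):

```python
# Illustrative sketch only: hyperparameters mirror the README text
# (128-token truncation, dynamic padding, 56 samples per GPU, lr 1e-5
# with linear decay, 10k warm-up steps, 50 epochs); the corpus file,
# output directory, and MLM objective are placeholder assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# DeBERTa V2 XLarge is the codebase named in the README.
checkpoint = "microsoft/deberta-v2-xlarge"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Placeholder corpus; the actual PT-BR data sets are described elsewhere in the README.
raw = load_dataset("text", data_files={"train": "ptbr_corpus.txt"})

def tokenize(batch):
    # Truncate to 128 tokens; padding is deferred to the collator (dynamic padding).
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic padding plus masking for a DeBERTa-style encoder.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="albertina-ptbr-nobrwac-sketch",
    per_device_train_batch_size=56,   # 56 samples per GPU, 896 in total
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    num_train_epochs=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The 896-sample global batch implied by the README corresponds to 56 samples on each of 16 devices (896 / 56 = 16); the sketch leaves the actual multi-GPU launch mechanism unspecified.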