Commit 387593d — jarodrigues committed "Update README.md" (parent: 91201e7)

README.md CHANGED
@@ -25,27 +25,33 @@ widget:
 
 ---
 <img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png">
-<p style="text-align: center;"> This is the model card for Albertina
+<p style="text-align: center;"> This is the model card for Albertina 100M PTBR.
 You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Gervásio (decoders) families</a>.
 </p>
 
 ---
 
-# Albertina
+# Albertina PTBR base
 
-**Albertina
+**Albertina 100M PTBR** is a foundation, large language model for American **Portuguese** from **Brazil**.
 
 It is an **encoder** of the BERT family, based on the Transformer neural architecture and
 developed over the DeBERTa model, with highly competitive performance for this language.
 It is distributed free of charge and under a most permissive license.
 
-
-
-
-
-
+| Albertina's Family of Models                                                                              |
+|------------------------------------------------------------------------------------------------------------|
+| [**Albertina 1.5B PTPT**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptpt-encoder)            |
+| [**Albertina 1.5B PTBR**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptbr-encoder)            |
+| [**Albertina 1.5B PTPT 256**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptpt-encoder-256)    |
+| [**Albertina 1.5B PTBR 256**](https://huggingface.co/PORTULAN/albertina-1b5-portuguese-ptbr-encoder-256)    |
+| [**Albertina 900M PTPT**](https://huggingface.co/PORTULAN/albertina-900m-portuguese-ptpt-encoder)           |
+| [**Albertina 900M PTBR**](https://huggingface.co/PORTULAN/albertina-900m-portuguese-ptbr-encoder)           |
+| [**Albertina 100M PTPT**](https://huggingface.co/PORTULAN/albertina-100m-portuguese-ptpt-encoder)           |
+| [**Albertina 100M PTBR**](https://huggingface.co/PORTULAN/albertina-100m-portuguese-ptbr-encoder)           |
 
-
+
+**Albertina 100M PTBR base** is developed by a joint team from the University of Lisbon and the University of Porto, Portugal.
 For further details, check the respective [publication](https://arxiv.org/abs/?):
 
 
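Since the model is a masked-language encoder, the most direct way to try it is masked-token prediction. A minimal sketch with the Hugging Face `transformers` library, assuming the `PORTULAN/albertina-100m-portuguese-ptbr-encoder` checkpoint from the family table above; the example sentence is only illustrative:

```python
# Minimal sketch: masked-token prediction with the 100M PTBR encoder.
# Checkpoint id taken from the family table above; the sentence is illustrative.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="PORTULAN/albertina-100m-portuguese-ptbr-encoder")

# DeBERTa-style tokenizers mark the blank with the [MASK] token.
for prediction in unmasker("A culinária brasileira é muito [MASK]."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```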
@@ -68,7 +74,7 @@ Please use the above canonical reference when using or citing this model.
 
 # Model Description
 
-**This model card is for Albertina
+**This model card is for Albertina 100M PTBR**, with 100M parameters, 12 layers and a hidden size of 768.
 
 Albertina-PT-BR base is distributed under an [MIT license](https://huggingface.co/PORTULAN/albertina-ptpt/blob/main/LICENSE).
 
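The figures quoted above (100M parameters, 12 layers, hidden size 768) are consistent with the DeBERTa V1 base codebase named later in the card, and can be read directly off the published checkpoint. A small sketch, again assuming the checkpoint id from the family table:

```python
# Small sketch: read the architecture figures quoted above from the
# checkpoint configuration (values shown as expectations, not guarantees).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("PORTULAN/albertina-100m-portuguese-ptbr-encoder")
print(config.model_type)         # expected: "deberta"
print(config.num_hidden_layers)  # expected: 12
print(config.hidden_size)        # expected: 768
```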
@@ -80,7 +86,7 @@ DeBERTa is distributed under an [MIT license](https://github.com/microsoft/DeBER
 # Training Data
 
 
-[**Albertina
+[**Albertina 100M PTBR**](https://huggingface.co/PORTULAN/albertina-ptbr-base) was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set.
 The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature. It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters.
 Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose metadata indicates the Internet country code top-level domain of Brazil. We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
 
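The variant filtering described above boils down to keeping only documents whose source URL falls under Brazil's country-code top-level domain. A rough sketch of that idea with the `datasets` library; the metadata field names are assumptions about the OSCAR 23.01 document layout, and the data set itself is gated on the Hub:

```python
# Rough sketch of the ".br" top-level-domain filter described above.
# The metadata field names below are assumptions about the OSCAR 23.01
# document layout; access to the data set is gated on the Hugging Face Hub.
from urllib.parse import urlparse

from datasets import load_dataset

oscar_pt = load_dataset(
    "oscar-corpus/OSCAR-2301",
    "pt",                 # Portuguese subset, variants still mixed
    split="train",
    streaming=True,
    trust_remote_code=True,
)

def from_brazilian_domain(document):
    # Assumed location of the crawled page URL in the document metadata.
    url = document["meta"]["warc_headers"]["warc-target-uri"]
    host = urlparse(url).hostname or ""
    return host.endswith(".br")

ptbr_documents = oscar_pt.filter(from_brazilian_domain)
```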
@@ -96,7 +102,7 @@ We skipped the default filtering of stopwords since it would disrupt the syntact
 As the codebase, we resorted to [DeBERTa V1 base](https://huggingface.co/microsoft/deberta-base) for English.
 
 
-To train [**Albertina
+To train [**Albertina 100M PTBR**](https://huggingface.co/PORTULAN/albertina-ptbr-base), the data set was tokenized with the original DeBERTa tokenizer with 128-token sequence truncation and dynamic padding.
 The model was trained using the maximum available memory capacity, resulting in a batch size of 3072 samples (192 samples per GPU).
 We opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps.
 The model was trained for a total of 150 training epochs, resulting in approximately 180k steps.
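A condensed sketch of the preprocessing and optimisation settings just listed (128-token truncation, dynamic padding, learning rate 1e-5 with linear decay and 10k warm-up steps, 192 samples per device), written against the generic `transformers` Trainer API; the authors' actual training code is not part of this card, so treat this only as an approximation of the described setup:

```python
# Approximation of the described setup with the generic Trainer API:
# 128-token truncation, dynamic padding, masked-language objective,
# lr 1e-5 with linear decay, 10k warm-up steps, 150 epochs, 192 samples per device.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

codebase = "microsoft/deberta-base"  # DeBERTa V1 base, as named above
tokenizer = AutoTokenizer.from_pretrained(codebase)
model = AutoModelForMaskedLM.from_pretrained(codebase)

def tokenize(batch):
    # Truncate to 128 tokens; padding is applied dynamically per batch below.
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Dynamic padding plus random masking for the masked-language objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)

args = TrainingArguments(
    output_dir="albertina-100m-ptbr-pretraining",
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    num_train_epochs=150,
    per_device_train_batch_size=192,  # 3072 samples total across 16 GPUs
)

# With a tokenized corpus in hand, training would then be launched via
# Trainer(model=model, args=args, data_collator=collator, ...).train()
```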
@@ -107,7 +113,7 @@ The model was trained for one day on a2-megagpu-16gb Google Cloud A2 VMs with 16
 
 # Evaluation
 
-The base model versions was evaluated on downstream tasks, namely the translations into
+The base model version was evaluated on downstream tasks, namely the translations into PTBR of the English data sets used for a few of the tasks in the widely used [GLUE benchmark](https://huggingface.co/datasets/glue).
 
 
 ## GLUE tasks translated
@@ -120,9 +126,9 @@ We address four tasks from those in PLUE, namely:
 
 | Model                            | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1)  | STS-B (Pearson) |
 |----------------------------------|----------------|-----------------|------------|-----------------|
-| **Albertina
-| **Albertina
-| **Albertina
+| **Albertina 900M PTBR No-brWaC** | **0.7798**     | 0.5070          | **0.9167** | 0.8743          |
+| **Albertina 900M PTBR**          | 0.7545         | 0.4601          | 0.9071     | **0.8910**      |
+| **Albertina 100M PTBR**          | 0.6462         | **0.5493**      | 0.8779     | 0.8501          |
 
 
 <br>
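For context, scores like those in the table are obtained by fine-tuning the encoder on each translated task. A bare-bones sketch for an MRPC-style sentence-pair task; the data set path, column names and fine-tuning hyper-parameters are placeholders, since the exact PTBR translations and fine-tuning recipe are only referred to, not given, in this excerpt:

```python
# Bare-bones sketch of sentence-pair fine-tuning in the style of the MRPC
# column above. Data set path, column names and hyper-parameters are
# placeholders; the card does not link the PTBR-translated GLUE data here.
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "PORTULAN/albertina-100m-portuguese-ptbr-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("path/to/ptbr-translated-mrpc")  # placeholder

def preprocess(batch):
    # Placeholder column names for the two sentences of each pair.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True, max_length=128)

encoded = dataset.map(preprocess, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # MRPC is conventionally reported with F1, as in the table above.
    tp = np.sum((predictions == 1) & (labels == 1))
    fp = np.sum((predictions == 1) & (labels == 0))
    fn = np.sum((predictions == 0) & (labels == 1))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"f1": f1}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="albertina-mrpc-ptbr", num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```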