jarodrigues
committed on
Update README.md
README.md
CHANGED
@@ -77,13 +77,28 @@ Gervásio-7B-PTPT-Instruct-Decoder is distributed under an [MIT license](https:/
|
|
77 |
|
78 |
# Training Data
|
79 |
|
80 |
-
|
81 |
-
The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature.
|
82 |
-
It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters.
|
83 |
-
Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose metadata indicates the Internet country code top-level domain of Brazil.
|
84 |
-
We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
|
85 |
|
86 |
|
87 |
|
88 |
## Preprocessing
|
89 |
|
|
|
77 |
|
78 |
# Training Data
|
79 |
|
80 |
+
**Gervásio-7B-PTPT-Instruct-Decoder** was trained with standard supervised fine-tuning. To keep some alignment with mainstream benchmarks for English, we used tasks and their respective datasets from the GLUE and SuperGLUE collections.
|
81 |
|
82 |
|
83 |
+
We selected those datasets where the outcome of their machine translation into Portuguese could preserve, in the target language, the linguistic properties at stake.
|
84 |
+
|
85 |
+
From GLUE, we resorted to the following four tasks:
|
86 |
+
- MRPC (paraphrase detection).
|
87 |
+
- RTE (recognizing textual entailment).
|
88 |
+
- STS-B (semantic textual similarity).
|
89 |
+
- WNLI (coreference and natural language inference).
|
90 |
+
|
91 |
+
And from SuperGLUE, we included these other four tasks:
|
92 |
+
- BoolQ (yes/no question answering).
|
93 |
+
- CB (inference with 3 labels).
|
94 |
+
- COPA (reasoning).
|
95 |
+
- MultiRC (question answering).
|
96 |
+
|
97 |
+
|
98 |
+
Instruction templates have been manually crafted for each task.
|
99 |
+
These take the various fields in the dataset and arrange them into a prompt.
|
100 |
+
For instance, "Frase 1:" (Eng. "Sentence 1:") is prepended to the first sentence of each example in the RTE dataset.
|
101 |
+
These templates are listed in full detail in TODO.
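
As a rough illustration of how such a template might arrange dataset fields into a prompt, here is a minimal sketch for RTE. Only the `Frase 1:` label comes from the description above. The `Frase 2:` and `Resposta:` labels, the field names `sentence1` and `sentence2`, the helper `build_prompt`, and the example sentences are all assumptions made for illustration, not the actual templates used for training.

```python
# Hypothetical RTE template: only "Frase 1:" is taken from the README text;
# "Frase 2:" and "Resposta:" are assumed labels added for illustration.
RTE_TEMPLATE = "Frase 1: {sentence1}\nFrase 2: {sentence2}\nResposta:"


def build_prompt(example: dict, template: str = RTE_TEMPLATE) -> str:
    """Arrange the fields of one dataset example into a single prompt string."""
    # Field names are assumed; adapt them to the actual dataset schema.
    return template.format(
        sentence1=example["sentence1"].strip(),
        sentence2=example["sentence2"].strip(),
    )


# Made-up example, not taken from the translated dataset.
example = {
    "sentence1": "O gato dorme no sofá.",
    "sentence2": "Um animal está a dormir.",
    "label": 0,
}

prompt = build_prompt(example)
print(prompt)
```

Keeping the template as a plain format string, separate from the function that fills it, would let each task define its own template while sharing the same formatting logic.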
|
102 |
|
103 |
## Preprocessing
|
104 |
|