jarodrigues committed on
Commit
82536c8
1 Parent(s): 8750b9c

Update README.md

Files changed (1)
  1. README.md +5 -4
README.md CHANGED
@@ -135,10 +135,11 @@ This involves repurposing the tasks in various ways, such as generation of answe
  <br>

  For further testing of our decoder, in addition to the testing data described above, we also reused some of the datasets that had been resorted to for American Portuguese to test the state-of-the-art Sabiá model and that were originally developed with materials from Portuguese: ASSIN2 RTE (entailment) and ASSIN2 STS (similarity), BLUEX (question answering), ENEM 2022 (question answering) and FaQuAD (extractive question answering).
+
  The scores of Sabiá invite a contrast with Gervásio's, but such a comparison needs to be taken with some caution.
- First, these are a repetition of the scores presented in the respective paper, which only provide results for a single run of each task, while scores of Gervásio are the average of three runs, with different seeds.
- Second, the evaluation methods adopted by Sabiá are *sui generis*, and different from the one's adopted for Gervásio.
- Third, to evaluate Sabiá, the examples included in the few-shot prompt are hand picked, and identical for every test instance in each task.
+ - First, these are a repetition of the scores presented in the respective paper, which reports results for only a single run of each task, while the scores of Gervásio are the average of three runs with different seeds.
+ - Second, the evaluation methods adopted for Sabiá are *sui generis* and different from the ones adopted for Gervásio.
+ - Third, to evaluate Sabiá, the examples included in the few-shot prompt are hand-picked and identical for every test instance in each task.
  To evaluate Gervásio, the examples were randomly selected to be included in the prompts.


@@ -147,7 +148,7 @@ To evaluate Gervásio, the examples were randomly selected to be included in the
  | **Gervásio 7B PT-BR** | 0.1977 | 0.2640 | **0.7469** | **0.2136** |
  | **LLaMA 2** | 0.2458 | 0.2903 | 0.0913 | 0.1034 |
  | **LLaMA 2 Chat** | 0.2231 | 0.2959 | 0.5546 | 0.1750 |
- |--------------------------|----------------------|-----------------|-----------|---------------|
+ ||||||
  | **Sabiá-7B** | **0.6017** | **0.7743** | 0.6847 | 0.1363 |

  <br>
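The updated passage describes Gervásio's evaluation protocol: few-shot demonstrations are drawn at random for each prompt, and reported scores are the average of three runs with different seeds. Below is a minimal sketch of how such a protocol could look; it is an illustrative assumption, not code from the Gervásio or Sabiá repositories, and the helpers `build_prompt`, `generate`, and `metric` are hypothetical placeholders.

```python
import random
import statistics

def build_prompt(demos, test_input):
    # Hypothetical prompt builder: few-shot demonstrations followed by the test input.
    shown = "\n\n".join(f"{d['input']}\n{d['output']}" for d in demos)
    return f"{shown}\n\n{test_input}\n"

def evaluate(generate, metric, demo_pool, test_set, n_shots=3, seeds=(1, 2, 3)):
    """Average a task score over several runs, re-sampling the few-shot
    demonstrations at random for every test instance in every run."""
    run_scores = []
    for seed in seeds:
        rng = random.Random(seed)                    # one seed per run
        predictions = []
        for instance in test_set:
            demos = rng.sample(demo_pool, n_shots)   # random, per-instance demonstrations
            predictions.append(generate(build_prompt(demos, instance["input"])))
        run_scores.append(metric(predictions, [x["output"] for x in test_set]))
    return statistics.mean(run_scores)               # score reported as the mean over seeds

# Toy usage with a dummy "model" and exact-match accuracy.
pool = [{"input": f"q{i}", "output": f"a{i}"} for i in range(10)]
tests = [{"input": "q3", "output": "a3"}, {"input": "q7", "output": "a7"}]
exact_match = lambda preds, gold: sum(p == g for p, g in zip(preds, gold)) / len(gold)
dummy_generate = lambda prompt: "a3"                 # stands in for a real decoder call
print(evaluate(dummy_generate, exact_match, pool, tests))
```

By contrast, the hand-picked, fixed demonstrations described for Sabiá would correspond to sampling `demos` once and reusing them for every test instance.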