davda54 committed
Commit 6e965ca
1 Parent(s): 7d39a24

Added Mistral results

Files changed (1): README.md (+56, -41)

README.md CHANGED
@@ -84,38 +84,7 @@ The user should perform evaluation for their particular model application scenario.
  The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is 7.43 and the final training perplexity is 4.76.

  Our initial downstream evaluation is conducted on reading comprehension, sentiment analysis and machine translation tasks using open-source peer-reviewed datasets and benchmarks in native Norwegian.
- We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b).
-
-
- ### Reading comprehension
-
- [NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
-
- <details>
- <summary>Method</summary>
-
- * Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
- * Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```
- * Few-shot results show the average scores across 5 repetitions
- * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
- * Performance metrics: macro-averaged F1-score and exact match (EM).
-
- </details>
-
- <details open>
- <summary>Performance results on the extractive question answering task (NorQuAD)</summary>
-
- |Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
- |---|---|---|---|
- |NorMistral-7b-warm|**48.6**/**24.8**|**63.6**/**40.0**|**66.5**/43.8|
- |NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
- |NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
- |NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
- |Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
- |GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
- |GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/**44.5**|
-
- </details>
+ We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b); we also include evaluation of [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).


  ### Sentiment analysis
@@ -127,7 +96,7 @@ We use the binary formulation of this task (positive vs. negative).
  <summary>Method</summary>

  * Evaluation setting: zero-shot and few-shot perplexity-based evaluation.
- * Prompt: ```"Tekst: {text}\nSentiment:{label}"```, where the ```label``` is either "positiv" or "negativ".
+ * Prompt: ```"Tekst: {text}\nSentiment:{label}"```, where the ```label``` is either "positiv" or "negativ". Based on [Brown et al. (2020)](https://arxiv.org/abs/2005.14165).
  * Few-shot results show the average scores across 5 repetitions
  * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/sentiment_analysis.py
  * Performance metric: macro-averaged F1-score.
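The perplexity-based evaluation in the Method block above scores each candidate label by the likelihood the model assigns to it after the prompt, rather than generating free text. A minimal sketch of that idea with Hugging Face `transformers`; the model id `norallm/normistral-7b-warm`, the toy example text, and the `label_loss` helper are illustrative assumptions, not the released `sentiment_analysis.py`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model id assumed here for illustration.
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-7b-warm", torch_dtype=torch.bfloat16)
model.eval()

def label_loss(text: str, label: str) -> float:
    # Mirrors the prompt template "Tekst: {text}\nSentiment:{label}" and
    # computes the average negative log-likelihood of the label tokens only.
    prompt_ids = tokenizer(f"Tekst: {text}\nSentiment:", return_tensors="pt").input_ids
    label_ids = tokenizer(label, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, label_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt, score only the label
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

text = "Denne filmen var en stor skuffelse."  # toy example
prediction = min(["positiv", "negativ"], key=lambda label: label_loss(text, label))
print(prediction)
```

Picking the label with the lowest average negative log-likelihood avoids free-form generation entirely, which is why this setting needs no decoding strategy.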
@@ -143,12 +112,48 @@ We use the binary formulation of this task (positive vs. negative).
  |NorMistral-7b-scratch|47.3|62.2|80.1|
  |NorBLOOM-7b|**75.7**|73.8|65.5|
  |NB-GPT-J|48.4|56.5|65.2|
- |Falcon-7B|53.3|61.6|74.9|
  |GPT-Sw3-6.7B|61.5|72.2|76.5|
  |GPT-Sw3-6.7B-v2|42.4|69.1|83.4|
+ |Falcon-7B|53.3|61.6|74.9|
+ |Mistral-7B-v0.1|70.2|72.9|84.8|

  </details>

+
+
+ ### Reading comprehension
+
+ [NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
+
+ <details>
+ <summary>Method</summary>
+
+ * Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
+ * Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```
+ * Few-shot results show the average scores across 5 repetitions
+ * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
+ * Performance metrics: macro-averaged F1-score and exact match (EM).
+
+ </details>
+
+ <details open>
+ <summary>Performance results on the extractive question answering task (NorQuAD)</summary>
+
+ |Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
+ |---|---|---|---|
+ |NorMistral-7b-warm|**48.6**/**24.8**|63.6/40.0|66.5/43.8|
+ |NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
+ |NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
+ |NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
+ |GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
+ |GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/44.5|
+ |Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
+ |Mistral-7B-v0.1|46.4/22.4|**64.9**/**41.1**|**71.7**/**49.4**|
+
+ </details>
+
+
+
  ### Machine translation

  [Tatoeba](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) [(Tiedemann, 2020)](https://aclanthology.org/2020.wmt-1.139/) is a benchmark for machine translation, which includes hundreds of language pairs. We consider six language pairs (English <-> Bokmål, English <-> Nynorsk, and Bokmål <-> Nynorsk).
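For the reading-comprehension setup added in the hunk above (the NorQuAD prompt filled in and the answer produced with greedy decoding), the generation step could look roughly like this; a sketch only, assuming the model id `norallm/normistral-7b-warm` and a toy passage, not the released `norquad.py`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model id assumed here for illustration.
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-7b-warm")

# Zero-shot NorQuAD-style prompt; greedy decoding (do_sample=False), as in the Method block.
prompt = (
    "Tittel: {title}\n\n"
    "Tekst: {text}\n\n"
    "Spørsmål: {question}\n\n"
    "Svar:"
).format(
    title="Oslo",
    text="Oslo er Norges hovedstad og landets mest folkerike by.",
    question="Hva er Norges hovedstad?",
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
answer = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer.strip().split("\n")[0])  # keep only the first generated line as the answer
```

The generated answer would then be scored against the gold span with token-level F1 and exact match, the two metrics reported in the table.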
@@ -157,7 +162,7 @@ We use the binary formulation of this task (positive vs. negative).
  <summary>Method</summary>

  * Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
- * Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```.
+ * Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```. Based on [Garcia et al. (2023)](https://arxiv.org/abs/2302.01398).
  * Few-shot results show the average scores across 5 repetitions
  * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/machine_translation.py
  * Performance metrics: BLEU ([Papineni et al., 2002](https://aclanthology.org/P02-1040/)) and chrF++ ([Popović, 2015](https://aclanthology.org/W15-3049/)).
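A sketch of how the translation prompt and the BLEU/chrF++ scoring fit together, assuming the model id `norallm/normistral-7b-warm` and using `sacrebleu` for the metrics; the sentence pair is a toy example and the `translate` helper is illustrative, not the released `machine_translation.py`:

```python
import sacrebleu
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model id assumed here for illustration.
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-7b-warm")

def translate(source_text: str, source_language: str, target_language: str) -> str:
    # Zero-shot translation prompt from the Method block, decoded greedily.
    prompt = f"{source_language}: {source_text}\n{target_language}:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    completion = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return completion.strip().split("\n")[0]  # keep only the first generated line

# Toy sentence pair; the actual evaluation iterates over the Tatoeba test sets.
hypotheses = [translate("I love languages.", "Engelsk", "Bokmål")]
references = [["Jeg elsker språk."]]

print(sacrebleu.corpus_bleu(hypotheses, references).score)                # BLEU
print(sacrebleu.corpus_chrf(hypotheses, references, word_order=2).score)  # chrF++
```

In `sacrebleu`, setting `word_order=2` on the chrF metric yields chrF++, i.e. character n-grams plus word unigrams and bigrams.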
@@ -173,9 +178,11 @@ We use the binary formulation of this task (positive vs. negative).
  |NorMistral-7b-scratch|46.4/62.9|50.4/66.3|52.1/67.6|
  |NorBLOOM-7b|37.1/53.6|50.1/65.8|52.0/67.6|
  |NB-GPT-J|8.6/39.1|35.9/64.5|47.2/68.7|
- |Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
  |GPT-Sw3-6.7B|21.8/55.2|54.5/69.6|**58.6**/**73.2**|
  |GPT-Sw3-6.7B-v2|20.6/53.2|51.2/66.6|58.4/73.0|
+ |Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
+ |Mistral-7B-v0.1|32.5/51.9|35.4/55.1|36.3/56.0|
+

  </details>

@@ -188,9 +195,11 @@ We use the binary formulation of this task (positive vs. negative).
  |NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
  |NorBLOOM-7b|35.6/54.7|36.6/56.3|38.1/57.4|
  |NB-GPT-J|1.7/14.7|6.3/34.1|35.2/60.4|
- |Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
  |GPT-Sw3-6.7B|13.4/44.3|43.6/62.5|**44.5**/63.5|
  |GPT-Sw3-6.7B-v2|14.8/45.5|43.7/62.3|44.0/63.6|
+ |Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
+ |Mistral-7B-v0.1|11.6/35.7|13.5/38.7|15.0/40.0|
+

  </details>

@@ -204,9 +213,11 @@ We use the binary formulation of this task (positive vs. negative).
  |NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
  |NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
  |NB-GPT-J|9.8/41.4|24.8/58.3|47.6/67.7|
- |Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
  |GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
  |GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+ |Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
+ |Mistral-7B-v0.1|53.8/68.2|54.6/69.0|56.9/70.7|
+

  </details>

@@ -219,9 +230,10 @@ We use the binary formulation of this task (positive vs. negative).
  |NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
  |NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
  |NB-GPT-J|2.9/19.5|10.1/41.0|44.4/66.9|
- |Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
  |GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
  |GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+ |Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
+ |Mistral-7B-v0.1|40.7/57.1|46.2/60.7|49.9/63.8|

  </details>

@@ -235,9 +247,11 @@ We use the binary formulation of this task (positive vs. negative).
  |NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
  |NorBLOOM-7b|71.5/84.4|70.1/84.1|71.9/85.1|
  |NB-GPT-J|6.6/35.5|9.6/41.0|26.0/64.7|
- |Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
  |GPT-Sw3-6.7B|63.6/82.8|74.7/86.0|75.8/86.9|
  |GPT-Sw3-6.7B-v2|57.5/81.1|**75.3**/86.7|**76.7**/**87.6**|
+ |Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
+ |Mistral-7B-v0.1|32.0/62.2|32.9/62.6|35.2/63.9|
+

  </details>

@@ -250,9 +264,10 @@ We use the binary formulation of this task (positive vs. negative).
  |NorMistral-7b-scratch|85.1/91.4|86.6/92.4|87.4/93.0|
  |NorBLOOM-7b|78.7/88.5|84.2/90.7|87.4/93.0|
  |NB-GPT-J|2.7/18.5|6.9/35.6|52.9/84.3|
- |Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
  |GPT-Sw3-6.7B|652.3/82.4|86.1/92.5|87.8/93.6|
  |GPT-Sw3-6.7B-v2|72.0/88.6|86.1/92.5|88.2/93.9|
+ |Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
+ |Mistral-7B-v0.1|57.0/74.8|59.9/77.5|62.6/79.1|

  </details>

 