chiliu committed
Commit b83fa45
1 Parent(s): c897e70
Files changed (1)
  1. README.md +24 -24
README.md CHANGED
@@ -175,30 +175,30 @@ We evaluated OpenLLaMA on a wide range of tasks using [lm-evaluation-harness](ht
  The original LLaMA model was trained for 1 trillion tokens and GPT-J was trained for 500 billion tokens. We present the results in the table below. OpenLLaMA exhibits comparable performance to the original LLaMA and GPT-J across a majority of tasks, and outperforms them in some tasks.


- | **Task/Metric** | Mamba-GPT 3B | LLaMA 7B | OpenLLaMA 7B | OpenLLaMA 3B | OpenLLaMA 13B 600BT |
- | ---------------------- | -------- | -------- | ------------ | ------------ | ------------------- |
- | anli_r1/acc | **0.35** | 0.35 | 0.33 | 0.33 | 0.33 |
- | anli_r2/acc | 0.33 | 0.34 | 0.36 | 0.32 | 0.35 |
- | anli_r3/acc | 0.35 | 0.37 | 0.38 | 0.35 | 0.38 |
- | arc_challenge/acc | 0.35 | 0.39 | 0.37 | 0.34 | 0.39 |
- | arc_challenge/acc_norm | 0.37 | 0.41 | 0.38 | 0.37 | 0.42 |
- | arc_easy/acc | 0.71 | 0.68 | 0.72 | 0.69 | 0.74 |
- | arc_easy/acc_norm | 0.65 | 0.52 | 0.68 | 0.65 | 0.70 |
- | boolq/acc | **0.72** | 0.56 | 0.53 | 0.66 | 0.71 |
- | hellaswag/acc | 0.49 | 0.36 | 0.63 | 0.43 | 0.54 |
- | hellaswag/acc_norm | 0.66 | 0.73 | 0.72 | 0.67 | 0.73 |
- | openbookqa/acc | 0.26 | 0.29 | 0.30 | 0.27 | 0.30 |
- | openbookqa/acc_norm | 0.40 | 0.41 | 0.40 | 0.40 | 0.41 |
- | piqa/acc | 0.76 | 0.78 | 0.76 | 0.75 | 0.77 |
- | piqa/acc_norm | 0.76 | 0.78 | 0.77 | 0.76 | 0.78 |
- | record/em | 0.88 | 0.91 | 0.89 | 0.88 | 0.90 |
- | record/f1 | 0.88 | 0.91 | 0.90 | 0.89 | 0.90 |
- | rte/acc | 0.55 | 0.56 | 0.60 | 0.58 | 0.65 |
- | truthfulqa_mc/mc1 | **0.27** | 0.21 | 0.23 | 0.22 | 0.22 |
- | truthfulqa_mc/mc2 | **0.37** | 0.34 | 0.35 | 0.35 | 0.35 |
- | wic/acc | 0.49 | 0.50 | 0.51 | 0.48 | 0.49 |
- | winogrande/acc | 0.63 | 0.68 | 0.67 | 0.62 | 0.67 |
- | Average | 0.53 | 0.53 | 0.55 | 0.52 | 0.56 |
+ | **Task/Metric** | finetuned-GPT 3B | OpenLLaMA 3B |
+ | ---------------------- | ---------------- | ------------ |
+ | anli_r1/acc | **0.35** | 0.33 |
+ | anli_r2/acc | **0.33** | 0.32 |
+ | anli_r3/acc | 0.35 | 0.35 |
+ | arc_challenge/acc | **0.35** | 0.34 |
+ | arc_challenge/acc_norm | 0.37 | 0.37 |
+ | arc_easy/acc | **0.71** | 0.69 |
+ | arc_easy/acc_norm | 0.65 | 0.65 |
+ | boolq/acc | **0.72** | 0.66 |
+ | hellaswag/acc | **0.49** | 0.43 |
+ | hellaswag/acc_norm | 0.66 | 0.67 |
+ | openbookqa/acc | 0.26 | 0.27 |
+ | openbookqa/acc_norm | 0.40 | 0.40 |
+ | piqa/acc | **0.76** | 0.75 |
+ | piqa/acc_norm | 0.76 | 0.76 |
+ | record/em | 0.88 | 0.88 |
+ | record/f1 | 0.88 | 0.89 |
+ | rte/acc | 0.55 | 0.58 |
+ | truthfulqa_mc/mc1 | **0.27** | 0.22 |
+ | truthfulqa_mc/mc2 | **0.37** | 0.35 |
+ | wic/acc | **0.49** | 0.48 |
+ | winogrande/acc | **0.63** | 0.62 |
+ | Average | **0.53** | 0.52 |


  We removed the task CB and WSC from our benchmark, as our model performs suspiciously well on these two tasks. We hypothesize that there could be a benchmark data contamination in the training set.
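For reference, scores like these are typically produced with the lm-evaluation-harness cited in the hunk header above. Below is a minimal reproduction sketch, not part of this commit: the model ID, dtype, and batch size are placeholder assumptions, and it uses the harness's current task spellings (older releases, as used for this table, registered TruthfulQA as a single `truthfulqa_mc` task).

```python
# Minimal sketch (not part of this commit), assuming EleutherAI's
# lm-evaluation-harness v0.4+ ("pip install lm-eval").
# YOUR_MODEL_ID is a placeholder, not a real checkpoint name.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # load the checkpoint via Hugging Face transformers
    model_args="pretrained=YOUR_MODEL_ID,dtype=float16",
    tasks=[
        "anli_r1", "anli_r2", "anli_r3",
        "arc_challenge", "arc_easy", "boolq", "hellaswag",
        "openbookqa", "piqa", "record", "rte",
        "truthfulqa_mc1", "truthfulqa_mc2", "wic", "winogrande",
    ],
    batch_size=8,
)

# Per-task metrics (acc, acc_norm, em, f1, ...) that populate the table above.
for task, metrics in results["results"].items():
    print(task, metrics)
```

A full pass over all of these tasks on a 3B model is slow; the harness also accepts a `limit` argument to subsample each task for a quick smoke test before a complete run.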