Commit 7179ae7
Parent: 5f20240

Adding Evaluation Results (#3)

- Adding Evaluation Results (f4c0ed9c9b1810a470e146d8c930e65f40cad164)


Co-authored-by: Open LLM Leaderboard PR Bot <leaderboard-pr-bot@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +94 -66
README.md CHANGED
@@ -1,23 +1,37 @@
 ---
+language:
+- en
+license: mit
 tags:
 - generated_from_trainer
-license: mit
 datasets:
 - HuggingFaceH4/ultrachat_200k
 - HuggingFaceH4/ultrafeedback_binarized
-language:
-- en
 base_model: mistralai/Mistral-7B-v0.1
 widget:
-- text: "<|system|>\nYou are a pirate chatbot who always responds with Arr!</s>\n<|user|>\nThere's a llama on my lawn, how can I get rid of him?</s>\n<|assistant|>\n"
-  output:
-    text: "Arr! 'Tis a puzzlin' matter, me hearty! A llama on yer lawn be a rare sight, but I've got a plan that might help ye get rid of 'im. Ye'll need to gather some carrots and hay, and then lure the llama away with the promise of a tasty treat. Once he's gone, ye can clean up yer lawn and enjoy the peace and quiet once again. But beware, me hearty, for there may be more llamas where that one came from! Arr!"
+- text: '<|system|>
+
+    You are a pirate chatbot who always responds with Arr!</s>
+
+    <|user|>
+
+    There''s a llama on my lawn, how can I get rid of him?</s>
+
+    <|assistant|>
+
+    '
+  output:
+    text: Arr! 'Tis a puzzlin' matter, me hearty! A llama on yer lawn be a rare sight,
+      but I've got a plan that might help ye get rid of 'im. Ye'll need to gather
+      some carrots and hay, and then lure the llama away with the promise of a tasty
+      treat. Once he's gone, ye can clean up yer lawn and enjoy the peace and quiet
+      once again. But beware, me hearty, for there may be more llamas where that one
+      came from! Arr!
 pipeline_tag: text-generation
 model-index:
 - name: zephyr-7b-beta
   results:
-  # AI2 Reasoning Challenge (25-Shot)
-  - task:
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
@@ -28,15 +42,16 @@ model-index:
       args:
         num_few_shot: 25
     metrics:
-    - type: acc_norm
-      name: normalized accuracy
-      value: 62.03071672354948
+    - type: acc_norm
+      value: 62.03071672354948
+      name: normalized accuracy
+    - type: acc_norm
+      value: 58.28
+      name: normalized accuracy
     source:
-      name: Open LLM Leaderboard
       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceH4/zephyr-7b-beta
-
-  # HellaSwag (10-shot)
-  - task:
+      name: Open LLM Leaderboard
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
@@ -46,15 +61,16 @@ model-index:
       args:
         num_few_shot: 10
     metrics:
-    - type: acc_norm
-      name: normalized accuracy
-      value: 84.35570603465445
+    - type: acc_norm
+      value: 84.35570603465445
+      name: normalized accuracy
+    - type: acc_norm
+      value: 81.0
+      name: normalized accuracy
     source:
-      name: Open LLM Leaderboard
       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceH4/zephyr-7b-beta
-
-  # DROP (3-shot)
-  - task:
+      name: Open LLM Leaderboard
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
@@ -64,15 +80,13 @@ model-index:
       args:
         num_few_shot: 3
     metrics:
-    - type: f1
-      name: f1 score
-      value: 9.662437080536909
+    - type: f1
+      value: 9.66243708053691
+      name: f1 score
     source:
-      name: Open LLM Leaderboard
       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceH4/zephyr-7b-beta
-
-  # TruthfulQA (0-shot)
-  - task:
+      name: Open LLM Leaderboard
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
@@ -83,14 +97,14 @@ model-index:
       args:
         num_few_shot: 0
     metrics:
-    - type: mc2
-      value: 57.44916942762855
+    - type: mc2
+      value: 57.44916942762855
+    - type: mc2
+      value: 46.1
     source:
-      name: Open LLM Leaderboard
       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceH4/zephyr-7b-beta
-
-  # GSM8k (5-shot)
-  - task:
+      name: Open LLM Leaderboard
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
@@ -101,15 +115,16 @@ model-index:
      args:
        num_few_shot: 5
     metrics:
-    - type: acc
-      name: accuracy
-      value: 12.736921910538287
+    - type: acc
+      value: 12.736921910538287
+      name: accuracy
+    - type: acc
+      value: 13.04
+      name: accuracy
     source:
-      name: Open LLM Leaderboard
       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceH4/zephyr-7b-beta
-
-  # MMLU (5-Shot)
-  - task:
+      name: Open LLM Leaderboard
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
@@ -120,15 +135,16 @@ model-index:
       args:
         num_few_shot: 5
     metrics:
-    - type: acc
-      name: accuracy
-      value: 61.07
+    - type: acc
+      value: 61.07
+      name: accuracy
+    - type: acc
+      value: 53.57
+      name: accuracy
     source:
-      name: Open LLM Leaderboard
       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceH4/zephyr-7b-beta
-
-  # Winogrande (5-shot)
-  - task:
+      name: Open LLM Leaderboard
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
@@ -139,38 +155,37 @@ model-index:
       args:
         num_few_shot: 5
     metrics:
-    - type: acc
-      name: accuracy
-      value: 77.74269928966061
+    - type: acc
+      value: 77.7426992896606
+      name: accuracy
+    - type: acc
+      value: 74.74
+      name: accuracy
     source:
-      name: Open LLM Leaderboard
       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceH4/zephyr-7b-beta
-
-  # AlpacaEval (taken from model card)
-  - task:
+      name: Open LLM Leaderboard
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
       name: AlpacaEval
      type: tatsu-lab/alpaca_eval
     metrics:
-    - type: unknown
-      name: win rate
-      value: 0.9060
+    - type: unknown
+      value: 0.906
+      name: win rate
     source:
       url: https://tatsu-lab.github.io/alpaca_eval/
-
-  # MT-Bench (taken from model card)
-  - task:
+  - task:
       type: text-generation
       name: Text Generation
     dataset:
       name: MT-Bench
       type: unknown
     metrics:
-    - type: unknown
-      name: score
-      value: 7.34
+    - type: unknown
+      value: 7.34
+      name: score
     source:
       url: https://huggingface.co/spaces/lmsys/mt-bench
 ---
@@ -407,4 +422,17 @@ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-le
 | TruthfulQA (0-shot) | 57.45 |
 | Winogrande (5-shot) | 77.74 |
 | GSM8K (5-shot) | 12.74 |
-| DROP (3-shot) | 9.66 |
+| DROP (3-shot) | 9.66 |
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_CallComply__zephyr-7b-beta-128k)
+
+| Metric |Value|
+|---------------------------------|----:|
+|Avg. |54.45|
+|AI2 Reasoning Challenge (25-Shot)|58.28|
+|HellaSwag (10-Shot) |81.00|
+|MMLU (5-Shot) |53.57|
+|TruthfulQA (0-shot) |46.10|
+|Winogrande (5-shot) |74.74|
+|GSM8k (5-shot) |13.04|
+
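The table appended at the end of the new README reports an overall score next to the six per-benchmark results. As a quick sanity check, the `Avg.` row is simply the arithmetic mean of those six scores; a minimal Python sketch (values hard-coded from the table above, not part of the commit itself):

```python
# Sanity check: the "Avg." row in the appended leaderboard table is the
# arithmetic mean of the six benchmark scores listed below it.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 58.28,
    "HellaSwag (10-Shot)": 81.00,
    "MMLU (5-Shot)": 53.57,
    "TruthfulQA (0-shot)": 46.10,
    "Winogrande (5-shot)": 74.74,
    "GSM8k (5-shot)": 13.04,
}

average = sum(scores.values()) / len(scores)
print(f"computed mean: {average:.3f}")  # ~54.455, reported as 54.45 in the table
assert abs(average - 54.45) < 0.01
```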
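Separately, the `widget.text` prompt near the top of the YAML (reformatted in this commit from an escaped string into a block scalar) appears to be the Zephyr chat format written out by hand. A hedged sketch, not part of this commit, of how the same string can be produced with `transformers`; the upstream `HuggingFaceH4/zephyr-7b-beta` tokenizer is an assumption here, since this repository's own tokenizer does not appear in the diff:

```python
# Reproduce the "<|system|> ... <|user|> ... <|assistant|>" prompt from the
# widget.text field via the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # assumed source repo

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds with Arr!"},
    {"role": "user", "content": "There's a llama on my lawn, how can I get rid of him?"},
]

# tokenize=False returns the rendered prompt string; add_generation_prompt=True
# appends the trailing "<|assistant|>\n" so the model continues as the assistant.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```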