alexmarques committed on
Commit 0906e2c
1 Parent(s): abca298

Update README.md

Files changed (1)
  1. README.md +182 -26
README.md CHANGED
@@ -32,8 +32,9 @@ base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
32
  - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
33
  - **Model Developers:** Neural Magic
34
 
35
- Quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
36
- It achieves an average score of 73.56 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.79.
 
37
 
38
  ### Model Optimizations
39
 
@@ -137,20 +138,29 @@ oneshot(
137
 
138
  ## Evaluation
139
 
140
- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
141
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
142
- This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
143
 
144
  ### Accuracy
145
 
146
- #### Open LLM Leaderboard evaluation scores
147
  <table>
148
  <tr>
149
  <td><strong>Benchmark</strong>
150
  </td>
151
  <td><strong>Meta-Llama-3.1-8B-Instruct </strong>
152
  </td>
153
- <td><strong>Meta-Llama-3.1-8B-Instruct-FP8-dynamic(this model)</strong>
154
  </td>
155
  <td><strong>Recovery</strong>
156
  </td>
@@ -165,12 +175,26 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
165
  <td>100.1%
166
  </td>
167
  </tr>
168
  <tr>
169
  <td>MMLU-cot (0-shot)
170
  </td>
171
- <td>71.24
172
  </td>
173
- <td>71.64
174
  </td>
175
  <td>100.5%
176
  </td>
@@ -178,19 +202,19 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
178
  <tr>
179
  <td>ARC Challenge (0-shot)
180
  </td>
181
- <td>82.00
182
  </td>
183
- <td>81.23
184
  </td>
185
- <td>99.06%
186
  </td>
187
  </tr>
188
  <tr>
189
  <td>GSM-8K-cot (8-shot, strict-match)
190
  </td>
191
- <td>81.96
192
  </td>
193
- <td>82.03
194
  </td>
195
  <td>100.0%
196
  </td>
@@ -198,21 +222,21 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
198
  <tr>
199
  <td>Hellaswag (10-shot)
200
  </td>
201
- <td>80.46
202
  </td>
203
- <td>80.04
204
  </td>
205
- <td>99.48%
206
  </td>
207
  </tr>
208
  <tr>
209
  <td>Winogrande (5-shot)
210
  </td>
211
- <td>78.45
212
  </td>
213
- <td>77.66
214
  </td>
215
- <td>98.99%
216
  </td>
217
  </tr>
218
  <tr>
@@ -220,19 +244,117 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
220
  </td>
221
  <td>54.5
222
  </td>
223
- <td>54.28
224
  </td>
225
- <td>99.60%
226
  </td>
227
  </tr>
228
  <tr>
229
  <td><strong>Average</strong>
230
  </td>
231
- <td><strong>73.79</strong>
232
  </td>
233
- <td><strong>73.56</strong>
234
  </td>
235
- <td><strong>99.70%</strong>
236
  </td>
237
  </tr>
238
  </table>
@@ -313,4 +435,38 @@ lm_eval \
313
  --tasks truthfulqa \
314
  --num_fewshot 0 \
315
  --batch_size auto
316
- ```
32
  - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
33
  - **Model Developers:** Neural Magic
34
 
35
+ This model is a quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
36
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
37
+ Meta-Llama-3.1-8B-Instruct-FP8-dynamic achieves 105.4% recovery for the Arena-Hard evaluation, 99.7% for OpenLLM v1 (using Meta's prompting when available), 101.2% for OpenLLM v2, 100.0% for HumanEval pass@1, and 101.0% for HumanEval+ pass@1.
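+
+ Here "recovery" is the quantized model's score expressed as a percentage of the unquantized baseline. A minimal sketch of the arithmetic, using the OpenLLM v1 averages from the table below (table values are rounded, so recomputed percentages may differ slightly in the last digit):
+ ```python
+ # Recovery = quantized score / baseline score, as a percentage.
+ baseline_avg = 73.8    # OpenLLM v1 average, Meta-Llama-3.1-8B-Instruct
+ quantized_avg = 73.6   # OpenLLM v1 average, this FP8-dynamic model
+ print(f"Recovery: {100 * quantized_avg / baseline_avg:.1f}%")  # ~99.7%
+ ```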
38
 
39
  ### Model Optimizations
40
 
 
138
 
139
  ## Evaluation
140
 
141
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
142
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
143
+
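+ As an illustration of how outputs can be generated with vLLM (the prompt and sampling settings below are placeholders, not the exact per-benchmark settings):
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ llm = LLM(model=model_id, max_model_len=4096)
+
+ # Apply the Llama 3.1 chat template, then generate greedily.
+ prompt = tokenizer.apply_chat_template(
+     [{"role": "user", "content": "Briefly explain FP8 dynamic quantization."}],
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=256))
+ print(outputs[0].outputs[0].text)
+ ```
+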
144
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
145
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
146
+ We report below the scores obtained in each judgement and the average.
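+ In the table below, these are shown as the average followed by the two individual judgement scores in parentheses; e.g., for the unquantized model, (25.1 + 26.5) / 2 = 25.8.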
147
+
148
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
149
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
150
+
151
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
152
+
153
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
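+ For example, the Arena-Hard outputs can be inspected with the `datasets` library (repository id as linked above; the default dataset configuration is assumed here):
+ ```python
+ from datasets import load_dataset
+
+ # Pull the raw Arena-Hard generations and judgements published alongside this model.
+ arena_hard = load_dataset("neuralmagic/quantized-llama-3.1-arena-hard-evals")
+ print(arena_hard)
+ ```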
154
 
155
  ### Accuracy
156
 
 
157
  <table>
158
  <tr>
159
  <td><strong>Benchmark</strong>
160
  </td>
161
  <td><strong>Meta-Llama-3.1-8B-Instruct </strong>
162
  </td>
163
+ <td><strong>Meta-Llama-3.1-8B-Instruct-FP8-dynamic (this model)</strong>
164
  </td>
165
  <td><strong>Recovery</strong>
166
  </td>
 
175
  <td>100.1%
176
  </td>
177
  </tr>
178
+ <tr>
179
+ <td><strong>Arena Hard</strong>
180
+ </td>
181
+ <td>25.8 (25.1 / 26.5)
182
+ </td>
183
+ <td>27.2 (27.4 / 27.0)
184
+ </td>
185
+ <td>105.4%
186
+ </td>
187
+ </tr>
188
+ <tr>
189
+ <td><strong>OpenLLM v1</strong>
190
+ </td>
191
+ </tr>
192
  <tr>
193
  <td>MMLU-cot (0-shot)
194
  </td>
195
+ <td>71.2
196
  </td>
197
+ <td>71.6
198
  </td>
199
  <td>100.5%
200
  </td>
 
202
  <tr>
203
  <td>ARC Challenge (0-shot)
204
  </td>
205
+ <td>82.0
206
  </td>
207
+ <td>81.2
208
  </td>
209
+ <td>99.1%
210
  </td>
211
  </tr>
212
  <tr>
213
  <td>GSM-8K-cot (8-shot, strict-match)
214
  </td>
215
+ <td>82.0
216
  </td>
217
+ <td>82.0
218
  </td>
219
  <td>100.0%
220
  </td>
 
222
  <tr>
223
  <td>Hellaswag (10-shot)
224
  </td>
225
+ <td>80.5
226
  </td>
227
+ <td>80.0
228
  </td>
229
+ <td>99.5%
230
  </td>
231
  </tr>
232
  <tr>
233
  <td>Winogrande (5-shot)
234
  </td>
235
+ <td>78.5
236
  </td>
237
+ <td>77.7
238
  </td>
239
+ <td>99.0%
240
  </td>
241
  </tr>
242
  <tr>
 
244
  </td>
245
  <td>54.5
246
  </td>
247
+ <td>54.3
248
  </td>
249
+ <td>99.6%
250
  </td>
251
  </tr>
252
  <tr>
253
  <td><strong>Average</strong>
254
  </td>
255
+ <td><strong>73.8</strong>
256
+ </td>
257
+ <td><strong>73.6</strong>
258
+ </td>
259
+ <td><strong>99.7%</strong>
260
+ </td>
261
+ </tr>
262
+ <tr>
263
+ <td><strong>OpenLLM v2</strong>
264
+ </td>
265
+ </tr>
266
+ <tr>
267
+ <td>MMLU-Pro (5-shot)
268
+ </td>
269
+ <td>30.8
270
+ </td>
271
+ <td>31.2
272
+ </td>
273
+ <td>101.3%
274
+ </td>
275
+ </tr>
276
+ <tr>
277
+ <td>IFEval (0-shot)
278
+ </td>
279
+ <td>77.9
280
+ </td>
281
+ <td>77.2
282
+ </td>
283
+ <td>99.1%
284
+ </td>
285
+ </tr>
286
+ <tr>
287
+ <td>BBH (3-shot)
288
+ </td>
289
+ <td>30.1
290
+ </td>
291
+ <td>29.7
292
+ </td>
293
+ <td>98.5%
294
+ </td>
295
+ </tr>
296
+ <tr>
297
+ <td>Math-lvl-5 (4-shot)
298
+ </td>
299
+ <td>15.7
300
+ </td>
301
+ <td>16.5
302
+ </td>
303
+ <td>105.4%
304
+ </td>
305
+ </tr>
306
+ <tr>
307
+ <td>GPQA (0-shot)
308
+ </td>
309
+ <td>3.7
310
  </td>
311
+ <td>5.7
312
  </td>
313
+ <td>156.0%
314
+ </td>
315
+ </tr>
316
+ <tr>
317
+ <td>MuSR (0-shot)
318
+ </td>
319
+ <td>7.6
320
+ </td>
321
+ <td>7.5
322
+ </td>
323
+ <td>98.8%
324
+ </td>
325
+ </tr>
326
+ <tr>
327
+ <td><strong>Average</strong>
328
+ </td>
329
+ <td><strong>27.6</strong>
330
+ </td>
331
+ <td><strong>28.0</strong>
332
+ </td>
333
+ <td><strong>101.2%</strong>
334
+ </td>
335
+ </tr>
336
+ <tr>
337
+ <td><strong>Coding</strong>
338
+ </td>
339
+ </tr>
340
+ <tr>
341
+ <td>HumanEval pass@1
342
+ </td>
343
+ <td>67.3
344
+ </td>
345
+ <td>67.3
346
+ </td>
347
+ <td>100.0%
348
+ </td>
349
+ </tr>
350
+ <tr>
351
+ <td>HumanEval+ pass@1
352
+ </td>
353
+ <td>60.7
354
+ </td>
355
+ <td>61.3
356
+ </td>
357
+ <td>101.0%
358
  </td>
359
  </tr>
360
  </table>
 
435
  --tasks truthfulqa \
436
  --num_fewshot 0 \
437
  --batch_size auto
438
+ ```
439
+
440
+ #### OpenLLM v2
441
+ ```
442
+ lm_eval \
443
+ --model vllm \
444
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
445
+ --apply_chat_template \
446
+ --fewshot_as_multiturn \
447
+ --tasks leaderboard \
448
+ --batch_size auto
449
+ ```
450
+
451
+ #### HumanEval and HumanEval+
452
+ ##### Generation
453
+ ```
454
+ python3 codegen/generate.py \
455
+ --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic \
456
+ --bs 16 \
457
+ --temperature 0.2 \
458
+ --n_samples 50 \
459
+ --root "." \
460
+ --dataset humaneval
461
+ ```
462
+ ##### Sanitization
463
+ ```
464
+ python3 evalplus/sanitize.py \
465
+ humaneval/neuralmagic--Meta-Llama-3.1-8B-Instruct-FP8-dynamic_vllm_temp_0.2
466
+ ```
467
+ ##### Evaluation
468
+ ```
469
+ evalplus.evaluate \
470
+ --dataset humaneval \
471
+ --samples humaneval/neuralmagic--Meta-Llama-3.1-8B-Instruct-FP8-dynamic_vllm_temp_0.2-sanitized
472
+ ```