alexmarques committed on
Commit
153436a
1 Parent(s): 764531e

Update README.md

Files changed (1)
  1. README.md +178 -23
README.md CHANGED
@@ -31,8 +31,9 @@ base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
31
  - **License(s):** Llama3.1
32
  - **Model Developers:** Neural Magic
33
 
34
- Quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
35
- It achieves scores within 3.1% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag and Winogrande, and TruthfulQA.
 
36
 
37
  ### Model Optimizations
38
 
@@ -121,13 +122,21 @@ model.quantize(examples)
121
  model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
122
  ```
123
 
 
124
 
 
 
125
 
126
- ## Evaluation
127
 
128
- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
129
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
130
- This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
131
 
132
  **Note:** Results have been updated after Meta modified the chat template.
133
 
@@ -145,12 +154,26 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
145
  <td><strong>Recovery</strong>
146
  </td>
147
  </tr>
148
  <tr>
149
  <td>MMLU (5-shot)
150
  </td>
151
- <td>68.32
152
  </td>
153
- <td>66.89
154
  </td>
155
  <td>97.9%
156
  </td>
@@ -158,9 +181,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
158
  <tr>
159
  <td>MMLU (CoT, 0-shot)
160
  </td>
161
- <td>72.83
162
  </td>
163
- <td>71.06
164
  </td>
165
  <td>97.6%
166
  </td>
@@ -168,9 +191,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
168
  <tr>
169
  <td>ARC Challenge (0-shot)
170
  </td>
171
- <td>81.40
172
  </td>
173
- <td>80.20
174
  </td>
175
  <td>98.0%
176
  </td>
@@ -178,9 +201,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
178
  <tr>
179
  <td>GSM-8K (CoT, 8-shot, strict-match)
180
  </td>
181
- <td>82.79
182
  </td>
183
- <td>82.94
184
  </td>
185
  <td>100.2%
186
  </td>
@@ -188,9 +211,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
188
  <tr>
189
  <td>Hellaswag (10-shot)
190
  </td>
191
- <td>80.47
192
  </td>
193
- <td>79.87
194
  </td>
195
  <td>99.3%
196
  </td>
@@ -198,9 +221,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
198
  <tr>
199
  <td>Winogrande (5-shot)
200
  </td>
201
- <td>78.06
202
  </td>
203
- <td>77.98
204
  </td>
205
  <td>99.9%
206
  </td>
@@ -208,9 +231,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
208
  <tr>
209
  <td>TruthfulQA (0-shot, mc2)
210
  </td>
211
- <td>54.48
212
  </td>
213
- <td>52.81
214
  </td>
215
  <td>96.9%
216
  </td>
@@ -218,13 +241,111 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
218
  <tr>
219
  <td><strong>Average</strong>
220
  </td>
221
- <td><strong>74.25</strong>
222
  </td>
223
- <td><strong>73.45</strong>
224
  </td>
225
  <td><strong>98.9%</strong>
226
  </td>
227
  </tr>
228
  </table>
229
 
230
  ### Reproduction
@@ -305,4 +426,38 @@ lm_eval \
305
  --tasks truthfulqa \
306
  --num_fewshot 0 \
307
  --batch_size auto
308
- ```
31
  - **License(s):** Llama3.1
32
  - **Model Developers:** Neural Magic
33
 
34
+ This model is a quantized version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
35
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
36
+ Meta-Llama-3.1-8B-Instruct-quantized.w4a16 achieves 93.0% recovery for the Arena-Hard evaluation, 98.9% for OpenLLM v1 (using Meta's prompting when available), 96.1% for OpenLLM v2, 99.7% for HumanEval pass@1, and 97.4% for HumanEval+ pass@1.
37
 
38
  ### Model Optimizations
39
 
 
122
  model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
123
  ```
124
 
125
+ ## Evaluation
126
 
127
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
128
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
129
 
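+ For illustration, a minimal offline-generation sketch with vLLM is shown below; the prompt and sampling settings are only examples, not the exact configuration used by the evaluation harnesses.
+ ```
+ from vllm import LLM, SamplingParams
+
+ # Load the quantized checkpoint; vLLM should pick up the quantization config from the model files.
+ llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16", max_model_len=4096)
+
+ # Example sampling settings; each benchmark harness controls its own generation parameters.
+ params = SamplingParams(temperature=0.0, max_tokens=256)
+
+ outputs = llm.generate(["Explain INT4 weight quantization in one paragraph."], params)
+ print(outputs[0].outputs[0].text)
+ ```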
130
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
131
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
132
+ We report below the scores obtained in each judgement and the average.
133
+
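+ For example, the Arena-Hard score reported below is the mean of the two judgements, and recovery is the ratio of the quantized model's score to the unquantized model's score; a sketch of the arithmetic using the values from the table:
+ ```
+ baseline  = (25.1 + 26.5) / 2   # Meta-Llama-3.1-8B-Instruct -> 25.8
+ quantized = (23.4 + 24.6) / 2   # Meta-Llama-3.1-8B-Instruct-quantized.w4a16 -> 24.0
+ recovery  = 100 * quantized / baseline
+ print(f"{recovery:.1f}%")       # ~93.0%
+ ```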
134
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
135
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
136
 
137
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
138
+
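+ The HumanEval and HumanEval+ scores below are pass@1, estimated from multiple samples per problem (see the generation command in the Reproduction section). The sketch below shows the standard unbiased pass@k estimator, which for k=1 reduces to the fraction of passing samples per problem; it is the usual formula for this metric, not the EvalPlus internals.
+ ```
+ from math import comb
+
+ def pass_at_k(n: int, c: int, k: int) -> float:
+     """Unbiased pass@k estimate from n samples, c of which pass the tests."""
+     if n - c < k:
+         return 1.0
+     return 1.0 - comb(n - c, k) / comb(n, k)
+
+ # e.g. 30 of 50 generated samples pass -> pass@1 = 0.6
+ print(pass_at_k(50, 30, 1))
+ ```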
139
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
140
 
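+ These output datasets can be loaded with the `datasets` library, for example as below; the split and column layout may differ between datasets, so inspect the loaded object rather than assuming a fixed schema.
+ ```
+ from datasets import load_dataset
+
+ # Model generations for the Arena-Hard evaluation of the quantized Llama-3.1 models
+ ds = load_dataset("neuralmagic/quantized-llama-3.1-arena-hard-evals")
+ print(ds)
+ ```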
141
  **Note:** Results have been updated after Meta modified the chat template.
142
 
 
154
  <td><strong>Recovery</strong>
155
  </td>
156
  </tr>
157
+ <tr>
158
+ <td><strong>Arena Hard</strong>
159
+ </td>
160
+ <td>25.8 (25.1 / 26.5)
161
+ </td>
162
+ <td>24.0 (23.4 / 24.6)
163
+ </td>
164
+ <td>93.0%
165
+ </td>
166
+ </tr>
167
+ <tr>
168
+ <td><strong>OpenLLM v1</strong>
169
+ </td>
170
+ </tr>
171
  <tr>
172
  <td>MMLU (5-shot)
173
  </td>
174
+ <td>68.3
175
  </td>
176
+ <td>66.9
177
  </td>
178
  <td>97.9%
179
  </td>
 
181
  <tr>
182
  <td>MMLU (CoT, 0-shot)
183
  </td>
184
+ <td>72.8
185
  </td>
186
+ <td>71.1
187
  </td>
188
  <td>97.6%
189
  </td>
 
191
  <tr>
192
  <td>ARC Challenge (0-shot)
193
  </td>
194
+ <td>81.4
195
  </td>
196
+ <td>80.2
197
  </td>
198
  <td>98.0%
199
  </td>
 
201
  <tr>
202
  <td>GSM-8K (CoT, 8-shot, strict-match)
203
  </td>
204
+ <td>82.8
205
  </td>
206
+ <td>82.9
207
  </td>
208
  <td>100.2%
209
  </td>
 
211
  <tr>
212
  <td>Hellaswag (10-shot)
213
  </td>
214
+ <td>80.5
215
  </td>
216
+ <td>79.9
217
  </td>
218
  <td>99.3%
219
  </td>
 
221
  <tr>
222
  <td>Winogrande (5-shot)
223
  </td>
224
+ <td>78.1
225
  </td>
226
+ <td>78.0
227
  </td>
228
  <td>99.9%
229
  </td>
 
231
  <tr>
232
  <td>TruthfulQA (0-shot, mc2)
233
  </td>
234
+ <td>54.5
235
  </td>
236
+ <td>52.8
237
  </td>
238
  <td>96.9%
239
  </td>
 
241
  <tr>
242
  <td><strong>Average</strong>
243
  </td>
244
+ <td><strong>74.3</strong>
245
  </td>
246
+ <td><strong>73.5</strong>
247
  </td>
248
  <td><strong>98.9%</strong>
249
  </td>
250
  </tr>
251
+ <tr>
252
+ <td><strong>OpenLLM v2</strong>
253
+ </td>
254
+ </tr>
255
+ <tr>
256
+ <td>MMLU-Pro (5-shot)
257
+ </td>
258
+ <td>30.8
259
+ </td>
260
+ <td>28.8
261
+ </td>
262
+ <td>93.6%
263
+ </td>
264
+ </tr>
265
+ <tr>
266
+ <td>IFEval (0-shot)
267
+ </td>
268
+ <td>77.9
269
+ </td>
270
+ <td>76.3
271
+ </td>
272
+ <td>98.0%
273
+ </td>
274
+ </tr>
275
+ <tr>
276
+ <td>BBH (3-shot)
277
+ </td>
278
+ <td>30.1
279
+ </td>
280
+ <td>28.9
281
+ </td>
282
+ <td>96.1%
283
+ </td>
284
+ </tr>
285
+ <tr>
286
+ <td>Math-lvl-5 (4-shot)
287
+ </td>
288
+ <td>15.7
289
+ </td>
290
+ <td>14.8
291
+ </td>
292
+ <td>94.4%
293
+ </td>
294
+ </tr>
295
+ <tr>
296
+ <td>GPQA (0-shot)
297
+ </td>
298
+ <td>3.7
299
+ </td>
300
+ <td>4.0
301
+ </td>
302
+ <td>109.8%
303
+ </td>
304
+ </tr>
305
+ <tr>
306
+ <td>MuSR (0-shot)
307
+ </td>
308
+ <td>7.6
309
+ </td>
310
+ <td>6.3
311
+ </td>
312
+ <td>83.2%
313
+ </td>
314
+ </tr>
315
+ <tr>
316
+ <td><strong>Average</strong>
317
+ </td>
318
+ <td><strong>27.6</strong>
319
+ </td>
320
+ <td><strong>26.5</strong>
321
+ </td>
322
+ <td><strong>96.1%</strong>
323
+ </td>
324
+ </tr>
325
+ <tr>
326
+ <td><strong>Coding</strong>
327
+ </td>
328
+ </tr>
329
+ <tr>
330
+ <td>HumanEval pass@1
331
+ </td>
332
+ <td>67.3
333
+ </td>
334
+ <td>67.1
335
+ </td>
336
+ <td>99.7%
337
+ </td>
338
+ </tr>
339
+ <tr>
340
+ <td>HumanEval+ pass@1
341
+ </td>
342
+ <td>60.7
343
+ </td>
344
+ <td>59.1
345
+ </td>
346
+ <td>97.4%
347
+ </td>
348
+ </tr>
349
  </table>
350
 
351
  ### Reproduction
 
426
  --tasks truthfulqa \
427
  --num_fewshot 0 \
428
  --batch_size auto
429
+ ```
430
+
431
+ #### OpenLLM v2
432
+ ```
433
+ lm_eval \
434
+ --model vllm \
435
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
436
+ --apply_chat_template \
437
+ --fewshot_as_multiturn \
438
+ --tasks leaderboard \
439
+ --batch_size auto
440
+ ```
441
+
442
+ #### HumanEval and HumanEval+
443
+ ##### Generation
444
+ ```
445
+ python3 codegen/generate.py \
446
+ --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 \
447
+ --bs 16 \
448
+ --temperature 0.2 \
449
+ --n_samples 50 \
450
+ --root "." \
451
+ --dataset humaneval
452
+ ```
453
+ ##### Sanitization
454
+ ```
455
+ python3 evalplus/sanitize.py \
456
+ humaneval/neuralmagic--Meta-Llama-3.1-8B-Instruct-quantized.w4a16_vllm_temp_0.2
457
+ ```
458
+ ##### Evaluation
459
+ ```
460
+ evalplus.evaluate \
461
+ --dataset humaneval \
462
+ --samples humaneval/neuralmagic--Meta-Llama-3.1-8B-Instruct-quantized.w4a16_vllm_temp_0.2-sanitized
463
+ ```