alexmarques committed on
Commit
3188a5c
·
verified ·
1 Parent(s): c2358a5

Update README.md

  1. README.md +185 -27
README.md CHANGED
@@ -25,15 +25,16 @@ base_model: meta-llama/Meta-Llama-3.1-405B-Instruct
25
  - **Model Optimizations:**
26
  - **Weight quantization:** FP8
27
  - **Activation quantization:** FP8
28
- - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), this model is intended for assistant-like chat.
29
  - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
30
  - **Release Date:** 8/22/2024
31
  - **Version:** 1.1
32
  - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
33
  - **Model Developers:** Neural Magic
34
 
35
- Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) with the updated 8 kv-heads.
36
- It achieves an average score of 86.86 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.
 
37
 
38
  ### Model Optimizations
39
 
@@ -138,9 +139,19 @@ oneshot(
138
 
139
  ## Evaluation
140
 
141
- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
142
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
143
- This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
144
 
145
  ### Accuracy
146
 
@@ -151,17 +162,31 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
151
  </td>
152
  <td><strong>Meta-Llama-3.1-405B-Instruct </strong>
153
  </td>
154
- <td><strong>Meta-Llama-3.1-405B-Instruct-FP8-dynamic(this model)</strong>
155
  </td>
156
  <td><strong>Recovery</strong>
157
  </td>
158
  </tr>
 
159
  <tr>
160
  <td>MMLU (5-shot)
161
  </td>
162
- <td>87.41
163
  </td>
164
- <td>87.46
165
  </td>
166
  <td>100.0%
167
  </td>
@@ -169,9 +194,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
169
  <tr>
170
  <td>MMLU-cot (0-shot)
171
  </td>
172
- <td>88.11
173
  </td>
174
- <td>88.11
175
  </td>
176
  <td>100.0%
177
  </td>
@@ -179,9 +204,9 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
179
  <tr>
180
  <td>ARC Challenge (0-shot)
181
  </td>
182
- <td>94.97
183
  </td>
184
- <td>94.97
185
  </td>
186
  <td>100.0%
187
  </td>
@@ -189,29 +214,29 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
189
  <tr>
190
  <td>GSM-8K-cot (8-shot, strict-match)
191
  </td>
192
- <td>95.98
193
  </td>
194
- <td>95.75
195
  </td>
196
- <td>99.76%
197
  </td>
198
  </tr>
199
  <tr>
200
  <td>Hellaswag (10-shot)
201
  </td>
202
- <td>88.54
203
  </td>
204
- <td>88.45
205
  </td>
206
- <td>99.90%
207
  </td>
208
  </tr>
209
  <tr>
210
  <td>Winogrande (5-shot)
211
  </td>
212
- <td>87.21
213
  </td>
214
- <td>88.00
215
  </td>
216
  <td>100.9%
217
  </td>
@@ -219,23 +244,121 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
219
  <tr>
220
  <td>TruthfulQA (0-shot, mc2)
221
  </td>
222
- <td>65.31
223
  </td>
224
- <td>65.25
225
  </td>
226
- <td>99.91%
227
  </td>
228
  </tr>
229
  <tr>
230
  <td><strong>Average</strong>
231
  </td>
232
- <td><strong>86.79</strong>
233
  </td>
234
- <td><strong>86.86</strong>
235
  </td>
236
  <td><strong>100.0%</strong>
237
  </td>
238
  </tr>
 
239
  </table>
240
 
241
 
@@ -317,4 +440,39 @@ lm_eval \
317
  --tasks truthfulqa \
318
  --num_fewshot 0 \
319
  --batch_size auto
320
- ```
 
25
  - **Model Optimizations:**
26
  - **Weight quantization:** FP8
27
  - **Activation quantization:** FP8
28
+ - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct), this model is intended for assistant-like chat.
29
  - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
30
  - **Release Date:** 8/22/2024
31
  - **Version:** 1.1
32
  - **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
33
  - **Model Developers:** Neural Magic
34
 
35
+ This model is a quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
36
+ It was evaluated on several tasks, including multiple-choice, math reasoning, and open-ended text generation, to assess its quality relative to the unquantized model.
37
+ Meta-Llama-3.1-405B-Instruct-FP8-dynamic achieves 99.0% recovery for the Arena-Hard evaluation, 100.0% for OpenLLM v1 (using Meta's prompting when available), 99.9% for OpenLLM v2, 100.2% for HumanEval pass@1, and 101.1% for HumanEval+ pass@1.
38
 
39
  ### Model Optimizations
40
 
 
139
 
140
  ## Evaluation
141
 
142
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
143
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
144
+
145
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
146
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
147
+ We report below the scores obtained in each judgement and the average.
148
+
149
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
150
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
151
+
152
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
153
+
154
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
155
 
156
  ### Accuracy
157
 
 
162
  </td>
163
  <td><strong>Meta-Llama-3.1-405B-Instruct </strong>
164
  </td>
165
+ <td><strong>Meta-Llama-3.1-405B-Instruct-FP8-dynamic (this model)</strong>
166
  </td>
167
  <td><strong>Recovery</strong>
168
  </td>
169
  </tr>
170
+ <tr>
171
+ <td><strong>Arena Hard</strong>
172
+ </td>
173
+ <td>67.4 (67.3 / 67.5)
174
+ </td>
175
+ <td>66.7 (66.7 / 66.6)
176
+ </td>
177
+ <td>99.0%
178
+ </td>
179
+ </tr>
180
+ <tr>
181
+ <td><strong>OpenLLM v1</strong>
182
+ </td>
183
+ </tr>
184
  <tr>
185
  <td>MMLU (5-shot)
186
  </td>
187
+ <td>87.4
188
  </td>
189
+ <td>87.5
190
  </td>
191
  <td>100.0%
192
  </td>
 
194
  <tr>
195
  <td>MMLU-cot (0-shot)
196
  </td>
197
+ <td>88.1
198
  </td>
199
+ <td>88.1
200
  </td>
201
  <td>100.0%
202
  </td>
 
204
  <tr>
205
  <td>ARC Challenge (0-shot)
206
  </td>
207
+ <td>95.0
208
  </td>
209
+ <td>95.0
210
  </td>
211
  <td>100.0%
212
  </td>
 
214
  <tr>
215
  <td>GSM-8K-cot (8-shot, strict-match)
216
  </td>
217
+ <td>96.0
218
  </td>
219
+ <td>95.8
220
  </td>
221
+ <td>99.8%
222
  </td>
223
  </tr>
224
  <tr>
225
  <td>Hellaswag (10-shot)
226
  </td>
227
+ <td>88.5
228
  </td>
229
+ <td>88.5
230
  </td>
231
+ <td>99.9%
232
  </td>
233
  </tr>
234
  <tr>
235
  <td>Winogrande (5-shot)
236
  </td>
237
+ <td>87.2
238
  </td>
239
+ <td>88.0
240
  </td>
241
  <td>100.9%
242
  </td>
 
244
  <tr>
245
  <td>TruthfulQA (0-shot, mc2)
246
  </td>
247
+ <td>65.3
248
  </td>
249
+ <td>65.3
250
  </td>
251
+ <td>99.9%
252
  </td>
253
  </tr>
254
  <tr>
255
  <td><strong>Average</strong>
256
  </td>
257
+ <td><strong>86.8</strong>
258
  </td>
259
+ <td><strong>86.9</strong>
260
  </td>
261
  <td><strong>100.0%</strong>
262
  </td>
263
  </tr>
264
+ <tr>
265
+ <td><strong>OpenLLM v2</strong>
266
+ </td>
267
+ </tr>
268
+ <tr>
269
+ <td>MMLU-Pro (5-shot)
270
+ </td>
271
+ <td>59.7
272
+ </td>
273
+ <td>59.4
274
+ </td>
275
+ <td>99.4%
276
+ </td>
277
+ </tr>
278
+ <tr>
279
+ <td>IFEval (0-shot)
280
+ </td>
281
+ <td>87.7
282
+ </td>
283
+ <td>86.8
284
+ </td>
285
+ <td>99.0%
286
+ </td>
287
+ </tr>
288
+ <tr>
289
+ <td>BBH (3-shot)
290
+ </td>
291
+ <td>67.0
292
+ </td>
293
+ <td>67.1
294
+ </td>
295
+ <td>100.1%
296
+ </td>
297
+ </tr>
298
+ <tr>
299
+ <td>Math-lvl-5 (4-shot)
300
+ </td>
301
+ <td>39.0
302
+ </td>
303
+ <td>38.8
304
+ </td>
305
+ <td>99.7%
306
+ </td>
307
+ </tr>
308
+ <tr>
309
+ <td>GPQA (0-shot)
310
+ </td>
311
+ <td>19.5
312
+ </td>
313
+ <td>19.0
314
+ </td>
315
+ <td>97.4%
316
+ </td>
317
+ </tr>
318
+ <tr>
319
+ <td>MuSR (0-shot)
320
+ </td>
321
+ <td>19.5
322
+ </td>
323
+ <td>20.8
324
+ </td>
325
+ <td>106.9%
326
+ </td>
327
+ </tr>
328
+ <tr>
329
+ <td><strong>Average</strong>
330
+ </td>
331
+ <td><strong>48.7</strong>
332
+ </td>
333
+ <td><strong>48.7</strong>
334
+ </td>
335
+ <td><strong>99.9%</strong>
336
+ </td>
337
+ </tr>
338
+ <tr>
339
+ <td><strong>Coding</strong>
340
+ </td>
341
+ </tr>
342
+ <tr>
343
+ <td>HumanEval pass@1
344
+ </td>
345
+ <td>86.8
346
+ </td>
347
+ <td>87.0
348
+ </td>
349
+ <td>100.2%
350
+ </td>
351
+ </tr>
352
+ <tr>
353
+ <td>HumanEval+ pass@1
354
+ </td>
355
+ <td>80.1
356
+ </td>
357
+ <td>81.0
358
+ </td>
359
+ <td>101.1%
360
+ </td>
361
+ </tr>
362
  </table>
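
The Recovery column is not defined explicitly in this README; the reported figures are consistent with the quantized score expressed as a percentage of the unquantized score. Below is a minimal Python sketch under that assumption. The `recovery` helper is hypothetical, and the inputs are the rounded scores from the table, so results may differ slightly from reported values that were presumably computed from unrounded scores.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Return the quantized score as a percentage of the baseline score."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * quantized / baseline


# Rounded scores copied from the table above: (unquantized, quantized).
scores = {
    "Arena Hard": (67.4, 66.7),
    "HumanEval pass@1": (86.8, 87.0),
    "GPQA (0-shot)": (19.5, 19.0),
}

for task, (base, quant) in scores.items():
    print(f"{task}: {recovery(base, quant):.1f}%")
```

A value above 100% means the quantized model scored higher than the unquantized one on that task, which at these margins is likely within run-to-run noise.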
363
 
364
 
 
440
  --tasks truthfulqa \
441
  --num_fewshot 0 \
442
  --batch_size auto
443
+ ```
444
+
445
+ #### OpenLLM v2
446
+ ```
447
+ lm_eval \
448
+ --model vllm \
449
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=8,enable_chunked_prefill=True \
450
+ --apply_chat_template \
451
+ --fewshot_as_multiturn \
452
+ --tasks leaderboard \
453
+ --batch_size auto
454
+ ```
455
+
456
+ #### HumanEval and HumanEval+
457
+ ##### Generation
458
+ ```
459
+ python3 codegen/generate.py \
460
+ --model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic \
461
+ --bs 16 \
462
+ --temperature 0.2 \
463
+ --n_samples 50 \
464
+ --root "." \
465
+ --dataset humaneval \
466
+ --tp 8
467
+ ```
468
+ ##### Sanitization
469
+ ```
470
+ python3 evalplus/sanitize.py \
471
+ humaneval/neuralmagic--Meta-Llama-3.1-405B-Instruct-FP8-dynamic_vllm_temp_0.2
472
+ ```
473
+ ##### Evaluation
474
+ ```
475
+ evalplus.evaluate \
476
+ --dataset humaneval \
477
+ --samples humaneval/neuralmagic--Meta-Llama-3.1-405B-Instruct-FP8-dynamic_vllm_temp_0.2-sanitized
478
+ ```
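
The generation step above draws 50 samples per problem at temperature 0.2, so the reported pass@1 is an estimate over those samples rather than the result of a single greedy run. Below is a minimal sketch of the standard unbiased pass@k estimator, on the assumption that the EvalPlus fork follows it. The `pass_at_k` helper and the example counts are hypothetical and are not taken from the evaluation logs.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 43 of 50 samples pass, so pass@1 reduces to c / n.
print(f"{pass_at_k(n=50, c=43, k=1):.2f}")
```

The benchmark-level score is the mean of this per-problem estimate over all problems in the dataset.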