alexmarques committed
Update README.md

README.md CHANGED
@@ -147,7 +147,7 @@ The model generated a single answer for each prompt form Arena-Hard, and each an
 We report below the scores obtained in each judgement and the average.
 
 OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
-This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-
+This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-405B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
 
 HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
 
@@ -155,7 +155,6 @@ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](ht
 
 ### Accuracy
 
-#### Open LLM Leaderboard evaluation scores
 <table>
   <tr>
     <td><strong>Benchmark</strong>
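For context on how such scores can be reproduced, below is a minimal sketch of running one OpenLLM v1 task through the lm-evaluation-harness Python API. It is only an illustration under stated assumptions, not the exact setup behind the reported numbers: `<model-id>` is a placeholder, the stock `gsm8k` task is used rather than the fork's Meta-Llama-3.1-style prompt variant, and the fork's specific task names are not shown in this diff.

```python
# Minimal sketch, assuming the lm-evaluation-harness fork is installed as `lm_eval`.
# `<model-id>` is a placeholder for the evaluated checkpoint, and the stock `gsm8k`
# task stands in for the fork's Meta-style prompt variant (task names not shown here).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                          # Hugging Face transformers backend
    model_args="pretrained=<model-id>",  # placeholder model identifier
    tasks=["gsm8k"],                     # OpenLLM v1 evaluates GSM-8K few-shot
    num_fewshot=5,
    batch_size=8,
)

# Per-task metrics (e.g. exact-match accuracy) live under the "results" key.
print(results["results"]["gsm8k"])
```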