Update README.md
README.md CHANGED
```diff
@@ -94,5 +94,7 @@ The model achieves the following results without any fine-tuning (zero-shot):
 |arc_easy |acc/acc_norm|0.4381/0.3948 |**0.4651**/**0.4247** |**0.0082**/**0.0029** |
 |arc_challenge|acc/acc_norm|0.1903/0.2270 |0.1997/0.2329 |0.4132/0.6256 |
 
-To get these results, we used the Eleuther AI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
-which can produce results different than those reported in the GPT2 paper.
+To get these results, we used commit `4f0410a4be0049729078376ce36a42dc308b6e38` of the Eleuther AI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
+which can produce results different than those reported in the GPT2 paper.
+We added a change [here](https://github.com/EleutherAI/lm-evaluation-harness/compare/master...mathemakitten:lm-evaluation-harness:master) to enable evaluation of the OLM GPT2, which has a slightly different vocab size.
+The p-values come from the stderr reported by the evaluation harness, plus a normal distribution assumption.
```
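The linked harness change is only referenced above, not shown, so the sketch below is a hypothetical illustration of one common way to evaluate a model whose output dimension is slightly larger than its tokenizer's vocabulary: slice the logits down to the tokenizer's length before scoring. The model id `olm/olm-gpt2-dec-2022` and the slicing approach are assumptions for illustration; this is not claimed to be what the linked commit actually does.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; substitute the OLM GPT2 model you are evaluating.
model_id = "olm/olm-gpt2-dec-2022"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, model_vocab_size)

# If the model's output dimension is larger than the tokenizer's vocabulary
# (e.g. padded for efficiency), drop the extra columns so token ids and
# logit indices line up before computing log-likelihoods.
logits = logits[..., : len(tokenizer)]
log_probs = torch.log_softmax(logits, dim=-1)
```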
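The last added line describes the significance test only in passing. As a rough illustration of what "stderr plus a normal distribution assumption" could mean in practice, the sketch below turns a difference in accuracies into a z-score and a two-sided normal p-value. The `stderr` value here is an estimate (`sqrt(p * (1 - p) / n)` with n = 1172 ARC-Challenge test examples), not a number taken from the harness output, and the choice of a two-sided test is an assumption; the README does not spell out the exact procedure.

```python
from math import erf, sqrt

def normal_p_value(acc_base: float, acc_new: float, stderr: float, two_sided: bool = True) -> float:
    """p-value for a difference in accuracies, assuming the difference
    divided by the harness-reported stderr is standard normal."""
    z = abs(acc_new - acc_base) / stderr
    tail = 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))  # P(Z > z) under a standard normal
    return 2.0 * tail if two_sided else tail

# arc_challenge accuracies from the table above; stderr is an illustrative estimate.
print(normal_p_value(0.1903, 0.1997, stderr=0.0115))  # roughly 0.41
```

With these illustrative inputs the result lands near the 0.4132 reported for arc_challenge, but the exact table values depend on the stderr actually reported by the harness run.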