Reproducing Evaluation with lighteval
by PatrickHaller
Hey!
For reproducibility's sake, can you verify that this is the right configuration for evaluating with lighteval?
```
helm|hellaswag|0|0
lighteval|arc:easy|0|0
leaderboard|arc:challenge|0|0
helm|mmlu|0|0
helm|piqa|0|0
helm|commonsenseqa|0|0
lighteval|triviaqa|0|0
leaderboard|winogrande|0|0
lighteval|openbookqa|0|0
leaderboard|gsm8k|5|0
```
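In case it clarifies anything, this is how I'm reading those entries (a minimal sketch, assuming each line is `suite|task|num_fewshot|truncate_few_shots`; the `TaskSpec` helper is my own, not part of lighteval):

```python
# Minimal sketch, not part of lighteval: parse the task list above,
# assuming each line is "suite|task|num_fewshot|truncate_few_shots".
from typing import NamedTuple

class TaskSpec(NamedTuple):
    suite: str
    task: str
    num_fewshot: int
    truncate_few_shots: int

TASKS = """\
helm|hellaswag|0|0
lighteval|arc:easy|0|0
leaderboard|arc:challenge|0|0
helm|mmlu|0|0
helm|piqa|0|0
helm|commonsenseqa|0|0
lighteval|triviaqa|0|0
leaderboard|winogrande|0|0
lighteval|openbookqa|0|0
leaderboard|gsm8k|5|0
"""

def parse_task_specs(text: str) -> list[TaskSpec]:
    specs = []
    for line in text.strip().splitlines():
        suite, task, fewshot, truncate = line.split("|")
        specs.append(TaskSpec(suite, task, int(fewshot), int(truncate)))
    return specs

for spec in parse_task_specs(TASKS):
    print(spec)
```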
Furthermore:
- Did you manually calculate the average over the accuracy for `easy` and `challenge` for ARC? (My current guess is an unweighted mean; see the sketch right below.)
- What metrics did you report? Is it all accuracy?
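For context, this is how I'm currently combining the two ARC subsets, using the accuracies from the table further down. This is just my assumption (an unweighted mean), not necessarily what you did:

```python
# My current ARC aggregation (assumption: unweighted mean of the two subsets'
# accuracy; values taken from the lighteval table below).
arc_easy_acc = 0.7016       # lighteval|arc:easy, acc
arc_challenge_acc = 0.3660  # leaderboard|arc:challenge, acc

arc_avg = (arc_easy_acc + arc_challenge_acc) / 2
print(f"ARC (unweighted mean): {arc_avg:.4f}")  # 0.5338
```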
Greetings,
Patrick
It also seems that some numbers might be wrong:
- Winogrande: 52.5 -> 54.62
- GSM8K: 3.2 -> 0.32
- PIQA: 71.3 -> 3.1 (em), 9.0 (qem), 3.8 (pem), 19.79 (pqem)
- etc.

Some numbers seem to be wildly different from what you reported...
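To make the gap explicit, here are the discrepancies above spelled out in code (your reported numbers vs. what I get with lighteval, both in %; for PIQA I took the best case, pqem):

```python
# The discrepancies above, spelled out: reported numbers vs. my lighteval run
# (both in %; for PIQA the best-case metric, pqem, is used).
comparisons = {
    # task: (reported, reproduced)
    "winogrande": (52.5, 54.62),
    "gsm8k":      (3.2, 0.32),
    "piqa":       (71.3, 19.79),
}
for task, (reported, reproduced) in comparisons.items():
    print(f"{task:10s} reported={reported:5.2f}  reproduced={reproduced:5.2f}  "
          f"diff={reproduced - reported:+6.2f}")
```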
| Task |Version| Metric |Value | |Stderr|
|-----------------------------------------------|------:|--------|-----:|---|-----:|
|all | |em |0.1994|± |0.0285|
| | |qem |0.2052|± |0.0282|
| | |pem |0.2423|± |0.0308|
| | |pqem |0.4098|± |0.0352|
| | |acc |0.4650|± |0.0142|
| | |acc_norm|0.4796|± |0.0151|
|helm:commonsenseqa:0 | 0|em |0.1949|± |0.0113|
| | |qem |0.1974|± |0.0114|
| | |pem |0.1949|± |0.0113|
| | |pqem |0.3129|± |0.0133|
|helm:hellaswag:0 | 0|em |0.2173|± |0.0041|
| | |qem |0.2404|± |0.0043|
| | |pem |0.2297|± |0.0042|
| | |pqem |0.3162|± |0.0046|
|helm:mmlu:_average:0 | |em |0.2021|± |0.0297|
| | |qem |0.2109|± |0.0303|
| | |pem |0.2469|± |0.0321|
| | |pqem |0.4168|± |0.0366|
|(MMLU subtask rows omitted)                    |       |        |      |   |      |
|helm:piqa:0 | 0|em |0.0311|± |0.0025|
| | |qem |0.0904|± |0.0041|
| | |pem |0.0386|± |0.0027|
| | |pqem |0.1979|± |0.0057|
|leaderboard:arc:challenge:0 | 0|acc |0.3660|± |0.0141|
| | |acc_norm|0.3848|± |0.0142|
|leaderboard:gsm8k:5 | 0|qem |0.0030|± |0.0015|
|leaderboard:winogrande:0 | 0|acc |0.5462|± |0.0140|
|lighteval:arc:easy:0 | 0|acc |0.7016|± |0.0094|
| | |acc_norm|0.6801|± |0.0096|
|lighteval:openbookqa:0 | 0|acc |0.2460|± |0.0193|
| | |acc_norm|0.3740|± |0.0217|
|lighteval:triviaqa:0 | 0|qem |0.1699|± |0.0028|
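If you want to diff this against your own runs, here is the small helper I used to pull the numbers out of the pasted table. It's a quick sketch of my own (`parse_results_table` is not a lighteval API), and it only parses the markdown table format shown above:

```python
# Quick sketch (my own helper, not a lighteval API): turn the markdown table
# above into {task: {metric: (value, stderr)}} for easier diffing.
def parse_results_table(text: str) -> dict[str, dict[str, tuple[float, float]]]:
    results: dict[str, dict[str, tuple[float, float]]] = {}
    current_task = None
    for line in text.strip().splitlines():
        if not line.startswith("|") or line.startswith("|-"):
            continue  # skip non-table lines and the header separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 6:
            continue
        task, _version, metric, value, _pm, stderr = cells[:6]
        if task:
            current_task = task  # continuation rows leave the task cell empty
        if not current_task or not metric:
            continue
        try:
            results.setdefault(current_task, {})[metric] = (float(value), float(stderr))
        except ValueError:
            pass  # header row ("Value", "Stderr") or placeholder rows
    return results

# Example with three rows copied from the table above:
snippet = """\
|leaderboard:winogrande:0 | 0|acc |0.5462|± |0.0140|
|lighteval:arc:easy:0     | 0|acc |0.7016|± |0.0094|
| | |acc_norm|0.6801|± |0.0096|
"""
print(parse_results_table(snippet)["lighteval:arc:easy:0"])
# {'acc': (0.7016, 0.0094), 'acc_norm': (0.6801, 0.0096)}
```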