acc or acc_norm?

#106
by paopao0226 - opened

Hello, when testing on the ARC dataset there are two scores (acc and acc_norm), so which one does the leaderboard use?

In the auto_leaderboard files, in load_results.py, everything except TruthfulQA (which uses mc2) uses acc_norm.

# clone / pull the lmeh eval data
METRICS = ["acc_norm", "acc_norm", "acc_norm", "mc2"]
BENCHMARKS = ["arc_challenge", "hellaswag", "hendrycks", "truthfulqa_mc"]
BENCH_TO_NAME = {
    "arc_challenge": AutoEvalColumn.arc.name,
    "hellaswag": AutoEvalColumn.hellaswag.name,
    "hendrycks": AutoEvalColumn.mmlu.name,
    "truthfulqa_mc": AutoEvalColumn.truthfulqa.name,
}
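For context, acc picks the answer choice with the highest raw log-likelihood, while acc_norm normalizes each choice's log-likelihood by its byte length before picking. A minimal sketch of the difference (an illustration, not the harness's actual code):

# Illustration only: how acc vs. acc_norm pick an answer for a
# multiple-choice task like ARC. Log-likelihoods are negative, so dividing
# by the byte length of each choice stops longer answers from being
# penalized just for being longer.
def pick_answer(loglikelihoods, choices, normalize=False):
    scores = []
    for ll, text in zip(loglikelihoods, choices):
        # acc_norm: divide the log-likelihood by the choice's byte length
        scores.append(ll / len(text.encode("utf-8")) if normalize else ll)
    return max(range(len(scores)), key=lambda i: scores[i])

The reported acc or acc_norm is then just the fraction of questions where the picked index matches the gold answer.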

@lilloukas OK, thanks! I think that answers it.

Open LLM Leaderboard org

Hello @paopao0226, note that we just changed the metric we use for MMLU. The file now reads:

# clone / pull the lmeh eval data
METRICS = ["acc_norm", "acc_norm", "acc", "mc2"]
BENCHMARKS = ["arc:challenge", "hellaswag", "hendrycksTest", "truthfulqa:mc"]
BENCH_TO_NAME = {
    "arc:challenge": AutoEvalColumn.arc.name,
    "hellaswag": AutoEvalColumn.hellaswag.name,
    "hendrycksTest": AutoEvalColumn.mmlu.name,
    "truthfulqa:mc": AutoEvalColumn.truthfulqa.name,
}
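Given a results file in the lm-eval-harness format, the per-benchmark scores can be pulled out roughly like this (a hypothetical sketch, assuming a single aggregated hendrycksTest entry; in the raw harness output MMLU is split into many hendrycksTest-* subtasks that are averaged first):

# Hypothetical sketch: extract the leaderboard metric for each benchmark
# from an lm-eval-harness style results dict and average them.
METRICS = ["acc_norm", "acc_norm", "acc", "mc2"]
BENCHMARKS = ["arc:challenge", "hellaswag", "hendrycksTest", "truthfulqa:mc"]

def leaderboard_scores(results):
    # results is assumed to look like
    # {"results": {"arc:challenge": {"acc": 0.49, "acc_norm": 0.52}, ...}}
    scores = {b: results["results"][b][m] for b, m in zip(BENCHMARKS, METRICS)}
    scores["average"] = sum(scores.values()) / len(BENCHMARKS)
    return scores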

@SaylorTwift OK, so is this the latest version of what the leaderboard is using?

Open LLM Leaderboard org

@paopao0226 Yes! :) And you can find the details in the About tab if you need them.

clefourrier changed discussion status to closed

@clefourrier Thanks! The About tab has become more informative.
