Open-source the evaluation code

#60
by memray - opened

Could the code used for evaluating the LLMs be released? Is it entirely based on EleutherAI/lm-evaluation-harness? It's not clear how the in-context examples are selected.

Thanks!

I replicated the results of LLaMA and Vicuna on this leaderboard perfectly using the EleutherAI/lm-evaluation-harness. The metrics are acc_norm for ARC-Challenge, MMLU, and Hellaswag, and mc2 for truthfulQA_mc.

Yep, I could also run the evaluations using the EleutherAI repository, but I could not find which metrics are used. Is it documented somewhere I am not aware of?

Yes, go to the Files tab (next to App) of open_llm_leaderboard, then open the utils.py file. It lists the benchmarks and the metrics.

Also, EleutherAI/lm-evaluation-harness doesn't provide good support for evaluating huge models (>20B). It would be great if open_llm_leaderboard could share their pipeline.

Hi @itanh0b, could you please share the exact command you ran? I found "hendrycks" for MMLU, but there are a ton of different sub-tasks of hendrycks (like hendrycksTest-abstract_algebra). Is there a way to run them all?

Thanks!

Hi @64bits

MMLU has 57 different tasks. They are formatted as hendrycksTest-{sub} in the lm-evaluation-harness, where sub is a topic like abstract_algebra. You need to evaluate on all of the tasks and compute the average of acc_norm across them. You can write a bash script that creates an array of topics and loops over them to run the tasks sequentially, which will be very slow; I ran the evaluation in parallel across tasks on a Slurm-based compute cluster.
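
A minimal bash sketch of that sequential loop, assuming commit 441e6ac of the harness (the topic array is truncated here, the full 57 topics are in the harness task registry, and the model name and output paths are purely illustrative):

# Evaluate MMLU sub-tasks one by one and collect per-task result files.
MODEL=gpt2
MODEL_PATH=gpt2          # folder with the weights/config.json, or a HF model ID
SHOTS=5
# Truncated list; add the remaining hendrycksTest topics (57 in total).
SUBJECTS=(abstract_algebra anatomy astronomy college_biology world_religions)
for sub in "${SUBJECTS[@]}"; do
  python main.py --device cuda --no_cache --model hf-causal-experimental \
    --model_args pretrained=$MODEL_PATH,trust_remote_code=True,use_accelerate=True \
    --tasks hendrycksTest-$sub --num_fewshot $SHOTS \
    --output_path ./$MODEL-results/$MODEL-hendrycksTest-$sub-$SHOTS-shots.json
done
# Afterwards, average the acc_norm values reported in the per-task JSON files.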

@itanh0b Could you tell me the name of the llama-7b Hugging Face model you used? I'm struggling because the result of yahma/llama-7b-hf does not match the leaderboard result. It would be very helpful if you could kindly share the model's name or the command you used.

Hello @v-xchen-v ,

I converted the LLaMA model to Hugging Face format myself, so I do not know how yahma/llama-7b-hf would do. Are you getting worse or better results? The commit that reproduces the Open LLM Leaderboard is 441e6ac.
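
For reference, pinning the harness to that commit looks roughly like this (assuming a fresh clone and the usual editable install):

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 441e6ac
pip install -e .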

Open LLM Leaderboard org

@memray The code we run is at the moment based on the Eleuther AI Harness (plus some custom logic to run things faster on our cluster); using it should give you the exact same results and numbers!
The in-context examples are selected by the default logic in the Eleuther AI Harness.

Here is the command you can use to evaluate your models. MODEL_PATH is the folder containing the weights and config.json, or a Hugging Face model ID that will be downloaded automatically. MODEL is just a name for the experiment you're running. SHOTS is the number of few-shot examples used per benchmark. subject is the task you want to evaluate on. This command also allows running on multiple GPUs in case the model you're evaluating is >30B. Please make sure you're using commit 441e6ac to reproduce the numbers on the leaderboard.

python main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained=$MODEL_PATH,trust_remote_code=True,use_accelerate=True --tasks $subject --num_fewshot $SHOTS --output_path ./$MODEL-results/$MODEL-$subject-$SHOTS-shots.json
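
For example, filling in the placeholders for a concrete 25-shot ARC-Challenge run (using gpt2 purely as an illustration, with an illustrative output path):

python main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained=gpt2,trust_remote_code=True,use_accelerate=True --tasks arc_challenge --num_fewshot 25 --output_path ./gpt2-results/gpt2-arc_challenge-25-shots.json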

memray changed discussion status to closed

@itanh0b Do you know how the MMLU dataset is evaluated? I am using the following command to evaluate the 57 MMLU hendrycksTest-* tasks and take the average. I got a score of 25.69% for GPT2; however, on the leaderboard it is 27.5%. For arc_challenge (25-shot), hellaswag (10-shot), and truthfulqa_mc (0-shot), I am able to reproduce the leaderboard results for GPT2.

python main.py --model hf-causal --model_args pretrained=gpt2 --tasks hendrycksTest-* --device cuda:0 --num_fewshot 5

Hello @viataur ,

I'm using the commit 441e6ac of lm-evaluation-harness to reproduce the numbers in the leaderboard. Later commits lead to different results on MMLU.

@itanh0b Thank you so much for the information, I will try it out. Do you happen to know the reason for the different results on MMLU: is it because the dataset changed, or because the code changed to a different calculation?

@viataur the way they extract the continuation changed to suit the LLaMA model tokenizer. Here is the commit that affected MMLU results for LLaMA models. It might be the same one that affects MMLU results for gpt2.

@itanh0b Thank you so much! I can confirm that commit 441e6ac reproduces the results of GPT2 on the Leaderboard.

It is actually this PR, https://github.com/EleutherAI/lm-evaluation-harness/pull/497, that affects the results for GPT2. The difference is in the prompt format: the original prompt scores the answer using the choice text (e.g. "mesoderm formation and occurs after neurulation."), while after PR 497 the answer is scored using the choice letter (e.g. "A"). If you run git checkout 441e6ac lm_eval/tasks/hendrycks_test.py on the latest code, it also reproduces the MMLU results of GPT2 on the Leaderboard.
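
In other words, on a current checkout of the harness you can restore just the old MMLU prompt format and re-run a sub-task, for example (task choice here is just an illustration):

git checkout 441e6ac lm_eval/tasks/hendrycks_test.py
python main.py --model hf-causal --model_args pretrained=gpt2 --tasks hendrycksTest-abstract_algebra --device cuda:0 --num_fewshot 5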

I am a beginner at this; could you please tell us what steps you followed for the LLaMA evaluation using EleutherAI/lm-evaluation-harness?

Open LLM Leaderboard org

Hi @arpithaabhishekgudekote ,
All the steps to reproduce this are in the About tab of the leaderboard :)
