Open LLM Leaderboard v1
Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago, when they wanted to reproduce and compare results from several published models. It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, were sometimes doubtful, and in most cases simply relied on optimized prompts or evaluation setups to give the models the best possible chance. They therefore decided to create a place where reference models would be evaluated in the exact same setup (same questions, asked in the same order, etc.) to gather completely reproducible and comparable results; and that’s how the Open LLM Leaderboard was born!
Following a series of highly visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months.
Around 300,000 community members used and collaborated on it monthly through submissions and discussions, usually to:
- Find state-of-the-art open-source releases as the leaderboard provides reproducible scores separating marketing fluff from actual progress in the field.
- Evaluate their own work, be it pretraining or finetuning, compare their methods in the open against the best existing models, and earn public recognition.
In June 2024, we archived it and replaced it with a newer version, but you’ll find all the relevant information about it below!
Tasks
📈 We evaluated models on 6 key benchmarks using the Eleuther AI Language Model Evaluation Harness, a unified framework to test generative language models on a large number of different evaluation tasks.
- AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
- HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
- Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
For all these evaluations, a higher score is a better score.
We chose these benchmarks because they test a variety of reasoning and general knowledge skills across a wide range of fields, in 0-shot and few-shot settings.
Results
You can find:
- detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard-old/results
- details on the inputs/outputs for the models in the `details` of each model, which you can access by clicking the 📄 emoji after the model name
- community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard-old/requests

If a model’s name contains “Flagged”, it has been flagged by the community and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
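Both repositories are regular Hugging Face dataset repos, so you can also clone them and browse the raw JSON files locally. Here is a minimal sketch (the exact per-model folder layout is an assumption, best checked after cloning):

# Clone the archived results (per-model JSON score files) and requests datasets
git clone https://huggingface.co/datasets/open-llm-leaderboard-old/results
git clone https://huggingface.co/datasets/open-llm-leaderboard-old/requests
# Result files should be grouped by organization/model; list what is available
ls results | head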
Reproducibility
To reproduce our results, you can run the following command, using this version of the Eleuther AI Harness:
python main.py --model=hf-causal-experimental \
--model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" \
--tasks=<task_list> \
--num_fewshot=<n_few_shot> \
--batch_size=1 \
--output_path=<output_path>
Note: We evaluated all models on a single node of 8 H100s, so the global batch size was 8 for each evaluation. If you don’t use parallelism, adapt your batch size to fit. You can expect results to vary slightly for different batch sizes because of padding.
The tasks and few-shot parameters are:
- ARC: 25-shot, `arc_challenge` (`acc_norm`)
- HellaSwag: 10-shot, `hellaswag` (`acc_norm`)
- TruthfulQA: 0-shot, `truthfulqa_mc` (`mc2`)
- MMLU: 5-shot, `hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions` (average of all the results `acc`)
- Winogrande: 5-shot, `winogrande` (`acc`)
- GSM8k: 5-shot, `gsm8k` (`acc`)

Side note on the baseline scores:
- for log-likelihood evaluations, we select the random baseline
- for GSM8k, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
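Putting the command and parameters above together, a single-task run looks roughly like the sketch below. The harness commit, model name, and output path are placeholders to fill in yourself (use the harness revision linked above):

# Install the pinned version of the Eleuther AI Harness
# (<harness_commit> is a placeholder for the revision linked above)
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout <harness_commit>
pip install -e .

# Example: HellaSwag, 10-shot, reporting acc_norm (model name is a placeholder)
python main.py --model=hf-causal-experimental \
    --model_args="pretrained=<your_model>,use_accelerate=True,revision=main" \
    --tasks=hellaswag \
    --num_fewshot=10 \
    --batch_size=1 \
    --output_path=hellaswag_results.json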
Blogs
During the life of the leaderboard, we wrote two blog posts, which you can find here and here.