Quick Tour
We recommend using the --help flag to get more information about the
available options for each command.
lighteval --help
Lighteval can be used with several different commands, each optimized for different evaluation scenarios.
Available Commands
Evaluation Backends
- lighteval accelerate: Evaluate models on CPU or one or more GPUs using 🤗 Accelerate
- lighteval nanotron: Evaluate models in distributed settings using ⚡️ Nanotron
- lighteval vllm: Evaluate models on one or more GPUs using 🚀 vLLM
- lighteval custom: Evaluate custom models (can be anything)
- lighteval sglang: Evaluate models using SGLang as backend
- lighteval endpoint: Evaluate models using various endpoints as backend
- lighteval endpoint inference-endpoint: Evaluate models using Hugging Face’s Inference Endpoints API
- lighteval endpoint tgi: Evaluate models using 🔗 Text Generation Inference running locally
- lighteval endpoint litellm: Evaluate models on any compatible API using LiteLLM
- lighteval endpoint inference-providers: Evaluate models using Hugging Face’s inference providers as backend
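For example, a rough sketch of the endpoint form, following the same model-args/task structure as the accelerate example below (the model_name value is illustrative LiteLLM-style naming; check lighteval endpoint litellm --help for the exact fields):
lighteval endpoint litellm \
     "model_name=openai/gpt-4o-mini" \
     "leaderboard|truthfulqa:mc|0"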
 
Evaluation Utils
- lighteval baseline: Compute baselines for given tasks
Utils
- lighteval tasks: List or inspect tasks
- lighteval tasks list: List all available tasks
- lighteval tasks inspect: Inspect a specific task to see its configuration and samples
- lighteval tasks create: Create a new task from a template
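For instance (a minimal sketch; the exact arguments inspect accepts may vary, see lighteval tasks inspect --help):
# list every registered task
lighteval tasks list
# assumed usage: inspect one task's configuration and samples
lighteval tasks inspect "leaderboard|truthfulqa:mc|0"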
 
Basic Usage
To evaluate GPT-2 on the Truthful QA benchmark with 🤗
Accelerate, run:
lighteval accelerate \
     "model_name=openai-community/gpt2" \
     "leaderboard|truthfulqa:mc|0"Here, we first choose a backend (either accelerate, nanotron, endpoint, or vllm), and then specify the model and task(s) to run.
Task Specification
The syntax for the task specification might be a bit hard to grasp at first. The format is as follows:
{suite}|{task}|{num_few_shot}
For instance, "leaderboard|truthfulqa:mc|0" runs the truthfulqa:mc task from the leaderboard suite with 0 few-shot examples.
Tasks have a function applied at the sample level and one at the corpus level. For example,
- an exact match can be applied per sample, then averaged over the corpus to give the final score
- samples can be left untouched per sample, with Corpus BLEU then applied over the whole corpus
If the task you are looking at has a sample level function (sample_level_fn) which can be parametrized, you can pass parameters in the CLI.
For example
{suite}|{task}@{parameter_name1}={value1}@{parameter_name2}={value2},...|0

All officially supported tasks can be found at the tasks_list and in the extended folder. Moreover, community-provided tasks can be found in the community folder.
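As a concrete (and purely hypothetical) illustration of the @ syntax above, assuming a task mytask in the community suite whose sample_level_fn exposes a normalize parameter:
lighteval accelerate \
     "model_name=openai-community/gpt2" \
     "community|mytask@normalize=True|0"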
For more details on the implementation of the tasks, such as how prompts are constructed or which metrics are used, you can examine the implementation file.
Running Multiple Tasks
Running multiple tasks is supported, either with a comma-separated list or by specifying a file path.
The file should be structured like examples/tasks/recommended_set.txt.
When specifying a path to a file, it should start with ./.
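Such a file simply lists one task specification per line. An illustrative sketch (not the actual contents of recommended_set.txt):
leaderboard|truthfulqa:mc|0
leaderboard|gsm8k|3
lighteval|aime24|0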
lighteval accelerate \
     "model_name=openai-community/gpt2" \
     ./path/to/lighteval/examples/tasks/recommended_set.txt
# or, e.g., "leaderboard|truthfulqa:mc|0,leaderboard|gsm8k|3"

Backend Configuration
General Information
The model-args argument takes a string representing a comma-separated list of model
arguments. The allowed arguments vary depending on the backend you use and
correspond to the fields of the model configurations.
The model configurations can be found here.
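For example, with the accelerate backend you can chain several fields in one comma-separated string. A minimal sketch (the revision field is an assumption here; model_name and dtype appear in the examples in this guide, and the full set of fields is defined by each backend’s model configuration class):
lighteval accelerate \
     "model_name=openai-community/gpt2,revision=main,dtype=float16" \
     "leaderboard|truthfulqa:mc|0"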
All models allow you to post-process your reasoning model predictions
to remove the thinking tokens from the trace used to compute the metrics,
using --remove-reasoning-tags and --reasoning-tags to specify which
reasoning tags to remove (defaults to <think> and </think>).
Here’s an example with mistralai/Magistral-Small-2507, which outputs custom
thinking tokens:
lighteval vllm \
    "model_name=mistralai/Magistral-Small-2507,dtype=float16,data_parallel_size=4" \
    "lighteval|aime24|0" \
    --remove-reasoning-tags \
    --reasoning-tags="[('[THINK]','[/THINK]')]"

Nanotron
To evaluate a model trained with Nanotron on a single GPU (note that Nanotron models cannot be evaluated without torchrun):
torchrun --standalone --nnodes=1 --nproc-per-node=1 \
    src/lighteval/__main__.py nanotron \
    --checkpoint-config-path ../nanotron/checkpoints/10/config.yaml \
    --lighteval-config-path examples/nanotron/lighteval_config_override_template.yaml

The nproc-per-node argument should match the data, tensor, and pipeline
parallelism configured in the lighteval config file (lighteval_config_override_template.yaml in the example above).
That is: nproc-per-node = data_parallelism * tensor_parallelism * pipeline_parallelism.
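For example, a config with data_parallelism=2, tensor_parallelism=4, and pipeline_parallelism=1 requires --nproc-per-node=8, since 2 * 4 * 1 = 8.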