Using the Python API
Lighteval can be used from a custom Python script. To evaluate a model, you will need to set up an EvaluationTracker, PipelineParameters, a model or a model_config, and a Pipeline. After that, simply run the pipeline and save the results.
import lighteval
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.vllm.vllm_model import VLLMModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters
from lighteval.utils.imports import is_package_available

if is_package_available("accelerate"):
    from datetime import timedelta

    from accelerate import Accelerator, InitProcessGroupKwargs

    accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=3000))])
else:
    accelerator = None


def main():
    evaluation_tracker = EvaluationTracker(
        output_dir="./results",
        save_details=True,
        push_to_hub=True,
        hub_results_org="your_username",  # Replace with your actual username
    )

    pipeline_params = PipelineParameters(
        launcher_type=ParallelismManager.ACCELERATE,
        custom_tasks_directory=None,  # Set to a path if using custom tasks
        # Remove the parameter below once your configuration is tested
        max_samples=10,
    )

    model_config = VLLMModelConfig(
        model_name="HuggingFaceH4/zephyr-7b-beta",
        dtype="float16",
    )

    task = "lighteval|gsm8k|5"

    pipeline = Pipeline(
        tasks=task,
        pipeline_parameters=pipeline_params,
        evaluation_tracker=evaluation_tracker,
        model_config=model_config,
    )

    pipeline.evaluate()
    pipeline.save_and_push_results()
    pipeline.show_results()


if __name__ == "__main__":
    main()
Key Components
EvaluationTracker
The EvaluationTracker handles logging and saving evaluation results. It can save results locally and optionally push them to the Hugging Face Hub.
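For example, a tracker that only stores results locally can be configured by disabling the Hub push (a minimal sketch reusing the parameters from the example above):

from lighteval.logging.evaluation_tracker import EvaluationTracker

# Save results and per-sample details locally, without pushing to the Hub
evaluation_tracker = EvaluationTracker(
    output_dir="./results",
    save_details=True,
    push_to_hub=False,
)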
PipelineParameters
PipelineParameters configures how the evaluation pipeline runs, including parallelism settings and task configuration.
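Taken in isolation, the pipeline parameters from the example above look like this (a sketch limited to the fields shown there; remove max_samples for a full run):

from lighteval.pipeline import ParallelismManager, PipelineParameters

pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.ACCELERATE,  # parallelism backend used to launch the run
    custom_tasks_directory=None,                  # path to custom task definitions, if any
    max_samples=10,                               # cap samples per task while testing
)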
Model Configuration
Model configurations define the model to be evaluated, including the model name, data type, and other model-specific parameters. Different backends (VLLM, Transformers, etc.) have their own configuration classes.
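For instance, to run the same model through the Transformers backend instead of vLLM, you would swap in that backend's configuration class. This is a sketch that assumes a TransformersModelConfig class importable from lighteval.models.transformers.transformers_model; check your installed version for the exact path and available fields:

from lighteval.models.transformers.transformers_model import TransformersModelConfig

# Same model, loaded through the Transformers backend instead of vLLM (assumed import path)
model_config = TransformersModelConfig(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    dtype="float16",
)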
Pipeline
The Pipeline orchestrates the entire evaluation process, taking the tasks, model configuration, and pipeline parameters to run the evaluation.
Running Multiple Tasks
You can evaluate on multiple tasks by providing a comma-separated list or a file path:
# Multiple tasks as comma-separated string
tasks = "lighteval|aime24|0,lighteval|aime25|0"
# Or load from a file
tasks = "./path/to/tasks.txt"
pipeline = Pipeline(
    tasks=tasks,
    # ... other parameters
)
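When loading from a file, the tasks file is expected to list one task specification per line, for example (hypothetical tasks.txt):

lighteval|aime24|0
lighteval|aime25|0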
Custom Tasks
To use custom tasks, set the custom_tasks_directory parameter to the path containing your custom task definitions:
pipeline_params = PipelineParameters(
    custom_tasks_directory="./path/to/custom/tasks",
    # ... other parameters
)
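Once the directory is set, a custom task is referenced with the usual suite|task|num_fewshot string, using the suite and task names declared in your definitions. In the sketch below, community and mytask are placeholders, not real task names:

tasks = "community|mytask|0"  # "community" and "mytask" are hypothetical placeholders

pipeline = Pipeline(
    tasks=tasks,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)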
For more information on creating custom tasks, see the Adding a Custom Task guide.