Lighteval documentation

Use VLLM as backend

Lighteval allows you to use vLLM as a backend, providing significant speedups. To use it, simply change model_args to reflect the arguments you want to pass to vLLM.

lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0|0"

vLLM can distribute the model across multiple GPUs using data parallelism, pipeline parallelism, or tensor parallelism. You can choose the parallelism method by setting the corresponding option in model_args.

For example, if you have 4 GPUs, you can split the model across them using tensor parallelism:

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"

Or, if your model fits on a single GPU, you can use data parallelism to speed up the evaluation:

lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"

Use a config file

For more advanced configurations, you can use a config file for the model. An example of a config file is shown below and can be found at examples/model_configs/vllm_model_config.yaml.

lighteval vllm \
    "examples/model_configs/vllm_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0|0"
model: # Model specific parameters
  base_params:
    model_args: "pretrained=HuggingFaceTB/SmolLM-1.7B,revision=main,dtype=bfloat16" # Model args that you would pass in the command line
  generation: # Generation specific parameters
    temperature: 0.3
    repetition_penalty: 1.0
    frequency_penalty: 0.0
    presence_penalty: 0.0
    seed: 42
    top_k: 0
    min_p: 0.0
    top_p: 0.9

If you run into OOM issues, you might need to reduce the model's context size as well as the gpu_memory_utilization parameter.
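
For example, a minimal sketch of lowering both values through model_args (assuming max_model_length is the context-size argument accepted by the vLLM backend):

lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,max_model_length=2048,gpu_memory_utilization=0.8" \
    "leaderboard|truthfulqa:mc|0|0"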

Dynamically changing the metric configuration

For special kinds of metrics like Pass@K or LiveCodeBench’s codegen metric, you may need to pass specific values like the number of generations. This can be done in the yaml file in the following way:

model: # Model specific parameters
  base_params:
    model_args: "pretrained=HuggingFaceTB/SmolLM-1.7B,revision=main,dtype=bfloat16" # Model args that you would pass in the command line
  generation: # Generation specific parameters
    temperature: 0.3
    repetition_penalty: 1.0
    frequency_penalty: 0.0
    presence_penalty: 0.0
    seed: 42
    top_k: 0
    min_p: 0.0
    top_p: 0.9
  metric_options: # Optional metric arguments
    codegen_pass@1:16:
      num_samples: 16

An optional metric_options key can be passed in the yaml file, using the name of the metric or metrics as defined in Metric.metric_name. In this case, the codegen_pass@1:16 metric defined in our tasks will have its num_samples updated to 16, regardless of the number defined by default.
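
Because metric_options is keyed by metric name, several metrics can be overridden in the same file. A minimal sketch, where another_codegen_metric is a hypothetical metric name used purely for illustration:

  metric_options: # Optional metric arguments
    codegen_pass@1:16:
      num_samples: 16
    another_codegen_metric: # hypothetical metric name, for illustration only
      num_samples: 8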
