Use VLLM as backend

Lighteval allows you to use vllm as backend allowing great speedups. To use, simply change the model_args to reflect the arguments you want to pass to vllm.

Documentation for vllm engine args can be found here

lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0|0"

vllm is able to distribute the model across multiple GPUs using data parallelism, pipeline parallelism or tensor parallelism. You can choose the parallelism method by setting in the model_args.

For example if you have 4 GPUs you can split it across using tensor_parallelism:

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"

Or, if your model fits on a single GPU, you can use data_parallelism to speed up the evaluation:

lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"

Use a config file

For more advanced configurations, you can use a config file for the model. An example of a config file is shown below and can be found at examples/model_configs/vllm_model_config.yaml.

lighteval vllm \
    "examples/model_configs/vllm_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0|0"

model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    revision: "main"
    dtype: "bfloat16"
    tensor_parallel_size: 1
    data_parallel_size: 1
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    max_model_length: 2048
    swap_space: 4
    seed: 1
    trust_remote_code: True
    use_chat_template: True
    add_special_tokens: True
    multichoice_continuations_start_space: True
    pairwise_tokenization: True
    subfolder: null
    generation_parameters:
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      seed: 42
      stop_tokens: null
      max_new_tokens: 1024
      min_new_tokens: 0

In the case of OOM issues, you might need to reduce the context size of the model as well as reduce the gpu_memory_utilization parameter.

Dynamically changing the metric configuration

For special kinds of metrics like Pass@K or LiveCodeBench’s codegen metric, you may need to pass specific values like the number of generations. This can be done in the yaml file in the following way:

model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    revision: "main"
    dtype: "bfloat16"
    tensor_parallel_size: 1
    data_parallel_size: 1
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    max_model_length: 2048
    swap_space: 4
    seed: 1
    trust_remote_code: True
    use_chat_template: True
    add_special_tokens: True
    multichoice_continuations_start_space: True
    pairwise_tokenization: True
    subfolder: null
    generation_parameters:
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      seed: 42
      stop_tokens: null
      max_new_tokens: 1024
      min_new_tokens: 0
metric_options: # Optional metric arguments
    codegen_pass@1:16:
        num_samples: 16

An optional key metric_options can be passed in the yaml file, using the name of the metric or metrics, as defined in the Metric.metric_name. In this case, the codegen_pass@1:16 metric defined in our tasks will have the num_samples updated to 16, independently of the number defined by default.

< > Update on GitHub