Model Configs

Model configs define the model and its parameters. All parameters can be set either through model-args or in a model YAML file (see example here).

Base model config

class lighteval.models.utils.ModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None )

Parameters

  • generation_parameters (GenerationParameters) — Configuration parameters that control text generation behavior, including temperature, top_p, max_new_tokens, etc. Defaults to empty GenerationParameters.
  • system_prompt (str | None) — Optional system prompt to be used with chat models. This prompt sets the behavior and context for the model during evaluation.

Base configuration class for all model types in Lighteval.

This is the foundation class that all specific model configurations inherit from. It provides common functionality for parsing configuration from files and command-line arguments, as well as shared attributes that are used by all models like generation parameters and system prompts.

Methods:

  • from_path(path: str) — Load configuration from a YAML file.
  • from_args(args: str) — Parse configuration from a command-line argument string.
  • _parse_args(args: str) — Static method to parse argument strings into configuration dictionaries.

Example:

# Load from YAML file
config = ModelConfig.from_path("model_config.yaml")

# Load from command line arguments
config = ModelConfig.from_args("model_name=meta-llama/Llama-3.1-8B-Instruct,system_prompt='You are a helpful assistant.',generation_parameters={temperature=0.7}")

# Direct instantiation
config = ModelConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    generation_parameters=GenerationParameters(temperature=0.7),
    system_prompt="You are a helpful assistant."
)

Local Models

Transformers Model

class lighteval.models.transformers.transformers_model.TransformersModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str tokenizer: str | None = None subfolder: str | None = None revision: str = 'main' batch_size: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = None max_length: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = None model_loading_kwargs: dict = <factory> add_special_tokens: bool = True model_parallel: bool | None = None dtype: str | None = None device: typing.Union[int, str] = 'cuda' trust_remote_code: bool = False use_chat_template: bool = False compile: bool = False multichoice_continuations_start_space: bool | None = None pairwise_tokenization: bool = False )

Parameters

  • model_name (str) — HuggingFace Hub model ID or path to a pre-trained model. This corresponds to the pretrained_model_name_or_path argument in HuggingFace’s from_pretrained method.
  • tokenizer (str | None) — Optional HuggingFace Hub tokenizer ID. If not specified, uses the same ID as model_name. Useful when the tokenizer is different from the model (e.g., for multilingual models).
  • subfolder (str | None) — Subfolder within the model repository. Used when models are stored in subdirectories.
  • revision (str) — Git revision of the model to load. Defaults to “main”.
  • batch_size (PositiveInt | None) — Batch size for model inference. If None, will be automatically determined.
  • max_length (PositiveInt | None) — Maximum sequence length for the model. If None, uses model’s default.
  • model_loading_kwargs (dict) — Additional keyword arguments passed to from_pretrained. Defaults to empty dict.
  • add_special_tokens (bool) — Whether to add special tokens during tokenization. Defaults to True.
  • model_parallel (bool | None) — Whether to use model parallelism across multiple GPUs. If None, automatically determined based on available GPUs and model size.
  • dtype (str | None) — Data type for model weights. Can be “float16”, “bfloat16”, “float32”, “auto”, “4bit”, “8bit”. If “auto”, uses the model’s default dtype.
  • device (Union[int, str]) — Device to load the model on. Can be “cuda”, “cpu”, or GPU index. Defaults to “cuda”.
  • trust_remote_code (bool) — Whether to trust remote code when loading models. Defaults to False.
  • use_chat_template (bool) — Whether to use chat templates for conversation-style prompts. Defaults to False.
  • compile (bool) — Whether to compile the model using torch.compile for optimization. Defaults to False.
  • multichoice_continuations_start_space (bool | None) — Whether to add a space before multiple choice continuations. If None, uses model default. True forces adding space, False removes leading space if present.
  • pairwise_tokenization (bool) — Whether to tokenize context and continuation separately or together. Defaults to False.

Configuration class for HuggingFace Transformers models.

This configuration is used to load and configure models from the HuggingFace Transformers library.

Example:

config = TransformersModelConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    batch_size=4,
    dtype="float16",
    use_chat_template=True,
    generation_parameters=GenerationParameters(
        temperature=0.7,
        max_new_tokens=100
    )
)

Note: This configuration supports quantization (4-bit and 8-bit) through the dtype parameter. When using quantization, ensure you have the required dependencies installed (bitsandbytes for 4-bit/8-bit quantization).
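
For instance, a 4-bit quantized load might be configured as follows (a minimal sketch that reuses the documented dtype options and assumes bitsandbytes is installed):

config = TransformersModelConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    dtype="4bit",  # 4-bit quantization via bitsandbytes ("8bit" is also accepted)
    use_chat_template=True,
)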

class lighteval.models.transformers.adapter_model.AdapterModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str tokenizer: str | None = None subfolder: str | None = None revision: str = 'main' batch_size: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = None max_length: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = None model_loading_kwargs: dict = <factory> add_special_tokens: bool = True model_parallel: bool | None = None dtype: str | None = None device: typing.Union[int, str] = 'cuda' trust_remote_code: bool = False use_chat_template: bool = False compile: bool = False multichoice_continuations_start_space: bool | None = None pairwise_tokenization: bool = False base_model: str adapter_weights: bool )

Parameters

  • base_model (str) — HuggingFace Hub model ID or path to the base model. This is the original pre-trained model that the adapter was trained on.
  • adapter_weights (bool) — Flag indicating that this is an adapter model. Must be set to True.

Configuration class for PEFT (Parameter-Efficient Fine-Tuning) adapter models.

This configuration is used to load models that have been fine-tuned using PEFT adapters, such as LoRA, AdaLoRA, or other parameter-efficient fine-tuning methods. The adapter weights are merged with the base model during loading for efficient inference.
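
Example (a minimal sketch; the adapter repository name is hypothetical, and model_name is assumed to point to the adapter while base_model points to the original pre-trained model):

config = AdapterModelConfig(
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    model_name="your-username/llama-3.1-8b-lora-adapter",  # hypothetical LoRA adapter repo
    adapter_weights=True,
    dtype="float16",
)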

Note:

  • Requires the peft library to be installed: pip install lighteval[adapters]
  • Adapter models use the base model (the parent) for their tokenizer and config

class lighteval.models.transformers.delta_model.DeltaModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str tokenizer: str | None = None subfolder: str | None = None revision: str = 'main' batch_size: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = None max_length: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = None model_loading_kwargs: dict = <factory> add_special_tokens: bool = True model_parallel: bool | None = None dtype: str | None = None device: typing.Union[int, str] = 'cuda' trust_remote_code: bool = False use_chat_template: bool = False compile: bool = False multichoice_continuations_start_space: bool | None = None pairwise_tokenization: bool = False base_model: str delta_weights: bool )

Parameters

  • base_model (str) — HuggingFace Hub model ID or path to the base model. This is the original pre-trained model that the delta was computed from.
  • delta_weights (bool) — Flag indicating that this is a delta model. Must be set to True.

Configuration class for delta models (weight difference models).

This configuration is used to load models that represent the difference between a fine-tuned model and its base model. The delta weights are added to the base model during loading to reconstruct the full fine-tuned model.
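
Example (a minimal sketch; the delta repository name is hypothetical, and model_name is assumed to point to the delta weights while base_model points to the original pre-trained model):

config = DeltaModelConfig(
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    model_name="your-username/llama-3.1-8b-delta",  # hypothetical delta-weights repo
    delta_weights=True,
    dtype="float16",
)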

VLLM Model

class lighteval.models.vllm.vllm_model.VLLMModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str revision: str = 'main' dtype: str = 'bfloat16' tensor_parallel_size: typing.Annotated[int, Gt(gt=0)] = 1 data_parallel_size: typing.Annotated[int, Gt(gt=0)] = 1 pipeline_parallel_size: typing.Annotated[int, Gt(gt=0)] = 1 gpu_memory_utilization: typing.Annotated[float, Ge(ge=0)] = 0.9 max_model_length: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = None quantization: str | None = None load_format: str | None = None swap_space: typing.Annotated[int, Gt(gt=0)] = 4 seed: typing.Annotated[int, Ge(ge=0)] = 1234 trust_remote_code: bool = False add_special_tokens: bool = True multichoice_continuations_start_space: bool = True pairwise_tokenization: bool = False max_num_seqs: typing.Annotated[int, Gt(gt=0)] = 128 max_num_batched_tokens: typing.Annotated[int, Gt(gt=0)] = 2048 subfolder: str | None = None use_chat_template: bool = False is_async: bool = False )

Parameters

  • model_name (str) — HuggingFace Hub model ID or path to the model to load.
  • revision (str) — Git revision of the model. Defaults to “main”.
  • dtype (str) — Data type for model weights. Defaults to “bfloat16”. Options: “float16”, “bfloat16”, “float32”.
  • tensor_parallel_size (PositiveInt) — Number of GPUs to use for tensor parallelism. Defaults to 1.
  • data_parallel_size (PositiveInt) — Number of GPUs to use for data parallelism. Defaults to 1.
  • pipeline_parallel_size (PositiveInt) — Number of GPUs to use for pipeline parallelism. Defaults to 1.
  • gpu_memory_utilization (NonNegativeFloat) — Fraction of GPU memory to use. Lower this if running out of memory. Defaults to 0.9.
  • max_model_length (PositiveInt | None) — Maximum sequence length for the model. If None, automatically inferred. Reduce this if encountering OOM issues (4096 is usually sufficient).
  • quantization (str | None) — Quantization method to use (for example “awq” or “gptq”). If None, the model is loaded without quantization.
  • load_format (str | None) — The format of the model weights to load. choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer.
  • swap_space (PositiveInt) — CPU swap space size in GiB per GPU. Defaults to 4.
  • seed (NonNegativeInt) — Random seed for reproducibility. Defaults to 1234.
  • trust_remote_code (bool) — Whether to trust remote code when loading models. Defaults to False.
  • add_special_tokens (bool) — Whether to add special tokens during tokenization. Defaults to True.
  • multichoice_continuations_start_space (bool) — Whether to add a space before multiple choice continuations. Defaults to True.
  • pairwise_tokenization (bool) — Whether to tokenize context and continuation separately for loglikelihood evals. Defaults to False.
  • max_num_seqs (PositiveInt) — Maximum number of sequences per iteration. Controls batch size at prefill stage. Defaults to 128.
  • max_num_batched_tokens (PositiveInt) — Maximum number of tokens per batch. Defaults to 2048.
  • subfolder (str | None) — Subfolder within the model repository. Defaults to None.
  • use_chat_template (bool) — Whether to use chat templates for conversation-style prompts. Defaults to False.
  • is_async (bool) — Whether to use the async version of VLLM. Defaults to False.

Configuration class for VLLM inference engine.

This configuration is used to load and configure models using the VLLM inference engine, which provides high-performance inference for large language models with features like PagedAttention, continuous batching, and efficient memory management.

vllm doc: https://docs.vllm.ai/en/v0.7.1/serving/engine_args.html

Example:

config = VLLMModelConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
    max_model_length=4096,
    generation_parameters=GenerationParameters(
        temperature=0.7,
        max_new_tokens=100
    )
)

SGLang Model

class lighteval.models.sglang.sglang_model.SGLangModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str load_format: str = 'auto' dtype: str = 'auto' tp_size: typing.Annotated[int, Gt(gt=0)] = 1 dp_size: typing.Annotated[int, Gt(gt=0)] = 1 context_length: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = None random_seed: typing.Optional[typing.Annotated[int, Gt(gt=0)]] = 1234 trust_remote_code: bool = False use_chat_template: bool = False device: str = 'cuda' skip_tokenizer_init: bool = False kv_cache_dtype: str = 'auto' add_special_tokens: bool = True pairwise_tokenization: bool = False sampling_backend: str | None = None attention_backend: str | None = None mem_fraction_static: typing.Annotated[float, Gt(gt=0)] = 0.8 chunked_prefill_size: typing.Annotated[int, Gt(gt=0)] = 4096 )

Parameters

  • model_name (str) — HuggingFace Hub model ID or path to the model to load.
  • load_format (str) — The format of the model weights to load. choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer.
  • dtype (str) — Data type for model weights. Defaults to “auto”. Options: “auto”, “float16”, “bfloat16”, “float32”.
  • tp_size (PositiveInt) — Number of GPUs to use for tensor parallelism. Defaults to 1.
  • dp_size (PositiveInt) — Number of GPUs to use for data parallelism. Defaults to 1.
  • context_length (PositiveInt | None) — Maximum context length for the model.
  • random_seed (PositiveInt | None) — Random seed for reproducibility. Defaults to 1234.
  • trust_remote_code (bool) — Whether to trust remote code when loading models. Defaults to False.
  • use_chat_template (bool) — Whether to use chat templates for conversation-style prompts. Defaults to False.
  • device (str) — Device to load the model on. Defaults to “cuda”.
  • skip_tokenizer_init (bool) — Whether to skip tokenizer initialization. Defaults to False.
  • kv_cache_dtype (str) — Data type for key-value cache. Defaults to “auto”.
  • add_special_tokens (bool) — Whether to add special tokens during tokenization. Defaults to True.
  • pairwise_tokenization (bool) — Whether to tokenize context and continuation separately for loglikelihood evals. Defaults to False.
  • sampling_backend (str | None) — Sampling backend to use. If None, uses default.
  • attention_backend (str | None) — Attention backend to use. If None, uses default.
  • mem_fraction_static (PositiveFloat) — Fraction of GPU memory to use for static allocation. Defaults to 0.8.
  • chunked_prefill_size (PositiveInt) — Size of chunks for prefill operations. Defaults to 4096.

Configuration class for SGLang inference engine.

This configuration is used to load and configure models using the SGLang inference engine, which provides high-performance inference.

sglang doc: https://docs.sglang.ai/index.html#

Example:

config = SGLangModelConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=2,
    context_length=8192,
    generation_parameters=GenerationParameters(
        temperature=0.7,
        max_new_tokens=100
    )
)

Dummy Model

class lighteval.models.dummy.dummy_model.DummyModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None seed: int = 42 )

Parameters

  • seed (int) — Random seed for reproducible dummy responses. Defaults to 42. This seed controls the randomness of the generated responses and log probabilities.

Configuration class for dummy models used for testing and baselines.

This configuration is used to create dummy models that generate random responses or baselines for evaluation purposes. Useful for testing evaluation pipelines without requiring actual model inference.

Example:

config = DummyModelConfig(
    seed=123,
)

Endpoints-based Models

Inference Providers Model

class lighteval.models.endpoints.inference_providers_model.InferenceProvidersModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str provider: str timeout: int | None = None proxies: typing.Optional[typing.Any] = None org_to_bill: str | None = None parallel_calls_count: typing.Annotated[int, Ge(ge=0)] = 10 )

Parameters

  • model_name (str) — Name or identifier of the model to use.
  • provider (str) — Name of the inference provider. Examples: “together”, “anyscale”, “runpod”, etc.
  • timeout (int | None) — Request timeout in seconds. If None, uses provider default.
  • proxies (Any | None) — Proxy configuration for requests. Can be a dict or proxy URL string.
  • org_to_bill (str | None) — Organization to bill for API usage. If None, bills the user’s account.
  • parallel_calls_count (NonNegativeInt) — Number of parallel API calls to make. Defaults to 10. Higher values increase throughput but may hit rate limits.

Configuration class for HuggingFace’s inference providers (like Together AI, Anyscale, etc.).

inference providers doc: https://huggingface.co/docs/inference-providers/en/index

Example:

config = InferenceProvidersModelConfig(
    model_name="deepseek-ai/DeepSeek-R1-0528",
    provider="together",
    parallel_calls_count=5,
    generation_parameters=GenerationParameters(
        temperature=0.7,
        max_new_tokens=100
    )
)

Note:

  • Requires a Hugging Face API token to be set as an environment variable (typically HF_TOKEN; see the sketch below)
  • Different providers have different rate limits and pricing
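
A minimal sketch of supplying the token from Python (this assumes your environment reads the standard HF_TOKEN variable):

import os

os.environ["HF_TOKEN"] = "hf_..."  # or export HF_TOKEN in your shell before running lighteval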

InferenceEndpointModel

class lighteval.models.endpoints.endpoint_model.InferenceEndpointModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None endpoint_name: str | None = None model_name: str | None = None reuse_existing: bool = False accelerator: str = 'gpu' dtype: str | None = None vendor: str = 'aws' region: str = 'us-east-1' instance_size: str | None = None instance_type: str | None = None framework: str = 'pytorch' endpoint_type: str = 'protected' add_special_tokens: bool = True revision: str = 'main' namespace: str | None = None image_url: str | None = None env_vars: dict | None = None batch_size: int = 1 )

Parameters

  • endpoint_name (str | None) — Name for the inference endpoint. If None, auto-generated from model_name.
  • model_name (str | None) — HuggingFace Hub model ID to deploy. Required if endpoint_name is None.
  • reuse_existing (bool) — Whether to reuse an existing endpoint with the same name. Defaults to False.
  • accelerator (str) — Type of accelerator to use. Defaults to “gpu”. Options: “gpu”, “cpu”.
  • dtype (str | None) — Model data type. If None, uses model default. Options: “float16”, “bfloat16”, “awq”, “gptq”, “8bit”, “4bit”.
  • vendor (str) — Cloud vendor for the endpoint. Defaults to “aws”. Options: “aws”, “azure”, “gcp”.
  • region (str) — Cloud region for the endpoint. Defaults to “us-east-1”.
  • instance_size (str | None) — Instance size for the endpoint. If None, auto-scaled.
  • instance_type (str | None) — Instance type for the endpoint. If None, auto-scaled.
  • framework (str) — ML framework to use. Defaults to “pytorch”.
  • endpoint_type (str) — Type of endpoint. Defaults to “protected”. Options: “protected”, “public”.
  • add_special_tokens (bool) — Whether to add special tokens during tokenization. Defaults to True.
  • revision (str) — Git revision of the model. Defaults to “main”.
  • namespace (str | None) — Namespace for the endpoint. If None, uses current user’s namespace.
  • image_url (str | None) — Custom Docker image URL. If None, uses default TGI image.
  • env_vars (dict | None) — Additional environment variables for the endpoint.
  • batch_size (int) — Batch size for requests. Defaults to 1.

Configuration class for HuggingFace Inference Endpoints (dedicated infrastructure).

This configuration is used to create and manage dedicated inference endpoints on HuggingFace’s infrastructure. These endpoints provide dedicated compute resources and can handle larger batch sizes and higher throughput.

Methods:

  • model_post_init() — Validates configuration and ensures proper parameter combinations.
  • get_dtype_args() — Returns environment variables for dtype configuration.
  • get_custom_env_vars() — Returns custom environment variables for the endpoint.

Example:

config = InferenceEndpointModelConfig(
    model_name="microsoft/DialoGPT-medium",
    instance_type="nvidia-a100",
    instance_size="x1",
    vendor="aws",
    region="us-east-1",
    dtype="float16",
    generation_parameters=GenerationParameters(
        temperature=0.7,
        max_new_tokens=100
    )
)

Note:

  • Creates dedicated infrastructure for model inference
  • Supports various quantization methods and hardware configurations
  • Auto-scaling available for optimal resource utilization
  • Requires HuggingFace Pro subscription for most features
  • Endpoints can take several minutes to start up
  • Billed based on compute usage and duration

class lighteval.models.endpoints.endpoint_model.ServerlessEndpointModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str add_special_tokens: bool = True batch_size: int = 1 )

Parameters

  • model_name (str) — HuggingFace Hub model ID to use with the Inference API. Example: “meta-llama/Llama-3.1-8B-Instruct”
  • add_special_tokens (bool) — Whether to add special tokens during tokenization. Defaults to True.
  • batch_size (int) — Batch size for requests. Defaults to 1 (serverless API limitation).

Configuration class for HuggingFace's serverless Inference API.

https://huggingface.co/inference-endpoints/dedicated

Example:

config = ServerlessEndpointModelConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    generation_parameters=GenerationParameters(
        temperature=0.7,
        max_new_tokens=100
    )
)

TGI ModelClient

class lighteval.models.endpoints.tgi_model.TGIModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None inference_server_address: str | None inference_server_auth: str | None model_name: str | None )

Parameters

  • inference_server_address (str | None) — Address of the TGI server. Format: “http://host:port” or “https://host:port”. Example: “http://localhost:8080”
  • inference_server_auth (str | None) — Authentication token for the TGI server. If None, no authentication is used.
  • model_name (str | None) — Optional model name override. If None, uses the model name from server info.

Configuration class for Text Generation Inference (TGI) backend.

doc: https://huggingface.co/docs/text-generation-inference/en/index

This configuration is used to connect to TGI servers that serve HuggingFace models using the text-generation-inference library. TGI provides high-performance inference with features like continuous batching and efficient memory management.

Example:

config = TGIModelConfig(
    inference_server_address="http://localhost:8080",
    inference_server_auth="your-auth-token",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    generation_parameters=GenerationParameters(
        temperature=0.7,
        max_new_tokens=100
    )
)

Litellm Model

class lighteval.models.litellm_model.LiteLLMModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str provider: str | None = None base_url: str | None = None api_key: str | None = None )

Parameters

  • model_name (str) — Model identifier. Can be a bare model name (e.g., "gpt-4", "claude-3-sonnet") or use the provider/model format (e.g., "openai/gpt-4", "anthropic/claude-3-sonnet").
  • provider (str | None) — Optional provider name override. If None, inferred from model_name. Examples: “openai”, “anthropic”, “google”, “cohere”, etc.
  • base_url (str | None) — Custom base URL for the API. If None, uses provider’s default URL. Useful for using custom endpoints or local deployments.
  • api_key (str | None) — API key for authentication. If None, reads from environment variables. Environment variable names are provider-specific (e.g., OPENAI_API_KEY).

Configuration class for LiteLLM unified API client.

This configuration is used to connect to various LLM providers through the LiteLLM unified API. LiteLLM provides a consistent interface to multiple providers including OpenAI, Anthropic, Google, and many others.

litellm doc: https://docs.litellm.ai/docs/

Example:

config = LiteLLMModelConfig(
    model_name="gpt-4",
    provider="openai",
    base_url="https://api.openai.com/v1",
    generation_parameters=GenerationParameters(
        temperature=0.7,
        max_new_tokens=100
    )
)

Custom Model

class lighteval.models.custom.custom_model.CustomModelConfig

( generation_parameters: GenerationParameters = GenerationParameters(early_stopping=None, repetition_penalty=None, frequency_penalty=None, length_penalty=None, presence_penalty=None, max_new_tokens=None, min_new_tokens=None, seed=None, stop_tokens=None, temperature=0, top_k=None, min_p=None, top_p=None, truncate_prompt=None, response_format=None) system_prompt: str | None = None model_name: str model_definition_file_path: str )

Parameters

  • model_name (str) — An identifier for the model. This can be used to track which model was evaluated in the results and logs.
  • model_definition_file_path (str) — Path to a Python file containing the custom model implementation. This file must define exactly one class that inherits from LightevalModel. The class should implement all required methods from the LightevalModel interface.

Configuration class for loading custom model implementations in Lighteval.

This config allows users to define and load their own model implementations by specifying a Python file containing a custom model class that inherits from LightevalModel.

The custom model file should contain exactly one class that inherits from LightevalModel. This class will be automatically detected and instantiated when loading the model.

Example usage:

# Define config
config = CustomModelConfig(
    model="my-custom-model",
    model_definition_file_path="path/to/my_model.py"
)

# Example custom model file (my_model.py):
from lighteval.models.abstract_model import LightevalModel
from lighteval.models.model_output import ModelResponse  # import paths may vary across lighteval versions
from lighteval.tasks.requests import Doc

class MyCustomModel(LightevalModel):
    def __init__(self, config, env_config):
        super().__init__(config, env_config)
        # Custom initialization...

    def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
        # Custom generation logic...
        pass

    def loglikelihood(self, docs: list[Doc]) -> list[ModelResponse]:
        # Custom loglikelihood logic...
        pass

An example of a custom model can be found in examples/custom_models/google_translate_model.py.
