Lighteval documentation

Tasks


LightevalTask

LightevalTaskConfig

class lighteval.tasks.lighteval_task.LightevalTaskConfig


( name: str prompt_function: typing.Callable[[dict, str], lighteval.tasks.requests.Doc] hf_repo: str hf_subset: str metrics: list[lighteval.metrics.utils.metric_utils.Metric] | tuple[lighteval.metrics.utils.metric_utils.Metric, ...] hf_revision: str | None = None hf_filter: typing.Optional[typing.Callable[[dict], bool]] = None hf_avail_splits: list[str] | tuple[str, ...] = <factory> trust_dataset: bool = False evaluation_splits: list[str] | tuple[str, ...] = <factory> few_shots_split: str | None = None few_shots_select: str | None = None generation_size: int | None = None generation_grammar: huggingface_hub.inference._generated.types.text_generation.TextGenerationInputGrammarType | None = None stop_sequence: list[str] | tuple[str, ...] | None = None num_samples: list[int] | None = None suite: list[str] | tuple[str, ...] = <factory> original_num_docs: int = -1 effective_num_docs: int = -1 must_remove_duplicate_docs: bool = False num_fewshots: int = 0 truncate_fewshots: bool = False version: int = 0 )

Parameters

  • name (str) — Short name of the evaluation task.
  • suite (list[str]) — Evaluation suites to which the task belongs.
  • prompt_function (Callable[[dict, str], Doc]) — Function used to create the Doc samples from each line of the evaluation dataset.
  • hf_repo (str) — Path of the hub dataset repository containing the evaluation information.
  • hf_subset (str) — Subset used for the current task; will be default if none is selected.
  • hf_avail_splits (list[str]) — All the splits available in the evaluation dataset.
  • evaluation_splits (list[str]) — List of the splits actually used for this evaluation.
  • few_shots_split (str) — Name of the split from which to sample few-shot examples.
  • few_shots_select (str) — Method with which to sample few-shot examples.
  • generation_size (int) — Maximum allowed size of the generation.
  • generation_grammar (TextGenerationInputGrammarType) — The grammar to generate completions according to. Currently only available for TGI and Inference Endpoints models.
  • metrics (list[Metric]) — List of all the metrics for the current task.
  • stop_sequence (list[str]) — Stop sequence which interrupts the generation for generative metrics.
  • original_num_docs (int) — Number of documents in the task.
  • effective_num_docs (int) — Number of documents used in a specific evaluation.
  • truncated_num_docs (bool) — Whether less than the total number of documents were used.
  • trust_dataset (bool) — Whether to trust the dataset at execution or not.
  • version (int) — The version of the task. Defaults to 0. Can be increased if the underlying dataset or the prompt changes.

Stored configuration of a given LightevalTask.
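
As a rough sketch, a task config pairs a prompt function with a dataset and metrics. The dataset repository, column names, and metric choice below are illustrative placeholders, not part of lighteval:

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def prompt_fn(line: dict, task_name: str) -> Doc:
    # Build one evaluation sample (Doc) from one dataset row.
    # The column names "question", "choices" and "answer" are placeholders.
    return Doc(
        query=line["question"],
        choices=line["choices"],
        gold_index=line["answer"],
        task_name=task_name,
    )

my_task = LightevalTaskConfig(
    name="my_task",
    prompt_function=prompt_fn,
    hf_repo="your-org/your-dataset",      # placeholder dataset repository
    hf_subset="default",
    metrics=[Metrics.loglikelihood_acc],  # assumed metric; use any entry from Metrics
    evaluation_splits=("test",),
    few_shots_split="train",
    suite=("community",),
)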

LightevalTask

class lighteval.tasks.lighteval_task.LightevalTask


( config: LightevalTaskConfig )

aggregation


( )

Returns a dict mapping each metric name to its aggregation function, for all metrics of the task.

eval_docs


( ) list[Doc]

Returns

list[Doc]

Evaluation documents.

Returns the evaluation documents.

fewshot_docs


( ) list[Doc]

Returns

list[Doc]

Documents that will be used for few shot examples. One document = one few shot example.

Returns the few shot documents. If the few shot documents are not available, it gets them from the few shot split or the evaluation split.

get_first_possible_fewshot_splits


( available_splits: list[str] | tuple[str, ...] ) str

Returns

str

The first available few-shot split, or None if nothing is available.

Parses the possible few-shot split keys in order (train, then validation keys) and matches them against the available keys, returning the first match.

load_datasets


( tasks: dict dataset_loading_processes: int = 1 )

Parameters

  • tasks (dict) — The tasks to load datasets for.
  • dataset_loading_processes (int, optional) — Number of processes to use for dataset loading. Defaults to 1.

Load datasets from the Hugging Face Hub for the given tasks.

PromptManager

class lighteval.tasks.prompt_manager.PromptManager


( use_chat_template: bool = False tokenizer = None system_prompt: str | None = None )

prepare_prompt


( doc: Doc )

Prepare a prompt from a document, either using chat template or plain text format.

prepare_prompt_api


( doc: Doc )

Prepare a prompt for API calls, using a chat-like format. Will not tokenize the message because APIs will usually handle this.
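
A small usage sketch of both methods. The model name is only an example, and the exact prompt strings depend on the task and the tokenizer's chat template:

from transformers import AutoTokenizer
from lighteval.tasks.prompt_manager import PromptManager
from lighteval.tasks.requests import Doc

doc = Doc(
    query="What is the capital of France?",
    choices=["London", "Paris", "Berlin", "Madrid"],
    gold_index=1,
)

# Plain-text prompt, no chat template applied.
plain_prompt = PromptManager().prepare_prompt(doc)

# Chat-template prompt; requires a tokenizer that defines a chat template.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")  # example model
chat_prompt = PromptManager(
    use_chat_template=True, tokenizer=tokenizer, system_prompt="Answer concisely."
).prepare_prompt(doc)

# Chat-style messages for API backends; the API is expected to tokenize them itself.
api_messages = PromptManager(system_prompt="Answer concisely.").prepare_prompt_api(doc)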

Registry

class lighteval.tasks.registry.Registry


( custom_tasks: str | pathlib.Path | module | None = None )

The Registry class is used to manage the task registry and get task classes.

create_custom_tasks_module


( custom_tasks: str | pathlib.Path | module ) ModuleType

Parameters

  • custom_tasks (Optional[Union[str, ModuleType]]) — Path to the custom tasks file, the name of a module to import containing custom tasks, or the module itself.

Returns

ModuleType

The newly imported/created custom tasks module.

Creates a custom task module to load tasks defined by the user in their own file.
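
For instance, user tasks can be registered by pointing the registry at a file or module. The file path below is hypothetical; by convention the file exposes its task configs in a module-level TASKS_TABLE list:

from lighteval.tasks.registry import Registry

# From a file path (hypothetical path).
registry = Registry(custom_tasks="path/to/custom_tasks.py")

# Or from an already-imported module:
# import my_custom_tasks
# registry = Registry(custom_tasks=my_custom_tasks)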

create_task_config_dict


( meta_table: list[lighteval.tasks.lighteval_task.LightevalTaskConfig] | None = None ) Dict[str, LightevalTask]

Parameters

  • meta_table — meta_table containing tasks configurations. If not provided, it will be loaded from TABLE_PATH.
  • cache_dir — Directory to store cached data. If not provided, the default cache directory will be used.

Returns

Dict[str, LightevalTask]

A dictionary of task names mapped to their corresponding LightevalTask classes.

Create task configurations based on the provided meta_table.

expand_task_definition


( task_definition: str ) list[str]

Parameters

  • task_definition (str) — Task definition to expand. In format:
    • suite|task
    • suite|task_superset (e.g. lighteval|mmlu, which runs all the mmlu subtasks)

Returns

list[str]

List of task names (suite|task)
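
A short illustration; the task names are examples only:

from lighteval.tasks.registry import Registry

registry = Registry()

# A plain "suite|task" definition expands to itself.
registry.expand_task_definition("lighteval|gsm8k")   # -> ["lighteval|gsm8k"]

# A superset expands to all of its subtasks, e.g. "lighteval|mmlu:abstract_algebra", ...
mmlu_tasks = registry.expand_task_definition("lighteval|mmlu")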

get_tasks_configs


( task: str )

The task argument is a string of the form “suite|task|few_shot|truncate_few_shots,suite|task|few_shot|truncate_few_shots”.

Returns LightevalTaskConfig objects based on the task names and the few_shot and truncate_few_shots values.
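
For example (task names and few-shot counts are illustrative, assuming one config per comma-separated entry):

from lighteval.tasks.registry import Registry

registry = Registry()
configs = registry.get_tasks_configs("lighteval|gsm8k|5|0,lighteval|mmlu|0|0")
for config in configs:
    print(config.name, config.num_fewshots, config.truncate_fewshots)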

print_all_tasks


( )

Print all the tasks in the task registry.

taskinfo_selector


( tasks: str ) tuple[list[str], dict[str, list[tuple[int, bool]]]]

Parameters

  • tasks (str) — A string containing a comma-separated list of task definitions, where each task_definition can be:
    • path to a file containing a list of tasks (one per line)
    • task group defined in TASKS_GROUPS dict in custom tasks file
    • task name with few shot in format “suite|task|few_shot|truncate_few_shots”
    • task superset in format “suite|task_superset|few_shot|truncate_few_shots” (superset will run all tasks with format “suite|task_superset:{subset}|few_shot|truncate_few_shots”)

Returns

tuple[list[str], dict[str, list[tuple[int, bool]]]]

A tuple containing:

  • A sorted list of unique task names in the format “suite|task”.
  • A dictionary mapping each task name to a list of tuples representing the few_shot and truncate_few_shots values.

Converts an input string of task names into task information usable by lighteval.
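
A sketch of the expected shape of the result; the task names are illustrative, and the exact output is inferred from the documented return type:

from lighteval.tasks.registry import Registry

registry = Registry()
task_names, fewshots = registry.taskinfo_selector("lighteval|gsm8k|5|0,lighteval|gsm8k|0|1")
# task_names -> ["lighteval|gsm8k"]                            # sorted unique "suite|task" names
# fewshots   -> {"lighteval|gsm8k": [(5, False), (0, True)]}   # (few_shot, truncate_few_shots) pairs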

Doc

class lighteval.tasks.requests.Doc


( query: str choices: list gold_index: typing.Union[int, list[int]] instruction: str | None = None images: list['Image'] | None = None specific: dict | None = None unconditioned_query: str | None = None original_query: str | None = None id: str = '' task_name: str = '' num_asked_few_shots: int = 0 num_effective_few_shots: int = 0 fewshot_samples: list = <factory> sampling_methods: list = <factory> fewshot_sorting_class: str | None = None generation_size: int | None = None stop_sequences: list[str] | None = None use_logits: bool = False num_samples: int = 1 generation_grammar: None = None )

Parameters

  • query (str) — The main query, prompt, or question to be sent to the model.
  • choices (list[str]) — List of possible answer choices for the query. For multiple choice tasks, this contains all options (A, B, C, D, etc.). For generative tasks, this may be empty or contain reference answers.
  • gold_index (Union[int, list[int]]) — Index or indices of the correct answer(s) in the choices list. For a single correct answer, use an int (e.g., 0 for the first choice). For multiple correct answers, use a list (e.g., [0, 2] for the first and third).
  • instruction (str | None) — System prompt or task-specific instructions to guide the model. This is typically prepended to the query to set context or behavior.
  • images (list[“Image”] | None) — List of PIL Image objects for multimodal tasks.
  • specific (dict | None) — Task-specific information or metadata. Can contain any additional data needed for evaluation.
  • unconditioned_query (Optional[str]) — Query without task-specific context for PMI normalization. Used to calculate: log P(choice | Query) - log P(choice | Unconditioned Query).
  • original_query (str | None) — The query before any preprocessing or modification.
  Fields set by the task:
  • id (str) — Unique identifier for this evaluation instance. Set by the task and not the user.
  • task_name (str) — Name of the task or benchmark this Doc belongs to.
  Few-shot learning parameters:
  • num_asked_few_shots (int) — Number of few-shot examples requested for this instance.
  • num_effective_few_shots (int) — Actual number of few-shot examples used (may differ from requested).
  • fewshot_samples (list) — List of Doc objects representing few-shot examples. These examples are prepended to the main query to provide context.
  • sampling_methods (list[SamplingMethod]) — List of sampling methods to use for this instance. Options: GENERATIVE, LOGPROBS, PERPLEXITY.
  • fewshot_sorting_class (Optional[str]) — Class label for balanced few-shot example selection. Used to ensure diverse representation in few-shot examples.
  Generation control parameters:
  • generation_size (int | None) — Maximum number of tokens to generate for this instance.
  • stop_sequences (list[str] | None) — List of strings that should stop generation when encountered. Used for: Controlled generation, preventing unwanted continuations.
  • use_logits (bool) — Whether to return logits (raw model outputs) in addition to text. Used for: Probability analysis, confidence scoring, detailed evaluation.
  • num_samples (int) — Number of different samples to generate for this instance. Used for: Diversity analysis, uncertainty estimation, ensemble methods.
  • generation_grammar (None) — Grammar constraints for generation (currently not implemented). Reserved for: Future structured generation features.

Dataclass representing a single evaluation sample for a benchmark.

This class encapsulates all the information needed to evaluate a model on a single task instance. It contains the input query, expected outputs, metadata, and configuration parameters for different types of evaluation tasks.

Required Fields:

  • query: The input prompt or question
  • choices: Available answer choices (for multiple choice tasks)
  • gold_index: Index(es) of the correct answer(s)

Optional Fields:

  • instruction: Task-specific system prompt. Will be appended to the model-specific system prompt.
  • images: Visual inputs for multimodal tasks.

Methods: get_golds(): Returns the correct answer(s) as strings based on gold_index. Handles both single and multiple correct answers.

Usage Examples:

Multiple Choice Question:

doc = Doc(
    query="What is the capital of France?",
    choices=["London", "Paris", "Berlin", "Madrid"],
    gold_index=1,  # Paris is the correct answer
    instruction="Answer the following geography question:",
)

Generative Task:

doc = Doc(
    query="Write a short story about a robot.",
    choices=[],  # No predefined choices for generative tasks
    gold_index=0,  # Not used for generative tasks
    generation_size=100,
    stop_sequences=["

End"],
)

Few-shot Learning:

doc = Doc(
    query="Translate 'Hello world' to Spanish.",
    choices=["Hola mundo", "Bonjour monde", "Ciao mondo"],
    gold_index=0,
    fewshot_samples=[
        Doc(query="Translate 'Good morning' to Spanish.",
            choices=["Buenos días", "Bonjour", "Buongiorno"],
            gold_index=0),
        Doc(query="Translate 'Thank you' to Spanish.",
            choices=["Gracias", "Merci", "Grazie"],
            gold_index=0)
    ],
)

Multimodal Task:

doc = Doc(
    query="What is shown in this image?",
    choices=["A cat"],
    gold_index=0,
    images=[pil_image],  # PIL Image object
)

get_golds


( )

Returns the gold targets (the correct answers as strings) extracted from the choices using gold_index.
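
For example, with the multiple choice Doc from above (assuming a list is returned in both the single- and multi-answer cases):

from lighteval.tasks.requests import Doc

doc = Doc(
    query="What is the capital of France?",
    choices=["London", "Paris", "Berlin", "Madrid"],
    gold_index=1,
)
doc.get_golds()  # -> ["Paris"]

multi = Doc(query="Select the prime numbers.", choices=["2", "3", "4"], gold_index=[0, 1])
multi.get_golds()  # -> ["2", "3"]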

Datasets

class lighteval.data.DynamicBatchDataset


( requests: list num_dataset_splits: int )

get_original_order


( new_arr: list ) list

Parameters

  • new_arr (list) — Array containing any kind of data that needs to be reset in the original order.

Returns

list

new_arr in the original order.

Get the original order of the data.

splits_iterator


( ) Subset

Yields

Subset

Iterator that yields the dataset splits based on the split limits.

class lighteval.data.LoglikelihoodDataset


( requests: list num_dataset_splits: int )

class lighteval.data.GenerativeTaskDataset


( requests: list num_dataset_splits: int )

init_split_limits


( num_dataset_splits ) type

Parameters

  • num_dataset_splits (int) — Number of splits to divide the dataset into.

Returns

type

The computed split limits.

Initialises the split limits based on generation parameters. The splits are used to estimate time remaining when evaluating, and in the case of generative evaluations, to group similar samples together.

For generative tasks, self._sorting_criteria outputs:

  • a boolean (whether the generation task uses logits)
  • a list (the stop sequences)
  • the item length (the actual size sorting factor).

In the current function, we create evaluation groups by generation parameters (logits and eos), so that samples with similar properties get batched together afterwards. The samples will then be further organised by length in each split.
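
The grouping idea can be sketched outside lighteval with plain Python. This is a conceptual illustration, not the actual implementation: sort samples by their generation parameters, then by length within each group.

from itertools import groupby

samples = [
    {"use_logits": False, "stop": ("\n",), "length": 120},
    {"use_logits": False, "stop": ("\n",), "length": 40},
    {"use_logits": True, "stop": (), "length": 80},
]

def group_key(sample):
    # Samples sharing generation parameters can be batched together.
    return (sample["use_logits"], sample["stop"])

# Sort by generation parameters first, then by length within each group.
samples.sort(key=lambda s: (group_key(s), s["length"]))
splits = [list(group) for _, group in groupby(samples, key=group_key)]
# splits[0]: the two samples without logits and with stop "\n", ordered by length (40, 120)
# splits[1]: the single logits sample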

class lighteval.data.GenerativeTaskDatasetNanotron


( requests: list num_dataset_splits: int )

class lighteval.data.GenDistributedSampler


( dataset: Dataset num_replicas: typing.Optional[int] = None rank: typing.Optional[int] = None shuffle: bool = True seed: int = 0 drop_last: bool = False )

A distributed sampler that copies the last element only when drop_last is False, so we keep a small padding in the batches, as our samples are sorted by length.
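
Conceptually (this is an illustration, not the sampler's actual code), padding with the last index keeps every replica the same size without wrapping back to the short samples at the start of a length-sorted dataset:

def pad_indices(indices: list[int], num_replicas: int) -> list[int]:
    # Pad by repeating the last index so len(indices) is divisible by num_replicas.
    remainder = len(indices) % num_replicas
    if remainder:
        indices = indices + [indices[-1]] * (num_replicas - remainder)
    return indices

print(pad_indices([0, 1, 2, 3, 4], num_replicas=2))  # [0, 1, 2, 3, 4, 4]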
