Lighteval documentation

Model’s Output

All models generate one output per Doc supplied to the generation or loglikelihood functions.

class lighteval.models.model_output.ModelResponse

( input: str | list | None = None, text: list = <factory>, logprobs: list = <factory>, argmax_logits_eq_gold: list = <factory>, logits: list[list[float]] | None = None, truncated_tokens_count: int = 0, padded_tokens_count: int = 0, input_tokens: list = <factory>, output_tokens: list = <factory>, unconditioned_logprobs: typing.Optional[list[float]] = None )

Parameters

  • input (str | list | None) — The original input prompt or context that was fed to the model. Used for debugging and analysis purposes.
  • text (list[str]) — The generated text responses from the model. Each element represents one generation (useful when num_samples > 1). Required for: generative metrics, exact match, LLM-as-a-judge, etc.
  • logprobs (list[float]) — Log probabilities of the generated tokens or sequences. Required for: loglikelihood and perplexity metrics.
  • argmax_logits_eq_gold (list[bool]) — Whether the argmax logits match the gold/expected text. Used for accuracy calculations in multiple choice and classification tasks. Required for: certain loglikelihood metrics.
  • unconditioned_logprobs (Optional[list[float]]) — Log probabilities from an unconditioned model (e.g., without context). Used for PMI (Pointwise Mutual Information) normalization. Required for: PMI metrics.

A class to represent the response from a model during evaluation.

This dataclass contains all the information returned by a model during inference, including generated text, log probabilities, token information, and metadata. Different attributes are required for different types of evaluation metrics.

Usage Examples:

For generative tasks (text completion, summarization):

from lighteval.models.model_output import ModelResponse

response = ModelResponse(
    text=["The capital of France is Paris."],
    input_tokens=[1, 2, 3, 4],
    output_tokens=[[5, 6, 7, 8]]
)
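
The text field always holds one string per sample, so even a single generation is read by index. A small usage note on the response built above:

generated = response.text[0]  # "The capital of France is Paris."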

For multiple choice tasks:

response = ModelResponse(
    logprobs=[-0.5, -1.2, -2.1, -1.8],  # Logprobs for each choice
    argmax_logits_eq_gold=[False, False, False, False],  # Whether the argmax logits matched the gold text for each choice
    input_tokens=[1, 2, 3, 4],
    output_tokens=[[5], [6], [7], [8]]
)
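
As an illustrative sketch only (not lighteval's actual metric code), a multiple-choice accuracy can be read off these fields by taking the choice with the highest logprob and comparing it with a gold index (the index value below is an assumption for the sketch):

# Illustrative only: pick the highest-logprob choice and compare it with
# a known gold index (the value 2 here is assumed for this sketch).
gold_index = 2
predicted_index = max(range(len(response.logprobs)), key=lambda i: response.logprobs[i])
accuracy = float(predicted_index == gold_index)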

For perplexity calculation:

response = ModelResponse(
    text=["The model generated this text."],
    logprobs=[-1.2, -0.8, -1.5, -0.9, -1.1],  # Logprobs for each token
    input_tokens=[1, 2, 3, 4, 5],
    output_tokens=[[6], [7], [8], [9], [10]]
)
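
From such a response, token-level perplexity can be sketched as the exponential of the negative mean log probability (illustrative only, not lighteval's metric implementation):

import math

# Perplexity sketch over the per-token logprobs above.
perplexity = math.exp(-sum(response.logprobs) / len(response.logprobs))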

For PMI analysis:

response = ModelResponse(
    text=["The answer is 42."],
    logprobs=[-1.1, -0.9, -1.3, -0.7],  # Conditioned logprobs
    unconditioned_logprobs=[-2.1, -1.8, -2.3, -1.5],  # Unconditioned logprobs
    input_tokens=[1, 2, 3, 4],
    output_tokens=[[5], [6], [7], [8]]
)
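
A minimal sketch of PMI normalization with these fields: score the continuation by how much the context raises its log-likelihood relative to the unconditioned model (illustrative, not lighteval's metric code):

# PMI sketch: conditioned log-likelihood minus unconditioned log-likelihood.
pmi_score = sum(response.logprobs) - sum(response.unconditioned_logprobs)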

Notes:

  • For most evaluation tasks, only a subset of attributes is required
  • The text attribute is the most commonly used for generative tasks
  • logprobs are essential for probability-based metrics like perplexity
  • argmax_logits_eq_gold is specifically for certain multiple choice/classification tasks
  • Token-level attributes (input_tokens, output_tokens) are useful for debugging
  • Truncation and padding counts help with understanding model behavior on long inputs (see the sketch below)
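
A small debugging sketch, assuming any of the responses above, that reads the token bookkeeping fields (field names come from the dataclass; the printed messages are illustrative):

# Debugging sketch: inspect truncation/padding bookkeeping on a response.
if response.truncated_tokens_count > 0:
    print(f"Input truncated by {response.truncated_tokens_count} tokens")
print(f"Padded tokens: {response.padded_tokens_count}")
print(f"Prompt tokens: {len(response.input_tokens)}, generated sequences: {len(response.output_tokens)}")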