Lighteval documentation

EvaluationTracker

class lighteval.logging.evaluation_tracker.EvaluationTracker

( output_dir: str results_path_template: str | None = None save_details: bool = True push_to_hub: bool = False push_to_tensorboard: bool = False hub_results_org: str | None = '' tensorboard_metric_prefix: str = 'eval' public: bool = False nanotron_run_info: GeneralArgs = None use_wandb: bool = False )

Parameters

  • output_dir (str) — Local directory to save evaluation results and logs
  • results_path_template (str, optional) — Template for the results directory structure. Example: "{output_dir}/results/{org}_{model}"
  • save_details (bool, defaults to True) — Whether to save detailed evaluation records
  • push_to_hub (bool, defaults to False) — Whether to push results to HF Hub
  • push_to_tensorboard (bool, defaults to False) — Whether to push metrics to TensorBoard
  • hub_results_org (str, optional) — HF Hub organization to push results to
  • tensorboard_metric_prefix (str, defaults to “eval”) — Prefix for TensorBoard metrics
  • public (bool, defaults to False) — Whether to make Hub datasets public
  • nanotron_run_info (GeneralArgs, optional) — Nanotron model run information
  • use_wandb (bool, defaults to False) — Whether to log to Weights & Biases or Trackio if available

Tracks and manages evaluation results, metrics, and logging for model evaluations.

The EvaluationTracker coordinates multiple specialized loggers to track different aspects of model evaluation:

  • Details Logger (DetailsLogger): Records per-sample evaluation details and predictions
  • Metrics Logger (MetricsLogger): Tracks aggregate evaluation metrics and scores
  • Versions Logger (VersionsLogger): Records task and dataset versions
  • General Config Logger (GeneralConfigLogger): Stores overall evaluation configuration
  • Task Config Logger (TaskConfigLogger): Maintains per-task configuration details

The tracker can save results locally and optionally push them to:

  • Hugging Face Hub as datasets
  • TensorBoard for visualization
  • Trackio or Weights & Biases for experiment tracking

Example:

tracker = EvaluationTracker(
    output_dir="./eval_results",
    push_to_hub=True,
    hub_results_org="my-org",
    save_details=True
)

# Log evaluation results
tracker.metrics_logger.add_metric("accuracy", 0.85)
tracker.details_logger.add_detail(task_name="qa", prediction="Paris")

# Save all results
tracker.save()
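
A variant setup (a sketch; the flag values are illustrative, not defaults) that also pushes metrics to TensorBoard and logs to Weights & Biases or Trackio, using only the constructor parameters listed above:

tracker = EvaluationTracker(
    output_dir="./eval_results",
    push_to_tensorboard=True,
    tensorboard_metric_prefix="eval",
    use_wandb=True,
)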

generate_final_dict

( ) → dict

Returns

dict

Dictionary containing all experiment information including config, results, versions, and summaries

Aggregates and returns all the logger’s experiment information in a dictionary.

Call this function at the end of an evaluation run to gather and display the full set of experiment information.
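
A minimal sketch of persisting the returned dictionary at the end of a run (the output path is illustrative, and default=str is an assumption to cope with values that may not be JSON-serializable):

import json

# Aggregate config, results, versions, and summaries into one dictionary
final_dict = tracker.generate_final_dict()

# Dump the aggregated experiment information to disk
with open("./eval_results/final_results.json", "w") as f:
    json.dump(final_dict, f, indent=2, default=str)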

push_to_hub

( date_id: str details: dict results_dict: dict )

Pushes the experiment details (all of the model's predictions for every step) to the Hugging Face Hub.
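
A hedged sketch of a direct call; the date_id format and the contents of details and results_dict are placeholders rather than the exact structures lighteval produces, and in a normal run this method is expected to be driven by save() rather than called by hand:

from datetime import datetime

date_id = datetime.now().isoformat().replace(":", "-")  # timestamp used to version the upload (illustrative)
tracker.push_to_hub(
    date_id=date_id,
    details=details,            # per-sample predictions collected during the run (placeholder)
    results_dict=results_dict,  # aggregated results, e.g. from generate_final_dict() (assumption)
)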

recreate_metadata_card

( repo_id: str )

Parameters

  • repo_id (str) — Details dataset repository path on the hub (org/dataset)

Fully updates the details repository's metadata card for the currently evaluated model.
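
For example, to rebuild the metadata card of an existing details dataset (the repository name is hypothetical):

tracker.recreate_metadata_card(repo_id="my-org/details_my-model")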

save

( )

Saves the experiment information and results to files, and to the hub if requested.
