EvaluationTracker
class lighteval.logging.evaluation_tracker.EvaluationTracker
< source >( output_dir: str results_path_template: str | None = None save_details: bool = True push_to_hub: bool = False push_to_tensorboard: bool = False hub_results_org: str | None = '' tensorboard_metric_prefix: str = 'eval' public: bool = False nanotron_run_info: GeneralArgs = None use_wandb: bool = False )
Parameters
- output_dir (str) — Local directory to save evaluation results and logs
- results_path_template (str, optional) — Template for the results directory structure, e.g. "{output_dir}/results/{org}_{model}" (see the sketch after this list)
- save_details (bool, defaults to True) — Whether to save detailed evaluation records
- push_to_hub (bool, defaults to False) — Whether to push results to HF Hub
- push_to_tensorboard (bool, defaults to False) — Whether to push metrics to TensorBoard
- hub_results_org (str, optional) — HF Hub organization to push results to
- tensorboard_metric_prefix (str, defaults to “eval”) — Prefix for TensorBoard metrics
- public (bool, defaults to False) — Whether to make Hub datasets public
- nanotron_run_info (GeneralArgs, optional) — Nanotron model run information
- use_wandb (bool, defaults to False) — Whether to log to Weights & Biases or Trackio if available
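For illustration, a minimal sketch of how the results_path_template placeholders might resolve, assuming the template is filled via str.format with the output_dir, org, and model keys shown in the example above (the actual substitution is handled internally by the tracker):

# Hypothetical illustration of the results_path_template placeholders;
# the real substitution happens inside EvaluationTracker.
template = "{output_dir}/results/{org}_{model}"
print(template.format(output_dir="./eval_results", org="my-org", model="my-model"))
# ./eval_results/results/my-org_my-model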
Tracks and manages evaluation results, metrics, and logging for model evaluations.
The EvaluationTracker coordinates multiple specialized loggers to track different aspects of model evaluation (see the sketch after this list):
- Details Logger (DetailsLogger): Records per-sample evaluation details and predictions
- Metrics Logger (MetricsLogger): Tracks aggregate evaluation metrics and scores
- Versions Logger (VersionsLogger): Records task and dataset versions
- General Config Logger (GeneralConfigLogger): Stores overall evaluation configuration
- Task Config Logger (TaskConfigLogger): Maintains per-task configuration details
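A minimal sketch of how these loggers are exposed on the tracker. The metrics_logger and details_logger attributes appear in the usage example below; the other three attribute names are assumed to follow the same snake_case convention:

from lighteval.logging.evaluation_tracker import EvaluationTracker

tracker = EvaluationTracker(output_dir="./eval_results")

tracker.metrics_logger         # aggregate metrics and scores
tracker.details_logger         # per-sample predictions and details
tracker.versions_logger        # task and dataset versions (assumed attribute name)
tracker.general_config_logger  # overall evaluation configuration (assumed attribute name)
tracker.task_config_logger     # per-task configuration (assumed attribute name)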
The tracker can save results locally and optionally push them to:
- Hugging Face Hub as datasets
- TensorBoard for visualization
- Trackio or Weights & Biases for experiment tracking
Example:
tracker = EvaluationTracker(
    output_dir="./eval_results",
    push_to_hub=True,
    hub_results_org="my-org",
    save_details=True,
)

# Log evaluation results
tracker.metrics_logger.add_metric("accuracy", 0.85)
tracker.details_logger.add_detail(task_name="qa", prediction="Paris")

# Save all results
tracker.save()
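A variant of the example above, sketching the TensorBoard and Weights & Biases/Trackio options instead of a Hub push (parameter names as listed above; the prefix value is illustrative):

# Sketch: log metrics to TensorBoard and to W&B (or Trackio if available)
# instead of pushing results to the Hub.
tracker = EvaluationTracker(
    output_dir="./eval_results",
    push_to_tensorboard=True,
    tensorboard_metric_prefix="eval",
    use_wandb=True,
)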
generate_final_dict
< source >( ) → dict
Returns
dict
Dictionary containing all experiment information including config, results, versions, and summaries
Aggregates and returns all the loggers' experiment information in a dictionary.
Use this function to gather and display that information at the end of an evaluation run.
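A hedged usage sketch: dump the aggregated run information to a JSON file (the file name is illustrative):

import json

final_dict = tracker.generate_final_dict()  # config, results, versions, summaries
with open("eval_summary.json", "w") as f:
    json.dump(final_dict, f, indent=2, default=str)  # default=str guards non-JSON-serializable values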
push_to_hub
Pushes the experiment details (all the model predictions for every step) to the hub.
recreate_metadata_card
< source >( repo_id: str )
Fully updates the metadata card of the details repository for the currently evaluated model.
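A hedged usage sketch; the repo_id below is purely illustrative and would normally be the details dataset repository created by an earlier push:

# Sketch: rebuild the metadata card of an existing details repository.
# "my-org/details_my-model" is a hypothetical repo_id, not a real repository.
tracker.recreate_metadata_card(repo_id="my-org/details_my-model")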
save
< source >( )
Saves the experiment information and results to files, and to the hub if requested.
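Typical end-of-run usage, as in the class example above:

# Sketch: write results and details locally; with push_to_hub=True at construction
# time, this also uploads them to the configured Hub organization.
tracker.save()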