Saving and reading results
Saving results locally
Lighteval will automatically save results and evaluation details in the directory set with the --output-dir option. The results will be saved in {output_dir}/results/{model_name}/results_{timestamp}.json; an example of a result file is shown at the end of this page. The output path can be any fsspec-compliant path (local, s3, hf hub, gdrive, ftp, etc.).
To save the details of the evaluation, use the --save-details option. The details will be saved in a parquet file at {output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet.
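For example, once an evaluation has finished, you can read back the most recent results file with plain Python. This is only a small sketch based on the directory layout described above; the model name is illustrative:
import glob
import json

output_dir = "evals_doc"
model_name = "HuggingFaceH4/zephyr-7b-beta"

# Results are written to {output_dir}/results/{model_name}/results_{timestamp}.json,
# so the most recent run is the last file in sorted order.
result_files = sorted(glob.glob(f"{output_dir}/results/{model_name}/results_*.json"))
with open(result_files[-1]) as f:
    results = json.load(f)

print(results["results"])  # per-task and aggregated scores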
If you want results to be saved in a custom path, you can set the --results-path-template option. This lets you provide a string template for the path. The template needs to contain the following variables: output_dir, org, model. For example: {output_dir}/{org}_{model}. The template will be used to build the path of the results file.
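As a rough illustration (not lighteval's internal code), such a template behaves like a standard Python format string; the keys below mirror the example template above:
# Hypothetical illustration of how a results path template resolves.
template = "{output_dir}/{org}_{model}"
results_path = template.format(
    output_dir="evals_doc",       # value of --output-dir
    org="HuggingFaceH4",          # organization of the evaluated model
    model="zephyr-7b-beta",       # name of the evaluated model
)
print(results_path)  # evals_doc/HuggingFaceH4_zephyr-7b-beta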
Pushing results to the HuggingFace hub
You can push the results and evaluation details to the HuggingFace hub. To do so, you need to set the --push-to-hub option as well as the --results-org option. The results will be saved in a dataset named {results_org}/{model_org}/{model_name}. To push the details, you also need to set the --save-details option.
The dataset created will be private by default; you can make it public by setting the --public-run option.
Pushing results to Tensorboard
You can push the results to Tensorboard by setting the --push-to-tensorboard option. This will create a Tensorboard dashboard in an HF org set with the --results-org option.
Pushing results to WandB
You can push the results to WandB by setting the --wandb option. This will initialize a WandB run and log the results.
WandB arguments need to be set in your environment variables, for example:
export WANDB_PROJECT="lighteval"
You can find a list of these variables in the wandb documentation.
How to load and investigate details
Load from local detail files
from datasets import load_dataset
import glob

output_dir = "evals_doc"
model_name = "HuggingFaceH4/zephyr-7b-beta"
timestamp = "latest"
task = "lighteval|gsm8k|0"

if timestamp == "latest":
    # Find the most recent evaluation run for this model
    path = f"{output_dir}/details/{model_name}/*/"
    timestamps = glob.glob(path)
    timestamp = sorted(timestamps)[-1].split("/")[-2]
    print(f"Latest timestamp: {timestamp}")

details_path = f"{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet"

# Load the details
details = load_dataset("parquet", data_files=details_path, split="train")

for detail in details:
    print(detail)
Load from the HuggingFace hub
from datasets import load_dataset

results_org = "SaylorTwift"
model_name = "HuggingFaceH4/zephyr-7b-beta"
sanitized_model_name = model_name.replace("/", "__")
task = "lighteval|gsm8k|0"
public_run = False

dataset_path = f"{results_org}/details_{sanitized_model_name}{'_private' if not public_run else ''}"
details = load_dataset(dataset_path, task.replace("|", "_"), split="latest")

for detail in details:
    print(detail)
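Once loaded (locally or from the hub), the details are a regular datasets.Dataset, so you can, for instance, convert them to a pandas DataFrame for aggregate analysis; the column names used here are described below:
# Convert the details to a DataFrame for easier filtering and inspection
df = details.to_pandas()
print(df[["full_prompt", "predictions", "gold", "metrics"]].head())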
The detail file contains the following columns:
choices: The choices presented to the model in the case of multiple-choice tasks.
gold: The gold answer.
gold_index: The index of the gold answer in the choices list.
cont_tokens: The continuation tokens.
example: The input in text form.
full_prompt: The full prompt that will be input to the model.
input_tokens: The tokens of the full prompt.
instruction: The instruction given to the model.
metrics: The metrics computed for the example.
num_asked_few_shots: The number of few shots asked of the model.
num_effective_few_shots: The number of effective few shots.
padded: Whether the input was padded.
pred_logits: The logits of the model.
predictions: The predictions of the model.
specifics: The task-specific fields.
truncated: Whether the input was truncated.
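For example, you can inspect individual examples through these columns (a minimal sketch, reusing the details dataset loaded above):
# Print a few examples with their prompt, gold answer, prediction and metrics
for detail in details.select(range(3)):
    print("PROMPT:", detail["full_prompt"][:200])
    print("GOLD:", detail["gold"])
    print("PREDICTION:", detail["predictions"])
    print("METRICS:", detail["metrics"])
    print("-" * 40)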
Example of a result file
{
"config_general": {
"lighteval_sha": "203045a8431bc9b77245c9998e05fc54509ea07f",
"num_fewshot_seeds": 1,
"override_batch_size": 1,
"max_samples": 1,
"job_id": "",
"start_time": 620979.879320166,
"end_time": 621004.632108041,
"total_evaluation_time_secondes": "24.752787875011563",
"model_name": "gpt2",
"model_sha": "607a30d783dfa663caf39e06633721c8d4cfcd7e",
"model_dtype": null,
"model_size": "476.2 MB"
},
"results": {
"lighteval|gsm8k|0": {
"qem": 0.0,
"qem_stderr": 0.0,
"maj@8": 0.0,
"maj@8_stderr": 0.0
},
"all": {
"qem": 0.0,
"qem_stderr": 0.0,
"maj@8": 0.0,
"maj@8_stderr": 0.0
}
},
"versions": {
"lighteval|gsm8k|0": 0
},
"config_tasks": {
"lighteval|gsm8k": {
"name": "gsm8k",
"prompt_function": "gsm8k",
"hf_repo": "gsm8k",
"hf_subset": "main",
"metric": [
{
"metric_name": "qem",
"higher_is_better": true,
"category": "3",
"use_case": "5",
"sample_level_fn": "compute",
"corpus_level_fn": "mean"
},
{
"metric_name": "maj@8",
"higher_is_better": true,
"category": "5",
"use_case": "5",
"sample_level_fn": "compute",
"corpus_level_fn": "mean"
}
],
"hf_avail_splits": [
"train",
"test"
],
"evaluation_splits": [
"test"
],
"few_shots_split": null,
"few_shots_select": "random_sampling_from_train",
"generation_size": 256,
"generation_grammar": null,
"stop_sequence": [
"Question="
],
"num_samples": null,
"suite": [
"lighteval"
],
"original_num_docs": 1319,
"effective_num_docs": 1,
"trust_dataset": true,
"must_remove_duplicate_docs": null,
"version": 0
}
},
"summary_tasks": {
"lighteval|gsm8k|0": {
"hashes": {
"hash_examples": "8517d5bf7e880086",
"hash_full_prompts": "8517d5bf7e880086",
"hash_input_tokens": "29916e7afe5cb51d",
"hash_cont_tokens": "37f91ce23ef6d435"
},
"truncated": 2,
"non_truncated": 0,
"padded": 0,
"non_padded": 2,
"effective_few_shots": 0.0,
"num_truncated_few_shots": 0
}
},
"summary_general": {
"hashes": {
"hash_examples": "5f383c395f01096e",
"hash_full_prompts": "5f383c395f01096e",
"hash_input_tokens": "ac933feb14f96d7b",
"hash_cont_tokens": "9d03fb26f8da7277"
},
"truncated": 2,
"non_truncated": 0,
"padded": 0,
"non_padded": 2,
"num_truncated_few_shots": 0
}
}
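Once such a file has been parsed (for example with json.load, as in the snippet under "Saving results locally"), the scores can be read from the top-level sections shown above; a minimal sketch:
# `results` is the parsed JSON from the example above
print(results["config_general"]["model_name"])          # "gpt2"
print(results["results"]["lighteval|gsm8k|0"]["qem"])   # per-task score
print(results["results"]["all"])                        # aggregate over all evaluated tasks
print(results["versions"])                              # task versions used for this run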