Question about recorded evaluation time

#367
by geoalgo - opened

Hi,

I have a question regarding the field "evaluation_time" present in the evaluation data.
In some cases, it has a reasonable value (e.g. a few hours for Pythia 6.9B: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/EleutherAI/pythia-6.9b-deduped/results_2023-10-22T01-47-10.144336.json#L100).

But in other cases, it is 0 (for instance https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/01-ai/Yi-34B/results_2023-11-02T14-47-02.861015.json#L1365).

Could you tell me how this field is measured and why the value is sometimes zero? (The runtime information would be very useful for post-hoc analysis.)
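
For context, the kind of post-hoc inspection I have in mind is sketched below; I search for the key rather than hard-coding a path, since I'm not sure the field sits in the same place in every file:

```python
import json
from huggingface_hub import hf_hub_download

def find_evaluation_times(obj, path=""):
    """Recursively collect every key containing 'evaluation_time' (key names may vary between files)."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if "evaluation_time" in key:
                found.append((f"{path}/{key}", value))
            found.extend(find_evaluation_times(value, f"{path}/{key}"))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            found.extend(find_evaluation_times(value, f"{path}/{i}"))
    return found

# One of the result files linked above, fetched from the results dataset.
local_path = hf_hub_download(
    repo_id="open-llm-leaderboard/results",
    repo_type="dataset",
    filename="EleutherAI/pythia-6.9b-deduped/results_2023-10-22T01-47-10.144336.json",
)
with open(local_path) as f:
    results = json.load(f)

for key_path, value in find_evaluation_times(results):
    print(key_path, value)
```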

Many thanks!

Open LLM Leaderboard org

Hi!
We use an end_time - start_time difference, both simply logged with time.time() (at the start and the end ^^), plus some nice formatting.
I have no idea why it is zero in some cases, but this is definitely a bug! Did you see it occur for more models?
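
Concretely, it boils down to something like this (the formatting part below is only illustrative, not our exact code):

```python
import time
import datetime

def run_evaluation():
    """Placeholder for the actual evaluation run."""
    time.sleep(1)

start_time = time.time()   # logged at the start
run_evaluation()
end_time = time.time()     # logged at the end

total_evaluation_time = end_time - start_time
# The "nice formatting" here is just an example (H:MM:SS):
print(f"total_evaluation_time: {datetime.timedelta(seconds=int(total_evaluation_time))}")
```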

Open LLM Leaderboard org

It is a bug! I fixed it, so you should start seeing the evaluation time again for models currently being evaluated. I can't, however, add the time back for the 100 or so models that were affected.

SaylorTwift changed discussion status to closed

Thanks for fixing the issue.

Could you provide details on how the runtime is measured? Does it include dataset download time / preprocessing, for instance, or just the evaluation itself?

Is the code/script that calls lm-evaluation-harness available somewhere? (If not, do you have plans to open-source it?)
It would be very useful for making the benchmark even more transparent (I understand that the scheduling logic should stay private, but the launching script could be shared).

Open LLM Leaderboard org

Hi @geoalgo ,

It includes everything: downloading the model, loading the dataset into memory, the computation itself running on 8 GPUs in parallel, and the actual evaluation. A very conservative estimate is that GPU time accounts for about 90% of the overall time (in practice it's more than that).

The launching script (with our logging + the parallelization code) is not available yet, but yes, we plan on making it available in the future. We'll make sure to communicate about it once it's done.

It seems that the code change ended up modifying the schema of the results: for instance, grendel now has total_evaluation_time_secondes stored in config_general, see https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/openaccess-ai-collective/grendel/results_2023-11-19T14-02-28.206445.json#L11

This breaks the previous results schema; compare, for instance, https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/01-ai/Yi-34B/results_2023-11-02T14-47-02.861015.json#L1365
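
In the meantime, for anyone doing post-hoc analysis on these files, a small helper along these lines can tolerate both layouts (the fallback for the older files is a guess on my part, since I'm not sure where the field lives in every old file):

```python
def total_evaluation_time(results: dict):
    """Best-effort lookup of the total evaluation time across both schema versions.

    Returns the raw stored value (it may be a number or a formatted string,
    depending on the file), or None if nothing is found.
    """
    # Newer schema (e.g. the grendel file linked above): under config_general.
    general = results.get("config_general", {})
    if "total_evaluation_time_secondes" in general:
        return general["total_evaluation_time_secondes"]
    # Older files: fall back to scanning top-level sections for any key
    # containing "evaluation_time" (assumed, not guaranteed to cover every file).
    for section in results.values():
        if isinstance(section, dict):
            for key, value in section.items():
                if "evaluation_time" in key:
                    return value
    return None
```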
