Question about recorded evaluation time

#367
by geoalgo - opened

Hi,

I have a question regarding the field "evaluation_time" present in the evaluation data.
In some cases, it has a reasonable value (e.g. a few hours for Pythia 6.9B: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/EleutherAI/pythia-6.9b-deduped/results_2023-10-22T01-47-10.144336.json#L100).

But in other cases, it is 0 (for instance https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/01-ai/Yi-34B/results_2023-11-02T14-47-02.861015.json#L1365).

Could you tell me how this field is measured and why the value is sometimes zero? (The runtime information would be very useful for post-hoc analysis.)
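
For context, the kind of post-hoc inspection I have in mind is sketched below; I search for the key rather than hard-coding a path, since I'm not sure the field sits in the same place in every file:

```python
import json
from huggingface_hub import hf_hub_download

def find_evaluation_times(obj, path=""):
    """Recursively collect every key containing 'evaluation_time' (key names may vary between files)."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if "evaluation_time" in key:
                found.append((f"{path}/{key}", value))
            found.extend(find_evaluation_times(value, f"{path}/{key}"))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            found.extend(find_evaluation_times(value, f"{path}/{i}"))
    return found

# One of the result files linked above, fetched from the results dataset.
local_path = hf_hub_download(
    repo_id="open-llm-leaderboard/results",
    repo_type="dataset",
    filename="EleutherAI/pythia-6.9b-deduped/results_2023-10-22T01-47-10.144336.json",
)
with open(local_path) as f:
    results = json.load(f)

for key_path, value in find_evaluation_times(results):
    print(key_path, value)
```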

Many thanks!

Open LLM Leaderboard org

Hi!
We use an end_time - start_time difference, both simply logged with time.time() (at the start and the end ^^), plus some nice formatting.
I have no idea why it is zero in some cases, but this is definitely a bug! Did you see it occur for more models?
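
Concretely, it boils down to something like this (the formatting part below is only illustrative, not our exact code):

```python
import time
import datetime

def run_evaluation():
    """Placeholder for the actual evaluation run."""
    time.sleep(1)

start_time = time.time()   # logged at the start
run_evaluation()
end_time = time.time()     # logged at the end

total_evaluation_time = end_time - start_time
# The "nice formatting" here is just an example (H:MM:SS):
print(f"total_evaluation_time: {datetime.timedelta(seconds=int(total_evaluation_time))}")
```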

Open LLM Leaderboard org

It is a bug! I fixed it, so you should start seeing the evaluation time again for models currently being evaluated. I can't, however, add the time back for the 100 or so models that were affected.

SaylorTwift changed discussion status to closed

Thanks for fixing the issue.

Could you provide details on how the runtime is measured? Does it include dataset download time / preprocessing, for instance, or just the evaluation itself?

Is the code/script that calls lm-evaluation-harness available somewhere? (If not, do you have plans to open-source it?)
It would be very useful for making the benchmark even more transparent (I understand that the scheduling logic should stay private, but the launching script could be shared).

Open LLM Leaderboard org

Hi @geoalgo ,

It includes everything: downloading the model, loading the dataset into memory, the computation itself running on 8 GPUs in parallel, and the actual evaluation. A very conservative estimate is that GPU time accounts for about 90% of the overall time (in practice it's more than that).

The launching script (with our logging + the parallelization code) is not available yet, but yes, we plan on making it available in the future. We'll make sure to communicate about it once it's done.

It seems that the code change ended up modifying the schema of the results: for instance, grendel now has total_evaluation_time_secondes stored in config_general, see https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/openaccess-ai-collective/grendel/results_2023-11-19T14-02-28.206445.json#L11

This breaks the previous results schema; compare, for instance, https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/01-ai/Yi-34B/results_2023-11-02T14-47-02.861015.json#L1365
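
In the meantime, for anyone doing post-hoc analysis on these files, a small helper along these lines can tolerate both layouts (the fallback for the older files is a guess on my part, since I'm not sure where the field lives in every old file):

```python
def total_evaluation_time(results: dict):
    """Best-effort lookup of the total evaluation time across both schema versions.

    Returns the raw stored value (it may be a number or a formatted string,
    depending on the file), or None if nothing is found.
    """
    # Newer schema (e.g. the grendel file linked above): under config_general.
    general = results.get("config_general", {})
    if "total_evaluation_time_secondes" in general:
        return general["total_evaluation_time_secondes"]
    # Older files: fall back to scanning top-level sections for any key
    # containing "evaluation_time" (assumed, not guaranteed to cover every file).
    for section in results.values():
        if isinstance(section, dict):
            for key, value in section.items():
                if "evaluation_time" in key:
                    return value
    return None
```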
