Reproducibility error
Hi,
I used the command from your FAQ to run the evaluation myself. After 13 hours of "Running loglikelihood requests", it failed with this error:
Running generate_until requests: 0%| | 0/1865 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/scratch-scc/users/u12246/environments/openllm_env/bin/lm-eval", line 8, in
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/main.py", line 382, in cli_evaluate
results = evaluator.simple_evaluate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/evaluator.py", line 296, in simple_evaluate
results = evaluate(
^^^^^^^^^
File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/evaluator.py", line 468, in evaluate
resps = getattr(lm, reqtype)(cloned_reqs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/models/huggingface.py", line 1326, in generate_until
chunks = re_ords.get_batched(
^^^^^^^^^^^^^^^^^^^^
TypeError: Collator.get_batched() got an unexpected keyword argument 'reset_batch_fn'
Also some INFO:
[__init__.py:512] The tag xnli is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
[task.py:337] [Task: leaderboard_musr_team_allocation] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
And WARNING:
[task.py:337] [Task: leaderboard_musr_object_placements] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[task.py:337] [Task: leaderboard_musr_murder_mysteries] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[task.py:337] [Task: leaderboard_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
Do you have a fix? And what would be the downside of using EleutherAI/lm-evaluation-harness instead?
Hi @cluebbers,
Let me try to help you! Could you please provide the exact command you used, what model you are trying to evaluate, and what hardware you are using?
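In the meantime, that TypeError typically indicates a version mismatch inside your environment: the `generate_until` code in `lm_eval/models/huggingface.py` is passing a `reset_batch_fn` keyword that the installed `Collator` does not accept, which can happen when an older installed copy of `lm_eval` shadows (or is shadowed by) your local checkout. Below is a minimal diagnostic sketch, assuming `lm_eval` is importable in the environment you used for the run; the import path for `Collator` differs between harness versions, so it is an assumption that one of the two locations tried matches yours:

```python
# Hedged diagnostic sketch: check which copy of lm_eval is actually loaded
# and whether its Collator.get_batched accepts the 'reset_batch_fn' keyword
# that huggingface.py passes in the traceback above.
import inspect

import lm_eval

# The Collator class has lived in different modules across harness versions;
# trying both locations here is an assumption, adjust to your checkout.
try:
    from lm_eval.models.utils import Collator
except ImportError:
    from lm_eval.utils import Collator

print("lm_eval loaded from:", lm_eval.__file__)
sig = inspect.signature(Collator.get_batched)
print("get_batched signature:", sig)
print("accepts reset_batch_fn:", "reset_batch_fn" in sig.parameters)
```

If it prints False, or `lm_eval` loads from a different path than the repository you cloned, reinstalling the harness into that environment (e.g. `pip install -e .` from the cloned repository) should bring the caller and the `Collator` back in sync.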