Reproducibility error

#1020
by cluebbers - opened

Hi,
I used the command from your FAQ to run the evaluation for myself.
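For context, the FAQ command boils down to the simple_evaluate call that shows up in the traceback below. A rough Python-API sketch of what I ran (the model name, dtype and the "leaderboard" task group are placeholders of my own here, not the literal FAQ command):

# Rough, hypothetical Python-API equivalent of the CLI run
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                            # transformers backend
    model_args="pretrained=<model_name>,dtype=bfloat16",   # placeholder model
    tasks=["leaderboard"],                                 # leaderboard task group
    batch_size="auto",
)
print(results["results"])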

After 13 hours of "Running loglikelihood requests", it ran into this error:
Running generate_until requests:   0%|          | 0/1865 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/scratch-scc/users/u12246/environments/openllm_env/bin/lm-eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/evaluator.py", line 296, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/evaluator.py", line 468, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/models/huggingface.py", line 1326, in generate_until
    chunks = re_ords.get_batched(
             ^^^^^^^^^^^^^^^^^^^^
TypeError: Collator.get_batched() got an unexpected keyword argument 'reset_batch_fn'
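
For what it's worth, the TypeError says that models/huggingface.py passes a reset_batch_fn keyword that the Collator in my checkout does not accept, which looks like a version mismatch somewhere. A quick, purely hypothetical sanity check to see which lm_eval is actually imported and what signature its get_batched has:

# Hypothetical sanity check, not an official diagnostic
import inspect
import lm_eval
from lm_eval.models import huggingface  # module from the traceback's last frame

print(lm_eval.__file__)  # which lm_eval installation is actually being imported
# Signature of the Collator used by generate_until; if 'reset_batch_fn' is
# missing here, the Collator is older than the call site in huggingface.py.
print(inspect.signature(huggingface.Collator.get_batched))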

There were also some INFO messages:
[__init__.py:512] The tag xnli is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
[task.py:337] [Task: leaderboard_musr_team_allocation] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
And some WARNING messages:
[task.py:337] [Task: leaderboard_musr_object_placements] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[task.py:337] [Task: leaderboard_musr_murder_mysteries] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[task.py:337] [Task: leaderboard_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.

Do you have a fix? Alternatively, would there be any downside to just using the upstream EleutherAI/lm-evaluation-harness instead?

Open LLM Leaderboard org

Hi @cluebbers,

Let me try to help you! Could you please provide the exact command you used, what model you are trying to evaluate, and what hardware you are using?
