Model evaluation failure: request for additional details
Hi,
I am looking for some additional info on why this model failed:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/abacusai/Giraffe-beta-13b-32k_eval_request_False_float16_Original.json
I was able to locally run:
python main.py --model=hf-causal --model_args="pretrained=abacusai/Giraffe-beta-13b-32k,revision=main" --tasks=arc_challenge --num_fewshot=5 --batch_size=1 --output_path=/tmp/harness-test
So I don't think there was an issue with loading the model. I was, however, unable to use the use_accelerate=True flag. Adding the flag did not cause a model-loading issue, but it did fail with this error (I do believe I have the right branch):
Traceback (most recent call last):
File "/root/lm-evaluation-harness/main.py", line 93, in <module>
main()
File "/root/lm-evaluation-harness/main.py", line 59, in main
results = evaluator.simple_evaluate(
File "/root/lm-evaluation-harness/lm_eval/utils.py", line 243, in _wrapper
return fn(*args, **kwargs)
File "/root/lm-evaluation-harness/lm_eval/evaluator.py", line 72, in simple_evaluate
lm = lm_eval.models.get_model(model).create_from_arg_string(
File "/root/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
return cls(**args, **args2)
TypeError: HFLM.__init__() got an unexpected keyword argument 'use_accelerate'
If you could give me any additional info, I would like to fix up the model config and try submitting again. The model itself is based on Llama 2, so there should not be anything unusual about the code. We would like to follow up with a new version that has customizations to support injected soft prompt tokens, but before we figure that out it seems like we should first get a simpler instruct-tuned model through the harness.
If I switch to --model hf-causal-experimental, then I am able to pass the use_accelerate=True flag.
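For completeness, the command that works for me looks roughly like this; it is just the command above with the model type swapped to hf-causal-experimental and use_accelerate=True passed through --model_args (same task, revision, and batch size as before):
python main.py --model=hf-causal-experimental --model_args="pretrained=abacusai/Giraffe-beta-13b-32k,revision=main,use_accelerate=True" --tasks=arc_challenge --num_fewshot=5 --batch_size=1 --output_path=/tmp/harness-test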
Hi!
We have a new system to run our evaluations on the HF cluster, where the leaderboard evaluations get cancelled automatically if a higher-priority job needs resources.
The jobs get relaunched automatically in the end, but they are displayed as failed in the meantime. I just checked, and your run was cancelled but not relaunched yet; it will be relaunched automatically when we have enough compute available :)
Phew! Thank you. By the way, are the instructions wrong regarding the model type (hf-causal-experimental + use_accelerate)? Perhaps worth updating?
This model is again marked as failed:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/abacusai/Giraffe-beta-13b-32k_eval_request_False_float16_Original.json
I was able to run the eval command line on our machines and generate a valid result for arc_challenge. Any additional information on the failure on your cluster?
Hi!
It's a node failure, I'll restart it!
Unfortunately, this seemed to fail again. I will try re-running the full suite locally to confirm, but if you have any info on the failure it would be very helpful. Other than the context-length change, this should be very similar to any other Llama 2-based model, so I am surprised by the failure. Should I try resubmitting the model with a shorter max context length?
FYI: we trained on H100s, but I verified execution on an A100 80G machine. Should I test with a 40G memory limit instead?
Hi, we had an issue while downloading the model. The problem appears to be on our end; your model is back in the queue. If you notice it fails again, don't hesitate to warn us! :)
Closing, feel free to reopen if needed :)
Seems pretty jinxed :)
I think it is marked as failed again. In the background, we have gone ahead and run the whole suite on our machines and it seems to work fine. We are running on 80G GPUs; not sure if that is the issue. Let me know if you think we should update the model config to indicate a shorter context. The model has been tested up to roughly 32K context.
It's quite weird; we got a download issue again. I'll try it a third time, but if it's still not working we'll have to assume there is an issue with the download of your model.
Logs:
Traceback (most recent call last):
File "...lib/python3.10/site-packages/requests/models.py", line 822, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cdn-lfs-us-1.huggingface.co', port=443): Read timed out.
Downloading (…)of-00003.safetensors: 79%|███████▊ | 7.81G/9.95G [00:50<00:13, 156MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492624 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492625 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492626 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492627 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492628 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492629 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492630 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2492623) of binary: python
We are also running our experiments on 80G GPUs. I don't think it's a context problem, since the model is not even launched ^^
Thanks for the logs. We have successfully re-downloaded the model from HF for testing. I wonder if we somehow ended up on a flaky LFS node. If this continues to be an issue, perhaps we will try re-uploading the files in the hope that they end up on a different node?
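In case it helps narrow things down, here is a minimal sketch of how the download can be re-verified locally with huggingface_hub; the repo id is the model above, and force_download is just there to bypass any cached copy so the safetensors shards are pulled fresh from the CDN:
from huggingface_hub import snapshot_download

# Pull every file of the model repo; force_download=True re-fetches the
# shards instead of reusing a possibly stale or corrupted cache copy.
local_path = snapshot_download(
    repo_id="abacusai/Giraffe-beta-13b-32k",
    revision="main",
    force_download=True,
)
print(f"Model files downloaded to {local_path}")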
Hi @siddartha-abacus , can I close this issue? Has your model been properly evaluated?
Yes, sorry, I did not realize it was not closed.
No problem, thank you for closing :)