Models Failure

#374
by Weyaxi - opened

Hi, it seems like the following models have failed. I would greatly appreciate it if you could let me know what is wrong with the models or relaunch them if there is nothing wrong. Thanks, and have a nice day.

Weyaxi changed discussion title from Request the relaunch models to Models Failure
Hugging Face H4 org

Hi!
We have a new system to run our evaluations on the HF cluster, where leaderboard evaluations get cancelled automatically if a higher-priority job needs resources.
The jobs get relaunched automatically later, but they are displayed as failed in the meantime. We'll try to improve our logging asap!

Side note: for model failure issues, we require users to point to the correct request file for each model, so we can access the relevant leaderboard logs faster (see this issue for a very good example).
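For anyone collecting request files for a report, here is a minimal sketch of how a request file name can be split into its parts. The pattern is inferred only from the file names quoted in this thread (for example `Weyaxi/OpenOrca-Zephyr-7B_eval_request_False_bfloat16_Original.json`), and the field names `private`, `precision` and `weight_type` are assumptions about what each segment means, not a documented format. `parse_request_file` is a hypothetical helper, not part of the leaderboard code.

```python
import re

# Assumed layout: <org>/<model>_eval_request_<private>_<precision>_<weight_type>.json
# The field names are guesses based on the file names quoted in this thread.
REQUEST_FILE_RE = re.compile(
    r"^(?P<model>.+)_eval_request_"
    r"(?P<private>True|False)_"
    r"(?P<precision>[^_]+)_"
    r"(?P<weight_type>[^_]+)\.json$"
)


def parse_request_file(path):
    """Split a request file name into its parts, or return None if it does not match."""
    match = REQUEST_FILE_RE.match(path)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the textual flag into a real boolean.
    fields["private"] = fields["private"] == "True"
    return fields


if __name__ == "__main__":
    print(parse_request_file(
        "Weyaxi/OpenOrca-Zephyr-7B_eval_request_False_bfloat16_Original.json"
    ))
```

Listing the exact file names, as done later in this thread, is what lets us find the matching logs quickly.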

Hi, thanks for your interest and explanation. Have a nice day.

Weyaxi changed discussion status to closed
Hugging Face H4 org

No problem! :)
(However, if you see that the models are still shown as failed in a week, feel free to ping us and we'll investigate in more detail to see if something else happened)

Weyaxi changed discussion status to open
  • Some models were cancelled and never relaunched; I'm adding them back to the queue:
    Weyaxi/Dolphin-Nebula-7B_eval_request_False_float16_Original.json
    Weyaxi/OpenHermes-2.5-Nebula-v2-7B_eval_request_False_float16_Original.json
    Weyaxi/OpenOrca-Zephyr-7B_eval_request_False_bfloat16_Original.json
    Weyaxi/SynthIA-v1.3-Nebula-v2-7B_eval_request_False_float16_Original.json
    PulsarAI/Nebula-v2-7B_eval_request_False_float16_Original.json

  • This model crashed because of a node failure, adding back too:
    Weyaxi/zephyr-beta-Nebula-v2-7B_eval_request_False_float16_Original.json

  • I think this one was started before we updated the backend's transformers to a version that supports Mistral models (ping @SaylorTwift, can you check the transformers version in the backend?):
    Weyaxi/CollectiveCognition-v1.1-Nebula-7B_eval_request_False_float16_Original.json

  • Lastly, this model is faulty:
    Weyaxi/Mistral-11B-OpenOrcaPlatypus_eval_request_False_bfloat16_Original.json
    It failed with:

  File ".../python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File ".../python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File ".../python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File ".../lib/python3.10/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File ".../lib/python3.10/site-packages/accelerate/utils/modeling.py", line 281, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32002, 4096]) in "weight" (which has shape torch.Size([32000, 4096])), this look incorrect.
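The error above says the checkpoint holds a vocabulary-sized tensor with 32002 rows while the model was built with 32000. A common cause, though not confirmed by these logs, is that tokens were added to the tokenizer and embeddings without updating `vocab_size` in the model config. Below is a minimal sketch of a pre-submission check. `find_vocab_mismatches` is a hypothetical helper, and the tensor name suffixes `embed_tokens.weight` and `lm_head.weight` are assumptions based on the usual Llama/Mistral-style naming, since the traceback only reports the parameter as "weight".

```python
def find_vocab_mismatches(config_vocab_size, checkpoint_shapes):
    """Return (tensor_name, rows) for vocab-sized tensors that disagree with the config.

    config_vocab_size: the `vocab_size` value read from the model config.
    checkpoint_shapes: mapping of tensor name -> shape tuple, as read from the checkpoint.
    """
    # Assumed names of the tensors whose first dimension is the vocabulary size.
    vocab_suffixes = ("embed_tokens.weight", "lm_head.weight")
    mismatches = []
    for name, shape in checkpoint_shapes.items():
        if name.endswith(vocab_suffixes) and shape[0] != config_vocab_size:
            mismatches.append((name, shape[0]))
    return mismatches


if __name__ == "__main__":
    # Shapes taken from the error message; the tensor names are illustrative.
    shapes = {
        "model.embed_tokens.weight": (32002, 4096),
        "lm_head.weight": (32002, 4096),
        "model.norm.weight": (4096,),
    }
    for name, rows in find_vocab_mismatches(32000, shapes):
        print(f"{name}: checkpoint has {rows} rows, config expects 32000")
```

If a check like this reports a mismatch, the config and the saved weights need to agree on the vocabulary size before the model can be loaded for evaluation.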
Hugging Face H4 org

Thank you very much for your patience and for linking the request files :)

Hi, thank you very much for relaunching. I will check the last model.

Hugging Face H4 org

Closing, feel free to reopen if needed

clefourrier changed discussion status to closed
Weyaxi changed discussion status to open
Hugging Face H4 org

Hi,
The new cluster is having serious connectivity problems. We are putting all evals on hold until it's fixed, and we'll relaunch all FAILED evals from the past 2 days.

Hugging Face H4 org

We solved the connectivity issues and the models have been evaluated :)

SaylorTwift changed discussion status to closed
