Resubmit mlabonne/OrpoLlama-3-8B

#725
by mlabonne - opened

Hi,

Is it possible to resubmit mlabonne/OrpoLlama-3-8B? It's an FP16 chat model. Unfortunately, I can't do it myself.

Thanks in advance,

Maxime

I'm also interested in the evaluation results for this one. Apparently it failed, and according to the leaderboard FAQ, providing a direct link to the request file speeds things along:

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/mlabonne/OrpoLlama-3-8B_eval_request_False_float16_Original.json
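
If it helps, the status recorded in that request file can also be checked programmatically. A minimal sketch using huggingface_hub (the "status" field name and its possible values are assumptions on my part, based on how other request files look):

import json

from huggingface_hub import hf_hub_download

# Fetch the eval request file from the requests dataset.
path = hf_hub_download(
    repo_id="open-llm-leaderboard/requests",
    filename="mlabonne/OrpoLlama-3-8B_eval_request_False_float16_Original.json",
    repo_type="dataset",
)

with open(path) as f:
    request = json.load(f)

# Assumed to be one of PENDING / RUNNING / FINISHED / FAILED.
print(request.get("status"))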

Hugging Face H4 org

Hi!
Thanks a lot @Phil337 !
@mlabonne, I've set your model back to pending; it had failed because of the network problems we had earlier.

Closing the issue, feel free to reopen if it fails again :)
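
For context, setting a model back to pending essentially amounts to flipping the status field in the request file linked above back to PENDING. A rough sketch, not our actual tooling (the field name and values are assumptions, and it requires write access to the dataset):

import json

from huggingface_hub import HfApi, hf_hub_download

repo_id = "open-llm-leaderboard/requests"
filename = "mlabonne/OrpoLlama-3-8B_eval_request_False_float16_Original.json"

# Download the current request file and flip its status back to PENDING.
path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
with open(path) as f:
    request = json.load(f)
request["status"] = "PENDING"  # assumed value; failed runs are assumed to show FAILED

# Push the edited file back to the dataset.
HfApi().upload_file(
    path_or_fileobj=json.dumps(request, indent=4).encode(),
    path_in_repo=filename,
    repo_id=repo_id,
    repo_type="dataset",
    commit_message="Reset OrpoLlama-3-8B eval request to PENDING",
)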

clefourrier changed discussion status to closed

Thanks @clefourrier, but it looks like the run failed again: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/mlabonne/OrpoLlama-3-8B_eval_request_False_float16_Original.json

Could you share the error message so I can debug it?

Hugging Face H4 org

Here it is. It looks like another network issue.

tokenizer_config.json: 100%|██████████| 51.2k/51.2k [00:00<00:00, 57.9MB/s]
tokenizer.json: 100%|██████████| 9.08M/9.08M [00:00<00:00, 34.2MB/s]
special_tokens_map.json: 100%|██████████| 419/419 [00:00<00:00, 5.34MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    Model parallel was set to True, setting max memory per GPU to {0: 83631210496} and device map to auto
.../lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:468: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
model.safetensors.index.json: 100%|██████████| 23.9k/23.9k [00:00<00:00, 150MB/s]
Downloading shards:   0%|          | 0/4 [00:20<?, ?it/s]
  } [37.615s]
} [1m4.047s]
Traceback (most recent call last):
  File ".../lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File ".../lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File ".../lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
  File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
    conn.connect()
  File ".../lib/python3.10/site-packages/urllib3/connection.py", line 616, in connect
    self.sock = sock = self._new_conn()
  File ".../lib/python3.10/site-packages/urllib3/connection.py", line 205, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f68b404a980>: Failed to resolve 'hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com' ([Errno -3] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File ".../lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: ... : Failed to resolve 'hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com' ([Errno -3] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
    model = self.AUTO_MODEL_CLASS.from_pretrained(
  File ".../lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File ".../lib/python3.10/site-packages/transformers/modeling_utils.py", line 3436, in from_pretrained
    resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
  File ".../lib/python3.10/site-packages/transformers/utils/hub.py", line 1038, in get_checkpoint_shard_files
    cached_filename = cached_file(
  File ".../lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File ".../lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File ".../lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1461, in hf_hub_download
    http_get(
  File ".../lib/python3.10/site-packages/huggingface_hub/file_download.py", line 468, in http_get
    r = _request_wrapper(
  File ".../lib/python3.10/site-packages/huggingface_hub/file_download.py", line 425, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File ".../lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File ".../lib/python3.10/site-packages/requests/sessions.py", line 725, in send
    history = [resp for resp in gen]
  File ".../lib/python3.10/site-packages/requests/sessions.py", line 725, in <listcomp>
    history = [resp for resp in gen]
  File ".../lib/python3.10/site-packages/requests/sessions.py", line 266, in resolve_redirects
    resp = self.send(
  File ".../lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File ".../lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 63, in send
    return super().send(request, *args, **kwargs)
  File ".../lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError('HTTPSConnectionPool(host=\'hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com\', port=443): Max retries exceeded with url: ...: Failed to resolve \'hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com\' ([Errno -3] Temporary failure in name resolution)"))'), '(Request ID: bf471ade-b843-4e37-8f37-5070efcd509a)')
[2024-05-02 19:27:11,994] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414491 closing signal SIGTERM
[2024-05-02 19:27:11,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414492 closing signal SIGTERM
[2024-05-02 19:27:11,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414493 closing signal SIGTERM
[2024-05-02 19:27:11,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414494 closing signal SIGTERM
[2024-05-02 19:27:11,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414495 closing signal SIGTERM
[2024-05-02 19:27:11,996] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414496 closing signal SIGTERM
[2024-05-02 19:27:11,996] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414497 closing signal SIGTERM
[2024-05-02 19:27:13,215] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 414490) of binary: .../bin/python
Traceback (most recent call last):
  File ".../bin/accelerate", line 8, in <module>
    sys.exit(main())
  File ".../lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File ".../lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File ".../lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File ".../lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File ".../lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File ".../lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-02_19:27:11
  host      : ip-26-0-165-59.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 414490)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-165-59: task 0: Exited with exit code 1
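
For what it's worth, transient DNS hiccups like this one can sometimes be worked around by pre-fetching the shards with a retry loop before the job launches. A minimal sketch, not what our harness actually does:

import time

from huggingface_hub import snapshot_download
from requests.exceptions import ConnectionError

# Pre-download all model files, retrying on transient network errors,
# so the eval itself starts from a warm local cache.
for attempt in range(5):
    try:
        snapshot_download("mlabonne/OrpoLlama-3-8B")
        break
    except ConnectionError:
        # e.g. "Temporary failure in name resolution", as in the log above
        time.sleep(30 * (attempt + 1))
else:
    raise RuntimeError("could not download the model shards after 5 attempts")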

Thanks @clefourrier! I'd like to resubmit it again, if that's possible.

Hugging Face H4 org

Hi, I've just set it back to pending again! I'm investigating with the cluster team in the meantime.
