Resubmit mlabonne/OrpoLlama-3-8B
Hi,
Is it possible to resubmit mlabonne/OrpoLlama-3-8B? FP16, chat model. I can't do it by myself, unfortunately.
Thanks in advance,
Maxime
I'm also interested in the evaluation results for this one. Apparently it failed, and according to the leaderboard FAQ, providing a direct link speeds things along.
Thanks @clefourrier but it looks like the run failed again: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/mlabonne/OrpoLlama-3-8B_eval_request_False_float16_Original.json
Could you share the error message so I can debug it?
Here it is. Again a network issue, apparently.
(
tokenizer_config.json: 100%|██████████| 51.2k/51.2k [00:00<00:00, 57.9MB/s]
tokenizer.json: 100%|██████████| 9.08M/9.08M [00:00<00:00, 34.2MB/s]
special_tokens_map.json: 100%|██████████| 419/419 [00:00<00:00, 5.34MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Model parallel was set to True, setting max memory per GPU to {0: 83631210496} and device map to auto
.../lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:468: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
model.safetensors.index.json: 100%|██████████| 23.9k/23.9k [00:00<00:00, 150MB/s]
Downloading shards: 0%| | 0/4 [00:20<?, ?it/s]
} [37.615s]
} [1m4.047s]
Traceback (most recent call last):
File ".../lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File ".../lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File ".../lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
raise new_e
File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
conn.connect()
File ".../lib/python3.10/site-packages/urllib3/connection.py", line 616, in connect
self.sock = sock = self._new_conn()
File ".../lib/python3.10/site-packages/urllib3/connection.py", line 205, in _new_conn
raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f68b404a980>: Failed to resolve 'hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com' ([Errno -3] Temporary failure in name resolution)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File ".../lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File ".../lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: ... : Failed to resolve 'hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com' ([Errno -3] Temporary failure in name resolution)"))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
model = self.AUTO_MODEL_CLASS.from_pretrained(
File ".../lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File ".../lib/python3.10/site-packages/transformers/modeling_utils.py", line 3436, in from_pretrained
resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
File ".../lib/python3.10/site-packages/transformers/utils/hub.py", line 1038, in get_checkpoint_shard_files
cached_filename = cached_file(
File ".../lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
resolved_file = hf_hub_download(
File ".../lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File ".../lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1461, in hf_hub_download
http_get(
File ".../lib/python3.10/site-packages/huggingface_hub/file_download.py", line 468, in http_get
r = _request_wrapper(
File ".../lib/python3.10/site-packages/huggingface_hub/file_download.py", line 425, in _request_wrapper
response = get_session().request(method=method, url=url, **params)
File ".../lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File ".../lib/python3.10/site-packages/requests/sessions.py", line 725, in send
history = [resp for resp in gen]
File ".../lib/python3.10/site-packages/requests/sessions.py", line 725, in <listcomp>
history = [resp for resp in gen]
File ".../lib/python3.10/site-packages/requests/sessions.py", line 266, in resolve_redirects
resp = self.send(
File ".../lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File ".../lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 63, in send
return super().send(request, *args, **kwargs)
File ".../lib/python3.10/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError('HTTPSConnectionPool(host=\'hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com\', port=443): Max retries exceeded with url: ...: Failed to resolve \'hf-hub-lfs-us-east-1.s3.us-east-1.amazonaws.com\' ([Errno -3] Temporary failure in name resolution)"))'), '(Request ID: bf471ade-b843-4e37-8f37-5070efcd509a)')
[2024-05-02 19:27:11,994] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414491 closing signal SIGTERM
[2024-05-02 19:27:11,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414492 closing signal SIGTERM
[2024-05-02 19:27:11,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414493 closing signal SIGTERM
[2024-05-02 19:27:11,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414494 closing signal SIGTERM
[2024-05-02 19:27:11,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414495 closing signal SIGTERM
[2024-05-02 19:27:11,996] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414496 closing signal SIGTERM
[2024-05-02 19:27:11,996] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 414497 closing signal SIGTERM
[2024-05-02 19:27:13,215] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 414490) of binary: .../bin/python
Traceback (most recent call last):
File ".../bin/accelerate", line 8, in <module>
sys.exit(main())
File ".../lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File ".../lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File ".../lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File ".../lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File ".../lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File ".../lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-02_19:27:11
host : ip-26-0-165-59.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 414490)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-165-59: task 0: Exited with exit code 1
Thanks @clefourrier! I would like to resubmit it, if that's possible.
Hi, just passed it to pending again! I'm investigating with the cluster team in the meantime.
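For anyone who hits the same "Temporary failure in name resolution" error when downloading shards on their own machine, here is a minimal sketch of retrying a flaky call with exponential backoff. It is not what the leaderboard backend does (nothing in this thread shows that); `retry_with_backoff` and its parameters are hypothetical names made up for illustration. It relies on the fact that both `socket.gaierror` and `requests.exceptions.ConnectionError` are subclasses of `OSError`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Possible usage (assumes huggingface_hub is installed):
#   from huggingface_hub import snapshot_download
#   retry_with_backoff(lambda: snapshot_download("mlabonne/OrpoLlama-3-8B"))
```

This only helps with transient failures; if DNS stays broken on the node, the last error is re-raised once the attempts run out.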