W0408 11:05:05.680000 100112 site-packages/torch/distributed/run.py:793]
W0408 11:05:05.680000 100112 site-packages/torch/distributed/run.py:793] *****************************************
W0408 11:05:05.680000 100112 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0408 11:05:05.680000 100112 site-packages/torch/distributed/run.py:793] *****************************************
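
This warning is informational: torchrun defaults OMP_NUM_THREADS to 1 in every worker so that four processes do not oversubscribe the CPU. If CPU-side work (data loading, preprocessing) is a bottleneck, the variable can be raised; a minimal sketch, assuming it is set before torch is imported (the value 8 is an arbitrary example, not taken from this run):

    # Must run before "import torch" so the OpenMP runtime sees it;
    # equivalently, export OMP_NUM_THREADS=8 in the shell before torchrun.
    import os
    os.environ.setdefault("OMP_NUM_THREADS", "8")

    import torch
    torch.set_num_threads(int(os.environ["OMP_NUM_THREADS"]))  # intra-op thread pool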
Time grid: t_c=0.25, steps (1→t_c)=100, (t_c→0)=2
Total number of images that will be sampled: 5120
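
The two lines above come from the sampling script itself: a two-segment time grid with 100 uniform steps from t=1 down to t_c=0.25 and 2 more steps from t_c down to 0, applied to 5120 images over the 20 progress-bar iterations below (i.e. 256 images per batch across the four ranks). A minimal sketch of how such a grid might be built (function and argument names are illustrative, not from the actual script):

    import torch

    def two_segment_time_grid(t_c: float = 0.25, n_hi: int = 100, n_lo: int = 2) -> torch.Tensor:
        # n_hi uniform steps on [1, t_c], then n_lo steps on [t_c, 0];
        # drop the duplicated t_c where the two segments meet.
        hi = torch.linspace(1.0, t_c, n_hi + 1)
        lo = torch.linspace(t_c, 0.0, n_lo + 1)
        return torch.cat([hi, lo[1:]])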
|
sampling:   0%|          | 0/20 [00:00<?, ?it/s][rank3]:[W408 11:06:31.576422087 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W408 11:06:31.621528015 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W408 11:06:31.627182760 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W408 11:06:32.966218444 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
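
These four warnings are harmless here but flag a real ambiguity: barrier() is reached before each rank has bound itself to a CUDA device, so NCCL guesses the rank-to-GPU mapping. The fixes the message itself suggests would look like the following in the worker script; a minimal sketch, assuming one process per GPU with LOCAL_RANK provided by torchrun:

    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # pin this process to its GPU first

    # Option 1 (torch >= 2.3): bind the process group to a device up front.
    dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))

    # Option 2: name the device explicitly at each barrier instead.
    dist.barrier(device_ids=[local_rank])

Either option removes the guesswork and silences the warning.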
|
sampling:   5%|▌         | 1/20 [00:28<09:00, 28.44s/it]
sampling:  10%|█         | 2/20 [00:54<08:11, 27.28s/it]
sampling:  15%|█▌        | 3/20 [01:21<07:37, 26.89s/it]
sampling:  20%|██        | 4/20 [01:47<07:07, 26.74s/it]
sampling:  25%|██▌       | 5/20 [02:14<06:39, 26.61s/it]
sampling:  30%|███       | 6/20 [02:40<06:12, 26.57s/it]
sampling:  35%|███▌      | 7/20 [03:07<05:45, 26.58s/it]
sampling:  40%|████      | 8/20 [03:33<05:18, 26.57s/it]
sampling:  45%|████▌     | 9/20 [04:00<04:51, 26.53s/it]
sampling:  50%|█████     | 10/20 [04:26<04:24, 26.48s/it]
sampling:  55%|█████▌    | 11/20 [04:53<03:58, 26.51s/it]W0408 11:11:04.741000 100112 site-packages/torch/distributed/elastic/agent/server/api.py:704] Received 15 death signal, shutting down workers
W0408 11:11:04.746000 100112 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 100158 closing signal SIGTERM
W0408 11:11:04.748000 100112 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 100159 closing signal SIGTERM
W0408 11:11:04.749000 100112 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 100160 closing signal SIGTERM
W0408 11:11:04.749000 100112 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 100161 closing signal SIGTERM
Traceback (most recent call last):
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 855, in _invoke_run
    time.sleep(monitor_interval)
  File "/gemini/space/zhaozy/guzhenyu/envs/envs/SiT/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 100112 got signal: 15
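
The failure is external, not a bug in the sampling code: signal 15 is SIGTERM, delivered to the torchrun launcher (PID 100112) at 11:11:04, 4 minutes 53 seconds into the progress bar; typical senders are a cluster scheduler enforcing a time or resource limit, or a manual kill. The elastic agent then forwards SIGTERM to the four workers (100158-100161) and re-raises the signal as SignalException, which is what the traceback shows. If partial results should survive such a kill, the worker script can trap SIGTERM itself; a minimal sketch (the checkpointing body is a placeholder, not from this codebase):

    import signal
    import sys

    def _handle_sigterm(signum, frame):
        # Placeholder: flush or checkpoint whatever partial samples matter,
        # then exit cleanly so the rank does not die mid-write.
        print(f"got signal {signum}; saving partial results", flush=True)
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handle_sigterm)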
|
|