Another Error

#5
by Gemneye - opened

I tried upgrading torch, torchvision, and torchaudio to see if it made a difference, and now I'm getting a new error. I also downloaded the distilled models in case I can't run the 24B model.

(magi) root@46e1abf287b8:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
[rank0]: pipeline = MagiPipeline(args.config_file)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 32, in init
[rank0]: dist_init(self.config)
[rank0]: File "/workspace/MAGI-1/inference/infra/distributed/dist_utils.py", line 48, in dist_init
[rank0]: assert config.engine_config.cp_size * config.engine_config.pp_size == torch.distributed.get_world_size()
[rank0]: AssertionError
[rank0]:[W423 02:54:17.933492678 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0423 02:54:19.241000 5094 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 5163) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 8, in
sys.exit(main())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

inference/pipeline/entry.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-04-23_02:54:19
host : 46e1abf287b8
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5163)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

It looks like the config needs some modifications. Could you let me know how many GPUs you’re using and what type they are?
Also, make sure that pp_size * cp_size equals the total number of GPUs.
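For example, a quick pre-launch check might look something like this (just a sketch; the config path and the "engine_config" nesting are assumptions based on the assertion in your traceback, not verified against the repo):

```python
# Hypothetical pre-flight check: confirm cp_size * pp_size matches the number of
# GPU processes torchrun will launch. The config path and the "engine_config"
# nesting are assumptions inferred from the failing assertion.
import json

NUM_GPUS = 2  # set this to the value you pass to torchrun (--nproc-per-node)

with open("example/24B/24B_config.json") as f:  # path is an assumption
    cfg = json.load(f)

cp_size = cfg["engine_config"]["cp_size"]
pp_size = cfg["engine_config"]["pp_size"]

if cp_size * pp_size != NUM_GPUS:
    print(f"Mismatch: cp_size ({cp_size}) * pp_size ({pp_size}) = {cp_size * pp_size}, "
          f"but torchrun will start {NUM_GPUS} processes.")
else:
    print("cp_size * pp_size matches the world size.")
```

On a single GPU both values need to be 1; with two GPUs, one of them should be 2 and the other 1.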

I started all over from scratch. I am getting further but still having problems.

[2025-04-24 01:04:51,105 - INFO] After build_dit_model, memory allocated: 0.02 GB, memory reserved: 0.08 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]: self._run(prompt, prefix_video, output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 48, in _run
[rank0]: dit = get_dit(self.config)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank0]: model = load_checkpoint(model)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 155, in load_checkpoint
[rank0]: state_dict = load_state_dict(model.runtime_config, model.engine_config)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 145, in load_state_dict
[rank0]: assert os.path.exists(inference_weight_dir)
[rank0]: AssertionError
E0424 01:04:52.556000 132482488543040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3378) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

This is from the 24B_config.json file:

55 "clean_chunk_kvrange": 1,
56 "clean_t": 0.9999,
57 "seed": 83746,
58 "num_frames": 121,
59 "video_size_h": 540,
60 "video_size_w": 960,
61 "num_steps": 8,
62 "window_size": 4,
63 "fps": 24,
64 "chunk_width": 6,
65 "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight",
66 "t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
67 "t5_device": "cuda",
68 "vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
69 "scale_factor": 0.18215,
70 "temporal_downsample_factor": 4

I have no idea what is going on, but the files in the directory configured by the "load" parameter are the same as those on Hugging Face. I am not sure about this error: "assert os.path.exists(inference_weight_dir)". I tried changing the "load" directory to one level back, but that did not make a difference. I tried this with both a single L40 and with 2x L40s; I am not sure whether those specs are too low for this. I will try one of the other configurations with the other models, but I certainly cannot get this to work.

I used cp_size = 2 when I was using 2x L40s and cp_size = 1 when using a single L40.
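For what it's worth, a quick script like this should at least show whether the directories the config references actually exist (paths copied from my config above; which exact subdirectory the loader derives as inference_weight_dir is an assumption on my part):

```python
# Quick sanity check of the directories my 24B_config.json points at.
# Paths are copied from the config above; how the checkpoint loader builds
# inference_weight_dir from "load" is an assumption I can't verify.
import os

paths = {
    "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight",
    "t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
    "vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
}

for name, path in paths.items():
    exists = os.path.isdir(path)
    print(f"{name}: {path} -> exists: {exists}")
    if exists:
        # Show what is actually on disk, to compare against the Hugging Face listing.
        print("  contents:", os.listdir(path))
```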
