Another Error

#5
by Gemneye - opened

I tried upgrading torch, torchvision, and torchaudio to see if it made a difference, and now I'm getting a new error. I also downloaded the distilled models in case I can't run the 24B model.

(magi) root@46e1abf287b8:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
[rank0]: pipeline = MagiPipeline(args.config_file)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 32, in init
[rank0]: dist_init(self.config)
[rank0]: File "/workspace/MAGI-1/inference/infra/distributed/dist_utils.py", line 48, in dist_init
[rank0]: assert config.engine_config.cp_size * config.engine_config.pp_size == torch.distributed.get_world_size()
[rank0]: AssertionError
[rank0]:[W423 02:54:17.933492678 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0423 02:54:19.241000 5094 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 5163) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 8, in
sys.exit(main())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

inference/pipeline/entry.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-04-23_02:54:19
host : 46e1abf287b8
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5163)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

It looks like the config needs some modifications. Could you let me know how many GPUs you’re using and what type they are?
Also, make sure that pp_size * cp_size equals the total number of GPUs.
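For example, a quick pre-launch check might look something like this (just a sketch; the config path and the "engine_config" nesting are assumptions based on the assertion in your traceback, not verified against the repo):

```python
# Hypothetical pre-flight check: confirm cp_size * pp_size matches the number of
# GPU processes torchrun will launch. The config path and the "engine_config"
# nesting are assumptions inferred from the failing assertion.
import json

NUM_GPUS = 2  # set this to the value you pass to torchrun (--nproc-per-node)

with open("example/24B/24B_config.json") as f:  # path is an assumption
    cfg = json.load(f)

cp_size = cfg["engine_config"]["cp_size"]
pp_size = cfg["engine_config"]["pp_size"]

if cp_size * pp_size != NUM_GPUS:
    print(f"Mismatch: cp_size ({cp_size}) * pp_size ({pp_size}) = {cp_size * pp_size}, "
          f"but torchrun will start {NUM_GPUS} processes.")
else:
    print("cp_size * pp_size matches the world size.")
```

On a single GPU both values need to be 1; with two GPUs, one of them should be 2 and the other 1.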

I started all over from scratch. I am getting further but still having problems.

[2025-04-24 01:04:51,105 - INFO] After build_dit_model, memory allocated: 0.02 GB, memory reserved: 0.08 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]: self._run(prompt, prefix_video, output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 48, in _run
[rank0]: dit = get_dit(self.config)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank0]: model = load_checkpoint(model)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 155, in load_checkpoint
[rank0]: state_dict = load_state_dict(model.runtime_config, model.engine_config)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 145, in load_state_dict
[rank0]: assert os.path.exists(inference_weight_dir)
[rank0]: AssertionError
E0424 01:04:52.556000 132482488543040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3378) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

This is from the 24B_config.json file:

55 "clean_chunk_kvrange": 1,
56 "clean_t": 0.9999,
57 "seed": 83746,
58 "num_frames": 121,
59 "video_size_h": 540,
60 "video_size_w": 960,
61 "num_steps": 8,
62 "window_size": 4,
63 "fps": 24,
64 "chunk_width": 6,
65 "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight",
66 "t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
67 "t5_device": "cuda",
68 "vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
69 "scale_factor": 0.18215,
70 "temporal_downsample_factor": 4

I have no idea what is going on, but the files in the directory configured by the "load" parameter are the same as those on Hugging Face. I am not sure about this error: "assert os.path.exists(inference_weight_dir)". I tried changing the "load" directory to one level back, but that did not make a difference. I tried this with both a single L40 and with 2x L40s; I am not sure whether those specs are too low for this. I will try one of the other configurations with the other models, but I certainly cannot get this to work.

I used cp_size = 2 when I was using 2x L40s and cp_size = 1 when using a single L40.
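For what it's worth, a quick script like this should at least show whether the directories the config references actually exist (paths copied from my config above; which exact subdirectory the loader derives as inference_weight_dir is an assumption on my part):

```python
# Quick sanity check of the directories my 24B_config.json points at.
# Paths are copied from the config above; how the checkpoint loader builds
# inference_weight_dir from "load" is an assumption I can't verify.
import os

paths = {
    "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight",
    "t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
    "vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
}

for name, path in paths.items():
    exists = os.path.isdir(path)
    print(f"{name}: {path} -> exists: {exists}")
    if exists:
        # Show what is actually on disk, to compare against the Hugging Face listing.
        print("  contents:", os.listdir(path))
```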
