Getting the following error when trying to run 24B model

#3
by Gemneye - opened
  1. It looks like the matplotlib and rich modules are missing from requirements.txt.
  2. I am getting the error below. Running on RunPod with an ADA6000, in a conda environment created with the commands from the documentation, on Ubuntu 22.04.
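For point 1, a possible workaround (assuming the active conda env is the one used to run the model) is to install the missing modules manually:

```shell
# Install the modules that appear to be missing from requirements.txt
python -m pip install matplotlib rich
```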

```
(magi) root@a86f02cd24e3:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
Traceback (most recent call last):
  File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
    main()
  File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
    pipeline = MagiPipeline(args.config_file)
  File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 29, in __init__
    self.config = MagiConfig.from_json(config_path)
  File "/workspace/MAGI-1/inference/common/config.py", line 159, in from_json
    post_validation(magi_config)
  File "/workspace/MAGI-1/inference/common/config.py", line 154, in post_validation
    magi_config.runtime_config.cfg_number == 1
AssertionError: Please set cfg_number: 1 in config.json for distill or quant model
E0423 01:33:36.885000 124462029551424 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 5233) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
  File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

inference/pipeline/entry.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2025-04-23_01:33:36
  host      : a86f02cd24e3
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5233)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

It looks like you’re using a distill or quant model. Please make sure that cfg_number is set to 1 in your config.json file.
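Based on the assertion path in the traceback (`magi_config.runtime_config.cfg_number`), the relevant fragment of config.json would look something like this (other keys omitted; the exact surrounding structure may differ in your file):

```json
{
  "runtime_config": {
    "cfg_number": 1
  }
}
```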
