Loading with Quantization? #12
by danielplominski
Hello Mistral AI team,

Is there any chance of loading this model on less powerful hardware? Our biggest VM can use 2x NVIDIA A6000 cards.

Running it via Docker with vLLM does not work:
```sh
#!/bin/sh

export CUDA_VISIBLE_DEVICES="0,1"

docker run \
  --gpus='"device=0,1"' \
  --runtime nvidia \
  -v /opt/cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=SECRET" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Pixtral-Large-Instruct-2411 \
  --tokenizer_mode mistral \
  --load_format mistral \
  --config_format mistral \
  --limit_mm_per_prompt 'image=10' \
  --tensor-parallel-size 8 \
  --max_model_len=1024 \
  --quantization=fp8

# EOF
```
Errors:
```
...
...
...
INFO 11-20 01:54:28 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-20 01:54:28 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-20 01:54:28 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-20 01:54:28 config.py:791] Possibly too large swap space. 32.00 GiB out of the 62.84 GiB total CPU memory is allocated for the swap space.
INFO 11-20 01:54:33 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-20 01:54:33 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-20 01:54:33 config.py:791] Possibly too large swap space. 32.00 GiB out of the 62.84 GiB total CPU memory is allocated for the swap space.
2024-11-20 01:54:35,429 INFO worker.py:1819 -- Started a local Ray instance.
Process SpawnProcess-1:
ERROR 11-20 01:54:36 engine.py:366] The number of required GPUs exceeds the total number of available GPUs in the placement group.
ERROR 11-20 01:54:36 engine.py:366] Traceback (most recent call last):
ERROR 11-20 01:54:36 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-20 01:54:36 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
...
...
...
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
root@ai-ubuntu22gpu-big:/opt#
```
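
If I read the traceback correctly, the immediate failure ("The number of required GPUs exceeds the total number of available GPUs in the placement group") comes from `--tensor-parallel-size 8` while only two GPUs are visible via `CUDA_VISIBLE_DEVICES="0,1"`. Below is the untested variant I would try next, with tensor parallelism matched to the two A6000s; whether the fp8-quantized weights of this model fit into 2x 48 GB at all is exactly my question:

```sh
#!/bin/sh
# Untested sketch: same invocation as above, but --tensor-parallel-size
# reduced to the two GPUs that are actually visible. Everything else
# (paths, token placeholder, flags) is unchanged.

export CUDA_VISIBLE_DEVICES="0,1"

docker run \
  --gpus='"device=0,1"' \
  --runtime nvidia \
  -v /opt/cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=SECRET" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Pixtral-Large-Instruct-2411 \
  --tokenizer_mode mistral \
  --load_format mistral \
  --config_format mistral \
  --limit_mm_per_prompt 'image=10' \
  --tensor-parallel-size 2 \
  --max_model_len=1024 \
  --quantization=fp8
```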