Model does not run with vLLM
I'm trying to run the starter code from the models README, but it seems the model never gets loaded into memory. Why might this be?
Here is my output; it gets stuck at this point and does not load the model into memory:
INFO 12-17 07:38:41 config.py:1020] Defaulting to use mp for distributed inference
WARNING 12-17 07:38:41 arg_utils.py:1023] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 12-17 07:38:41 config.py:503] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 12-17 07:38:41 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic', speculative_config=None, tokenizer='neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
WARNING 12-17 07:38:41 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 12-17 07:38:41 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 12-17 07:38:42 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=252259) INFO 12-17 07:38:42 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=252259) INFO 12-17 07:38:42 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=252259) INFO 12-17 07:38:42 utils.py:961] Found nccl from library libnccl.so.2
INFO 12-17 07:38:42 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=252259) INFO 12-17 07:38:42 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 12-17 07:38:42 pynccl.py:69] vLLM is using nccl==2.21.5
What GPUs are you using? For the command in the README, we tested on 4x H100:
vllm serve neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16 --tensor-parallel-size 4
You might be able to fit it on smaller GPUs by decreasing --max-num-seqs or using a smaller --max-model-len, like the warning says.
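For example, something along these lines might fit in less GPU memory (the values here are only illustrative starting points, and --tensor-parallel-size should match however many GPUs you actually have):
vllm serve neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 4 --max-model-len 8192 --tensor-parallel-size 4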
I have 2x A100, one with 40GB of VRAM and one with 80GB, but whenever I pass in the tensor-parallel-size parameter the output gets stuck. I have modified both --max-num-seqs and --max-model-len to see if it helps, but for some reason the parallelization argument seems to be causing the issue.
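For reference, the config line in the log above corresponds to an invocation roughly like this (I have also tried adding smaller --max-num-seqs and --max-model-len values on top of it):
vllm serve neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic --enforce-eager --tensor-parallel-size 2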