RuntimeError: start (0) + length (1280) exceeds dimension size (1024).

#2 opened by ouuo

Is it expected that when I try to run this model with vLLM, I get the error above?

ModelCloud.AI org

@ouuo Update vLLM to the latest release and enable FlashInfer. The error should go away.
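
For reference, a minimal offline sketch of forcing the FlashInfer backend via the VLLM_ATTENTION_BACKEND environment variable (the engine arguments mirror the server flags `--quantization gptq --dtype half --max-model-len 8192`; the prompt and sampling settings are placeholders):

```python
import os

# vLLM's backend selector reads VLLM_ATTENTION_BACKEND; set it before
# the engine is constructed to force FlashInfer.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Offline engine pointed at the GPTQ checkpoint.
llm = LLM(
    model="ModelCloud/Mistral-Nemo-Instruct-2407-gptq-4bit",
    quantization="gptq",
    dtype="half",
    max_model_len=8192,
)

out = llm.generate(["Hello, who are you?"],  # placeholder prompt
                   SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```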

Thank you for the answer, but the error did not go away. I added VLLM_ATTENTION_BACKEND=FLASHINFER and vLLM is 0.5.2. Can you tell me what causes this error? Here is the log:

2024-07-23T00:25:05.520915025Z INFO 07-23 00:25:05 api_server.py:212] vLLM API server version 0.5.2
2024-07-23T00:25:05.521125152Z INFO 07-23 00:25:05 api_server.py:213] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='ModelCloud/Mistral-Nemo-Instruct-2407-gptq-4bit', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='gptq', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-07-23T00:25:05.786126741Z WARNING 07-23 00:25:05 config.py:1378] Casting torch.bfloat16 to torch.float16.
2024-07-23T00:25:05.820162709Z INFO 07-23 00:25:05 gptq_marlin.py:91] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
2024-07-23T00:25:05.820204789Z WARNING 07-23 00:25:05 config.py:241] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-07-23T00:25:05.821898929Z INFO 07-23 00:25:05 llm_engine.py:174] Initializing an LLM engine (v0.5.2) with config: model='ModelCloud/Mistral-Nemo-Instruct-2407-gptq-4bit', speculative_config=None, tokenizer='ModelCloud/Mistral-Nemo-Instruct-2407-gptq-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=ModelCloud/Mistral-Nemo-Instruct-2407-gptq-4bit, use_v2_block_manager=False, enable_prefix_caching=False)
2024-07-23T00:25:06.741343648Z INFO 07-23 00:25:06 selector.py:79] Using Flashinfer backend.
2024-07-23T00:25:07.020780406Z INFO 07-23 00:25:07 selector.py:79] Using Flashinfer backend.
2024-07-23T00:25:07.669493581Z INFO 07-23 00:25:07 weight_utils.py:218] Using model weights format ['*.safetensors']
2024-07-23T00:25:07.986041995Z INFO 07-23 00:25:07 weight_utils.py:261] No model.safetensors.index.json found in remote.
2024-07-23T00:25:09.347538872Z [rank0]: Traceback (most recent call last):
2024-07-23T00:25:09.347590740Z [rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-07-23T00:25:09.347595840Z [rank0]: return _run_code(code, main_globals, None,
2024-07-23T00:25:09.347600569Z [rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-07-23T00:25:09.347604767Z [rank0]: exec(code, run_globals)
2024-07-23T00:25:09.347608905Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 282, in <module>
2024-07-23T00:25:09.347613634Z [rank0]: run_server(args)
2024-07-23T00:25:09.347618022Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 224, in run_server
2024-07-23T00:25:09.347622230Z [rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
2024-07-23T00:25:09.347626408Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
2024-07-23T00:25:09.347630626Z [rank0]: engine = cls(
2024-07-23T00:25:09.347634854Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 373, in __init__
2024-07-23T00:25:09.347639593Z [rank0]: self.engine = self._init_engine(*args, **kwargs)
2024-07-23T00:25:09.347643721Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 520, in _init_engine
2024-07-23T00:25:09.347647838Z [rank0]: return engine_class(*args, **kwargs)
2024-07-23T00:25:09.347652026Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in __init__
2024-07-23T00:25:09.347656264Z [rank0]: self.model_executor = executor_class(
2024-07-23T00:25:09.347660422Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 150, in __init__
2024-07-23T00:25:09.347664580Z [rank0]: super().__init__(model_config, cache_config, parallel_config,
2024-07-23T00:25:09.347668688Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 46, in __init__
2024-07-23T00:25:09.347672785Z [rank0]: self._init_executor()
2024-07-23T00:25:09.347676903Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 25, in _init_executor
2024-07-23T00:25:09.347681011Z [rank0]: self.driver_worker.load_model()
2024-07-23T00:25:09.347685149Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
2024-07-23T00:25:09.347689277Z [rank0]: self.model_runner.load_model()
2024-07-23T00:25:09.347693505Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 256, in load_model
2024-07-23T00:25:09.347701319Z [rank0]: self.model = get_model(model_config=self.model_config,
2024-07-23T00:25:09.347718141Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
2024-07-23T00:25:09.347722469Z [rank0]: return loader.load_model(model_config=model_config,
2024-07-23T00:25:09.347726657Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 270, in load_model
2024-07-23T00:25:09.347730855Z [rank0]: model.load_weights(
2024-07-23T00:25:09.347735013Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 455, in load_weights
2024-07-23T00:25:09.347739211Z [rank0]: weight_loader(param, loaded_weight, shard_id)
2024-07-23T00:25:09.347743359Z [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 623, in weight_loader
2024-07-23T00:25:09.347747467Z [rank0]: loaded_weight = loaded_weight.narrow(output_dim, start_idx,
2024-07-23T00:25:09.347751604Z [rank0]: RuntimeError: start (0) + length (1280) exceeds dimension size (1024).
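
For context on what the exception itself means: `torch.Tensor.narrow(dim, start, length)` slices `length` elements along `dim`, and raises exactly this error when `start + length` overruns the tensor. A two-line repro (the shapes are taken from the error message, not from the actual checkpoint):

```python
import torch

# A stand-in weight whose sliced dimension has 1024 rows, as in the error.
w = torch.empty(1024, 5120)

# The weight loader asks for a 1280-wide shard starting at 0, so:
w.narrow(0, 0, 1280)  # RuntimeError: start (0) + length (1280) exceeds dimension size (1024).
```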

Use the latest vLLM, 0.5.3.post1.
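
A note on the likely cause, for anyone who lands on this thread: Mistral-Nemo sets an explicit head_dim=128 in its config, which is not equal to hidden_size // num_attention_heads, and vLLM releases before 0.5.3 derived the head size from that division, so the loader expected a larger k/v shard than the checkpoint contains. The arithmetic below (config values from Mistral-Nemo-Instruct-2407) reproduces the exact numbers in the error; treat it as a plausibility check rather than a confirmed diagnosis:

```python
# Values from Mistral-Nemo-Instruct-2407 config.json.
hidden_size = 5120
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128  # explicit, and != hidden_size // num_attention_heads

derived = hidden_size // num_attention_heads  # 160, what older vLLM assumed
print(num_key_value_heads * derived)          # 1280 -> "length (1280)"
print(num_key_value_heads * head_dim)         # 1024 -> "dimension size (1024)"
```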

ok, thank you, that works
