TypeError: 'NoneType' object is not callable
!vllm serve "google/gemma-2-2b-it" --dtype=half
Error:
INFO 01-14 09:41:39 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 01-14 09:41:39 api_server.py:220] args: Namespace(model_tag='/kaggle/input/gemma/transformers/2b-it/3', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/kaggle/input/gemma/transformers/2b-it/3', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7b974351add0>)
WARNING 01-14 09:41:39 config.py:1425] Casting torch.bfloat16 to torch.float16.
INFO 01-14 09:41:39 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/kaggle/input/gemma/transformers/2b-it/3', speculative_config=None, tokenizer='/kaggle/input/gemma/transformers/2b-it/3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/kaggle/input/gemma/transformers/2b-it/3, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 01-14 09:41:40 selector.py:80] Using Flashinfer backend.
INFO 01-14 09:41:41 model_runner.py:680] Starting to load model /kaggle/input/gemma/transformers/2b-it/3...
INFO 01-14 09:41:41 selector.py:80] Using Flashinfer backend.
WARNING 01-14 09:41:41 gemma.py:56] Gemma's activation function was incorrectly set to exact GeLU in the config JSON file when it was initially released. Changing the activation function to approximate GeLU (`gelu_pytorch_tanh`). If you want to use the legacy `gelu`, edit the config JSON to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:23<00:23, 23.04s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:23<00:00, 9.81s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:23<00:00, 11.79s/it]
INFO 01-14 09:42:04 model_runner.py:692] Loading model weights took 4.7384 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/bin/vllm", line 8, in
[rank0]: sys.exit(main())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 148, in main
[rank0]: args.dispatch_function(args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 28, in serve
[rank0]: run_server(args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]: BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable
Hi @pofce,
I ran the code in Google Colab on the "T4 GPU" runtime and was able to reproduce the issue. Please install the following libraries at the versions listed below to avoid the above error (install commands are sketched after the list). For further details, you can refer to the linked gist file.
vllm == 0.6.6.post1
torch == 2.5.1+cu121
transformers == 4.48.0
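A minimal install sketch for a Colab notebook cell, matching the versions listed above; the cu121 extra index URL for the torch wheel is an assumption, so adjust it to your CUDA setup:

!pip install -q vllm==0.6.6.post1 transformers==4.48.0
!pip install -q torch==2.5.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
!python -c "import vllm, torch, transformers; print(vllm.__version__, torch.__version__, transformers.__version__)"

After installing, restart the runtime before rerunning !vllm serve so the new versions are picked up.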
Thank you.