Issue with vLLM Deployment of gemma-3-12b-it on Tesla T4 - No Output
Hello everyone, I’m trying to deploy the gemma-3-12b-it model using vLLM on two Tesla T4 GPUs, but I’m running into some issues. I’d really appreciate any help or insights from the community!
Environment Details
Model: gemma-3-12b-it
Transformers Version: transformers-4.49.0-Gemma-3
vLLM Version: 0.8.0rc3.dev5+g5eeabc2a.precompiled
Deployment Command
vllm serve /data/gemma/gemma-3-12b-it \
  --served-model-name gemma-3-12b-it \
  --dtype=float16 \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 2 \
  --max-model-len 3000
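As a quick sanity check against the same host and port, the OpenAI-compatible /v1/models endpoint confirms the server came up and shows the name the model was registered under:
curl http://10.88.99.223:19998/v1/models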
Test Request
I sent the following curl request to test the deployment:
curl http://10.88.99.223:19998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-12b-it",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
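A variant of the same request with "stream": false returns a single JSON object instead of SSE chunks, which makes an empty content field easier to inspect; a minimal sketch against the same endpoint:
curl http://10.88.99.223:19998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-12b-it",
    "max_tokens": 1024,
    "stream": false,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'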
Response
The streaming response I received is as follows, but the content field in every chunk is empty:
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
Problem
The model doesn’t seem to generate any meaningful output. The content in the response remains empty, and I’m not sure what’s going wrong. Has anyone encountered a similar issue with vLLM or this specific model? Could it be related to the model configuration, GPU setup, or something else?
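One way to check the checkpoint's native dtype, assuming the local config.json exposes a top-level torch_dtype field, is:
python -c "import json; print(json.load(open('/data/gemma/gemma-3-12b-it/config.json')).get('torch_dtype'))"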
Any suggestions or troubleshooting tips would be greatly appreciated. Thanks in advance!
I’m having the same issue with the vLLM V0 AsyncLLMEngine running on an A100 40 GB: a blank response ("") is still being output. I suspect the reason behind it is the use of AutoProcessor.apply_chat_template.
Hey, I just solved this.
You mentioned --dtype=float16, but Gemma is trained in bfloat16. Don’t explicitly set the dtype and your command will work fine.
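For reference, here is a sketch of that suggestion applied to the command from the original post: the explicit --dtype flag removed, every other flag unchanged.
vllm serve /data/gemma/gemma-3-12b-it \
  --served-model-name gemma-3-12b-it \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 2 \
  --max-model-len 3000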
What specific steps should I take to solve this problem? Please provide some guidance, thank you.