Issue with vLLM Deployment of gemma-3-12b-it on Tesla T4 - No Output
Hello everyone, I’m trying to deploy the gemma-3-12b-it model using vLLM on two Tesla T4 GPUs, but I’m running into some issues. I’d really appreciate any help or insights from the community!
Environment Details
Model: gemma-3-12b-it
Transformers Version: transformers-4.49.0-Gemma-3
vLLM Version: 0.8.0rc3.dev5+g5eeabc2a.precompiled
Deployment Command
vllm serve /data/gemma/gemma-3-12b-it \
  --served-model-name gemma-3-12b-it \
  --dtype=float16 \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 2 \
  --max-model-len 3000
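As a quick sanity check against the same host and port, the OpenAI-compatible /v1/models endpoint confirms the server came up and shows the name the model was registered under:
curl http://10.88.99.223:19998/v1/models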
Test Request
I sent the following curl request to test the deployment:
curl http://10.88.99.223:19998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-12b-it",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
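A variant of the same request with "stream": false returns a single JSON object instead of SSE chunks, which makes an empty content field easier to inspect; a minimal sketch against the same endpoint:
curl http://10.88.99.223:19998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-12b-it",
    "max_tokens": 1024,
    "stream": false,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'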
Response
The streaming response I received is as follows, but the content field in every chunk is empty:
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
Problem
The model doesn’t seem to generate any meaningful output. The content in the response remains empty, and I’m not sure what’s going wrong. Has anyone encountered a similar issue with vLLM or this specific model? Could it be related to the model configuration, GPU setup, or something else?
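One way to check the checkpoint's native dtype, assuming the local config.json exposes a top-level torch_dtype field, is:
python -c "import json; print(json.load(open('/data/gemma/gemma-3-12b-it/config.json')).get('torch_dtype'))"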
Any suggestions or troubleshooting tips would be greatly appreciated. Thanks in advance!
I’m having the same issue with the vLLM V0 AsyncLLMEngine running on an A100 40 GB: a blank response ("") is still being output. I suspect the reason behind it is the use of AutoProcessor.apply_chat_template.
Hey, I just solved this.
You mentioned --dtype=float16, but Gemma is trained in bfloat16. Don’t explicitly set the dtype and your command will work fine.
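For reference, here is a sketch of that suggestion applied to the command from the original post: the explicit --dtype flag removed, every other flag unchanged.
vllm serve /data/gemma/gemma-3-12b-it \
  --served-model-name gemma-3-12b-it \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 2 \
  --max-model-len 3000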
What specific steps should I take to solve this problem? Please provide some guidance, thank you.