Issue with vLLM Deployment of gemma-3-12b-it on Tesla T4 - No Output

#16
by twodaix - opened

Hello everyone, I’m trying to deploy the gemma-3-12b-it model using vLLM on two Tesla T4 GPUs, but I’m running into some issues. I’d really appreciate any help or insights from the community!

Environment Details
Model: gemma-3-12b-it
Transformers Version: transformers-4.49.0-Gemma-3
vLLM Version: 0.8.0rc3.dev5+g5eeabc2a.precompiled

Deployment Command
vllm serve /data/gemma/gemma-3-12b-it \
  --served-model-name gemma-3-12b-it \
  --dtype=float16 \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 2 \
  --max-model-len 3000

Test Request
I sent the following curl request to test the deployment:
curl http://10.88.99.223:19998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-12b-it",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Response
The response I received is as follows, but the content field is empty:
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-12b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

Problem
The model doesn’t seem to generate any meaningful output. The content in the response remains empty, and I’m not sure what’s going wrong. Has anyone encountered a similar issue with vLLM or this specific model? Could it be related to the model configuration, GPU setup, or something else?

Any suggestions or troubleshooting tips would be greatly appreciated. Thanks in advance!

I'm having the same issue with the vLLM V0 async LLM engine running on an A100 40 GB: a blank response ("") is still being output. The cause now seems to be related to using AutoProcessor.apply_chat_template.
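
For context, here is a minimal sketch of the kind of prompt construction being referred to, assuming the standard Transformers AutoProcessor API; the checkpoint path and message content are illustrative, not taken from that setup:

from transformers import AutoProcessor

# Hypothetical local checkpoint path, mirroring the one in the serve command above.
processor = AutoProcessor.from_pretrained("/data/gemma/gemma-3-12b-it")

# Gemma 3 chat messages in the multimodal content format.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Hello"}]},
]

# Render the chat template to a prompt string (tokenize=False returns plain text).
prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)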

Hey, I just solved this.

You set --dtype=float16, but Gemma is trained in bfloat16. Don't explicitly set the dtype and your command will work fine.
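
For reference, a minimal sketch of the adjusted serve command with the --dtype flag dropped so vLLM takes the dtype from the model config; all other flags are assumed unchanged from the original post:

vllm serve /data/gemma/gemma-3-12b-it \
  --served-model-name gemma-3-12b-it \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 2 \
  --max-model-len 3000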

What specific steps are needed to solve this problem? Please provide guidance, thank you.

twodaix changed discussion status to closed