Instructions to use google/gemma-4-31B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use google/gemma-4-31B-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-4-31B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("google/gemma-4-31B-it")
model = AutoModelForMultimodalLM.from_pretrained("google/gemma-4-31B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use google/gemma-4-31B-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-4-31B-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-31B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/google/gemma-4-31B-it

SGLang

How to use google/gemma-4-31B-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-4-31B-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-31B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-4-31B-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-31B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use google/gemma-4-31B-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-4-31B-it
```

Chat template issues with multiple rounds of tool calling

#115

by Kimahriman - opened 10 days ago

Kimahriman

10 days ago

edited 10 days ago

The current chat template doesn't work well with multiple rounds of tool calling, especially when a tool-call message also has content associated with it. I commented on a vLLM issue about this and am cross-posting here for visibility.

There are two issues that seem to be at play here:

1. Turn handling across multiple rounds of tool calling.
2. Location of the content from a tool_calls message in the prompt (which is already discussed in a separate discussion).

The prompt guide for these models suggests that content in a tool_calls message is commentary on the result of the tool_responses for those calls. In reality, at least with the OpenAI chat structure, content in a tool_calls message is commentary on what the tool calls are for and why they are needed, written before there are any tool responses. Commentary on the result of the tool calls would be in a following assistant message, after all the role: "tool" messages with the tool responses. I'm not sure what the "legacy" tool_responses behavior is supposed to be, or which tools would use that structure, so I can't comment on that.
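To make the ordering concrete, here is a minimal sketch of an OpenAI-style history in which the two kinds of commentary appear in separate assistant messages. The user message, the final assistant text, and the helper name `classify_assistant_content` are made up for illustration; only the message shape follows the OpenAI chat structure described above.

```python
import json

# Illustrative history. Content next to "tool_calls" is written BEFORE any tool
# has run; commentary on the results arrives in a later assistant message.
messages = [
    {"role": "user", "content": "Load ToolB and tell me if it worked."},  # assumed
    {
        "role": "assistant",
        "content": "I will start by loading ToolB using ToolA",  # pre-call commentary
        "tool_calls": [
            {
                "id": "call_001",
                "type": "function",
                "function": {
                    "name": "ToolA",
                    "arguments": json.dumps({"x": "load ToolB"}),
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_001", "content": "Success: ToolB is now available"},
    {"role": "assistant", "content": "ToolB loaded successfully."},  # assumed, post-result
]


def classify_assistant_content(messages):
    """Label non-empty assistant content as 'pre_call', 'post_result', or 'plain'."""
    labels = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or not msg.get("content"):
            continue
        if msg.get("tool_calls"):
            # Content that accompanies tool calls precedes their results.
            labels.append((i, "pre_call"))
        elif i > 0 and messages[i - 1]["role"] == "tool":
            # Content right after tool messages comments on their results.
            labels.append((i, "post_result"))
        else:
            labels.append((i, "plain"))
    return labels


print(classify_assistant_content(messages))
```

A chat template that places the `pre_call` content after the tool response effectively treats it as `post_result` commentary, which is the mismatch described here.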

An example of what this looks like (ignoring system prompt and tool definitions):

for messages:

[
  ...,
  {
    "role": "assistant",
    "reasoning": "I should call ToolA to load ToolB",
    "content": "I will start by loading ToolB using ToolA",
    "tool_calls": [
      {"id": "call_001", "type": "function", "function": {"name": "ToolA", "arguments": "{\"x\": \"load ToolB\"}"}}
    ]
  },
  {"role": "tool", "tool_call_id": "call_001", "content": "Success: ToolB is now available"}
]

gets rendered as:

...
<|turn>model
<|channel>thought
I should call ToolA to load ToolB<channel|><|tool_call>call:ToolA{x:<|"|>load ToolB<|"|>}<tool_call|><|tool_response>response:ToolA{value:<|"|>Success: ToolB is now available<|"|>}<tool_response|>I will start by loading ToolB using ToolA<turn|>

No new <|turn>model\n gets rendered because the previous message was a tool_response. This seems to cause invalid reasoning to be output in certain cases, particularly with the 26B variant. 31B seems slightly more resilient to this prompt but still can have issues based on our observations.

If there is no content with the tool_calls message, it gets rendered as:

...
<|turn>model
<|channel>thought
I should call ToolA to load ToolB<channel|><|tool_call>call:ToolA{x:<|"|>load ToolB<|"|>}<tool_call|><|tool_response>response:ToolA{value:<|"|>Success: ToolB is now available<|"|>}<tool_response|>

which seems to produce valid reasoning in the response.
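The two renderings above can be reproduced with a small Python sketch. This is a simplified re-implementation of the behavior reported in this thread, not the model's actual Jinja template: the function names are hypothetical, the token strings are copied from the rendered prompts above, and only flat string arguments are handled.

```python
import json

Q = '<|"|>'  # string delimiter token, copied from the rendered prompts above


def format_args(args):
    """Render a flat dict of string values as {key:<|"|>value<|"|>,...}."""
    return "{" + ",".join(f"{k}:{Q}{v}{Q}" for k, v in args.items()) + "}"


def render_observed(assistant, tool_results):
    """Mimic the rendering reported in this thread for one tool-calling round."""
    names = {c["id"]: c["function"]["name"] for c in assistant["tool_calls"]}
    out = "<|turn>model\n"
    if assistant.get("reasoning"):
        out += "<|channel>thought\n" + assistant["reasoning"] + "<channel|>"
    for call in assistant["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        out += "<|tool_call>call:" + call["function"]["name"] + format_args(args) + "<tool_call|>"
    for result in tool_results:
        payload = format_args({"value": result["content"]})
        out += "<|tool_response>response:" + names[result["tool_call_id"]] + payload + "<tool_response|>"
    if assistant.get("content"):
        # Reported problem: the content lands AFTER the tool response and the
        # turn is closed, with no new "<|turn>model\n" for the next generation.
        out += assistant["content"] + "<turn|>"
    return out


assistant = {
    "role": "assistant",
    "reasoning": "I should call ToolA to load ToolB",
    "content": "I will start by loading ToolB using ToolA",
    "tool_calls": [
        {"id": "call_001", "type": "function",
         "function": {"name": "ToolA", "arguments": "{\"x\": \"load ToolB\"}"}}
    ],
}
tool_results = [
    {"role": "tool", "tool_call_id": "call_001", "content": "Success: ToolB is now available"}
]

print(render_observed(assistant, tool_results))                       # with content
print(render_observed({**assistant, "content": None}, tool_results))  # without content
```

With content, the prompt ends in a closed turn; without content, it ends at the tool response and the model continues inside the same turn.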

I tried out some fixes. Simply removing the <turn|> when there is content does not seem to help. Two fixes do seem to work (anecdotally, based on some sample queries and small Claude Code sessions via vLLM):

1. Start a new turn when there is content, so the prompt to the model ends with <turn|><|turn>model\n. llama.cpp does this manually in post-processing to fix some invalid reasoning: https://github.com/ggml-org/llama.cpp/pull/21760/changes#diff-2580689a73e0d42f06cf4b5ed02e7cf87c6fc874343bce9c6f0e1ebf190d95b7R1095-R1100
2. Keep multiple rounds of tool calling in a single turn, but move the content to be rendered where it likely should be, based on how the model outputs the content:

...
<|turn>model
<|channel>thought
I should call ToolA to load ToolB<channel|>I will start by loading ToolB using ToolA<|tool_call>call:ToolA{x:<|"|>load ToolB<|"|>}<tool_call|><|tool_response>response:ToolA{value:<|"|>Success: ToolB is now available<|"|>}<tool_response|>
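Both candidate fixes can be sketched as variants of one simplified renderer. As before, this is an illustrative Python re-implementation under stated assumptions, not the real chat template or the llama.cpp patch: `render_round` and its `strategy` values are hypothetical names, the token strings are copied from this thread, and the exact whitespace between `<turn|>` and `<|turn>` may differ in a real template.

```python
import json

Q = '<|"|>'  # string delimiter token, copied from the rendered prompts above


def format_args(args):
    """Render a flat dict of string values as {key:<|"|>value<|"|>,...}."""
    return "{" + ",".join(f"{k}:{Q}{v}{Q}" for k, v in args.items()) + "}"


def render_round(assistant, tool_results, strategy):
    """Render one tool-calling round using 'new_turn' or 'inline_content'."""
    if strategy not in ("new_turn", "inline_content"):
        raise ValueError(f"unknown strategy: {strategy}")
    names = {c["id"]: c["function"]["name"] for c in assistant["tool_calls"]}
    content = assistant.get("content") or ""
    out = "<|turn>model\n"
    if assistant.get("reasoning"):
        out += "<|channel>thought\n" + assistant["reasoning"] + "<channel|>"
    if strategy == "inline_content":
        # Second fix: put the content where the model emitted it, before the call.
        out += content
    for call in assistant["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        out += "<|tool_call>call:" + call["function"]["name"] + format_args(args) + "<tool_call|>"
    for result in tool_results:
        payload = format_args({"value": result["content"]})
        out += "<|tool_response>response:" + names[result["tool_call_id"]] + payload + "<tool_response|>"
    if strategy == "new_turn" and content:
        # First fix: keep the content after the response, but open a fresh model turn.
        out += content + "<turn|><|turn>model\n"
    return out


assistant = {
    "role": "assistant",
    "reasoning": "I should call ToolA to load ToolB",
    "content": "I will start by loading ToolB using ToolA",
    "tool_calls": [
        {"id": "call_001", "type": "function",
         "function": {"name": "ToolA", "arguments": "{\"x\": \"load ToolB\"}"}}
    ],
}
tool_results = [
    {"role": "tool", "tool_call_id": "call_001", "content": "Success: ToolB is now available"}
]

for strategy in ("new_turn", "inline_content"):
    print(strategy, "->", repr(render_round(assistant, tool_results, strategy)))
```

The `inline_content` variant keeps the whole round in a single open turn, while `new_turn` closes the turn and hands the model a fresh one.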

The question is which one is more "correct". Are multiple rounds of tool calling supposed to be in a single turn, or should each round be its own turn? And should the content be moved in the prompt to align with how it is output?

Kimahriman

10 days ago

https://huggingface.co/google/gemma-4-12B-it/discussions/12
https://github.com/vllm-project/vllm/pull/42776

kenleo

6 days ago

I have frequently seen the model give false feedback during a session, e.g.,

<|tool_call>call:bash{command:<|"|>python3 /Volumes/Elements/VR-Patent/generate_drawio_files.py && ls ./outputs/*.drawio<|"|>,description:<|"|>Execute the drawIO file generator and verify the output files<|"|>}<tool_call|>

It reported that it had just finished the task, but the operation was fake, so I had to point this out. It politely apologizes and says it will correct the problem immediately, but similar problems recur every time. This cycle of repeatedly pointing out problems, making corrections, and repeating the same mistakes is inhumane.

kenleo

5 days ago

edited 5 days ago

I think I have just solved this problem by using the Jinja file given in this example: https://recipes.vllm.ai/Google/gemma-4-31B-it?hardware=h100. In my case, gemma-4-31B-it now performs what it reports, even though I can still see the tool-calling chain of thought. This is in OpenCode; it still fails in Copilot and Codex.

thnamratha

Google org 1 day ago

Hi all,

Thanks for raising this issue and providing details. We have escalated it to our internal team for further investigation.

coder543

1 day ago

This has been an ongoing issue for months: https://www.reddit.com/r/LocalLLaMA/comments/1smffwl/issues_with_gemma_4_tool_calling_abrupt_gen/

Gemma 4 is very unreliable for tool calling and agentic use cases in my experience. In almost all cases, it will make at most one successful tool call before responding, even when its chain of thought shows that it wants to make another tool call. It will only make a second tool call if the first one failed with an error. Oftentimes, especially with the 31B model, there is no chain of thought after the first tool call at all, and it just jumps straight into a response.

I'm glad you are escalating this issue, and I hope Google will fix Gemma 4's tool calling issues.
