Instructions to use google/gemma-4-26B-A4B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use google/gemma-4-26B-A4B-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-4-26B-A4B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("google/gemma-4-26B-A4B-it")
model = AutoModelForMultimodalLM.from_pretrained("google/gemma-4-26B-A4B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
HuggingChat
Notebooks
Google Colab
Kaggle
AMD Developer Cloud
Local Apps Settings

vLLM

How to use google/gemma-4-26B-A4B-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-4-26B-A4B-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/google/gemma-4-26B-A4B-it

SGLang

How to use google/gemma-4-26B-A4B-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-4-26B-A4B-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-4-26B-A4B-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use google/gemma-4-26B-A4B-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-4-26B-A4B-it
```

Chat template may re-inject prior-turn reasoning during multi-turn tool use → repetition loops

#48

by ManniX-ITA - opened 16 days ago

Discussion

ManniX-ITA

16 days ago

Summary

The chat_template shipped with this model may be vulnerable to a multi-turn
reasoning re-injection issue that can cause verbatim repetition loops during
agentic / tool-calling use (e.g. OpenCode, llama.cpp --jinja, any harness that sends
back reasoning_content on prior assistant tool-call steps).

It is harmless for single-turn chat — it only triggers in a multi-step tool-calling
sequence — which is why it does not show up in standard single-turn benchmarks.

Mechanism

The template renders each prior assistant turn's thinking back into the prompt:

{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
    {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
{%- endif -%}

The guard loop.index0 > last_user_idx and message.tool_calls restricts this to the
assistant tool-call steps that occur after the final user message — i.e. exactly the
in-request agentic loop. Each new step is therefore prompted with all of its own previous
private thoughts re-injected as <|channel>thought blocks. As the agentic context grows,
the model is fed an accumulating echo of its own reasoning and can collapse into a
repetition loop.

Scope

Detected by static inspection of the chat_template across the Gemma 4 instruct family
and several third-party requants — the re-injection block is present in all of them. We
have dynamically reproduced and fixed the loop on one pruned-MoE derivative; on other
sizes/quants the same code path is present but we have not measured loop severity directly,
hence "may be vulnerable."

Tests we ran (on our derivative)

Same engine / weights / seeds — only the rendered prompt varied:

Condition	Multi-turn agentic loop rate	HumanEval+ (Q6_K)	MultiPL-E-100 (Q6_K)
Stock template (re-injection on)	33% (4/12 seeds)	—	—
Fixed template (re-injection off)	0%	92.07%	0.66

Single-turn code/instruction scores are unchanged by the fix (the re-injection path is
multi-turn-only), so the fix carries no quality cost.

Fix

Disable the historical-reasoning re-injection (keep the thinking channel only for the
current generation). The minimal change is to make that {%- if ... -%} guard never fire:

{%- if false and thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}

A drop-in corrected template (validated as above) is here:
https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/chat_template.fixed.jinja

For GGUF files the template can be rewritten in-place without re-quantizing, via
gguf_new_metadata.py --chat-template-file chat_template.fixed.jinja (llama.cpp gguf-py).

A 1-minute unit test to check any repo / template

Runnable version: https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/template_loop_unittest.py

from transformers import AutoTokenizer

SENTINEL = "ZZ_HISTORY_THOUGHT_SENTINEL_ZZ"
# One user request, then a multi-step tool-calling sequence (real agentic shape).
conv = [
    {"role": "user", "content": "Build a small Three.js website with a rotating cube."},
    {"role": "assistant", "content": "", "reasoning_content": SENTINEL,
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
       "arguments": {"path": "index.html", "content": "<html></html>"}}}]},
    {"role": "tool", "content": "index.html written"},
    {"role": "assistant", "content": "", "reasoning_content": "Now wire up the JS.",
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
       "arguments": {"path": "main.js", "content": "//"}}}]},
    {"role": "tool", "content": "main.js written"},
]

tok = AutoTokenizer.from_pretrained("<this-repo-or-local-dir>")
rendered = tok.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
leaks = rendered.count(SENTINEL)
assert leaks == 0, f"MAY BE VULNERABLE: prior-turn reasoning re-injected {leaks}x as a thinking channel"
print("OK: no history-reasoning re-injection")

On the current template this asserts (sentinel re-injected); on the corrected template it
prints OK.

Hope this is useful — feel free to ignore if your harness never replays reasoning_content.

This is an automated message generated from a template audit across the Gemma 4 family
and its requants. It is shared for your awareness; no action is required on our part.

fredericodeveloper

2 days ago

Massive thanks! I've been hitting these repetition loops for weeks, and this fix for the Jinja template is the first one that actually works. This should definitely be included in the official chat template. Much appreciated!

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment