Chat template may re-inject prior-turn reasoning during multi-turn tool use → repetition loops
Summary
The chat_template shipped with this model may be vulnerable to a multi-turn
reasoning re-injection issue that can cause verbatim repetition loops during
agentic / tool-calling use (e.g. OpenCode, llama.cpp --jinja, any harness that sends
back reasoning_content on prior assistant tool-call steps).
It is harmless for single-turn chat — it only triggers in a multi-step tool-calling
sequence — which is why it does not show up in standard single-turn benchmarks.
Mechanism
The template renders each prior assistant turn's thinking back into the prompt:
{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
{{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
{%- endif -%}
The guard loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') restricts this to the
assistant tool-call steps that occur after the final user message, i.e. exactly the
in-request agentic loop. Each new step is therefore prompted with all of the model's previous
private thoughts from that loop, re-injected as <|channel>thought blocks. As the agentic context grows,
the model is fed an accumulating echo of its own reasoning and can collapse into a
repetition loop.
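The guard can be mirrored outside Jinja to see which history messages get their reasoning replayed. What follows is a minimal Python sketch, not the template itself: `reinjected_indices` and `tool_step` are hypothetical helpers, and the sketch assumes `ns_turn.last_user_idx` holds the index of the final user message, as the guard description suggests.

```python
from typing import Any

Message = dict[str, Any]


def reinjected_indices(messages: list[Message]) -> list[int]:
    """Return the indices of messages whose reasoning the guard would re-render."""
    # Assumption: last_user_idx is the index of the final "user" message (-1 if none).
    last_user_idx = max(
        (i for i, m in enumerate(messages) if m.get("role") == "user"),
        default=-1,
    )
    hits = []
    for i, m in enumerate(messages):
        # Same three conditions as the template guard.
        thinking_text = m.get("reasoning") or m.get("reasoning_content")
        if thinking_text and i > last_user_idx and m.get("tool_calls"):
            hits.append(i)
    return hits


def tool_step(thought: str) -> Message:
    """Build a hypothetical assistant tool-call step that carries private reasoning."""
    return {
        "role": "assistant",
        "content": "",
        "reasoning_content": thought,
        "tool_calls": [{"type": "function",
                        "function": {"name": "write_file", "arguments": {}}}],
    }


# One user request followed by three tool-call steps.
history: list[Message] = [{"role": "user", "content": "Build a small website."}]
for step in range(3):
    history.append(tool_step(f"thought {step}"))
    history.append({"role": "tool", "content": "ok"})
    print(f"after step {step}: replayed thoughts at {reinjected_indices(history)}")
# after step 0: replayed thoughts at [1]
# after step 1: replayed thoughts at [1, 3]
# after step 2: replayed thoughts at [1, 3, 5]
```

The list of replayed thoughts grows by one with every tool-call step, and it empties again only once a new user message moves `last_user_idx` past those steps.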
Scope
Detected by static inspection of the chat_template across the Gemma 4 instruct family
and several third-party requants — the re-injection block is present in all of them. We
have dynamically reproduced and fixed the loop on one pruned-MoE derivative; on other
sizes/quants the same code path is present but we have not measured loop severity directly,
hence "may be vulnerable."
Tests we ran (on our derivative)
Same engine / weights / seeds — only the rendered prompt varied:
| Condition | Multi-turn agentic loop rate | HumanEval+ (Q6_K) | MultiPL-E-100 (Q6_K) |
|---|---|---|---|
| Stock template (re-injection on) | 33% (4/12 seeds) | — | — |
| Fixed template (re-injection off) | 0% | 92.07% | 0.66 |
Single-turn code/instruction scores are unchanged by the fix (the re-injection path is
multi-turn-only), so the fix carries no quality cost.
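The table does not say how a run was classified as looping. As an illustration only, one hypothetical heuristic for "verbatim repetition" is to check whether the output ends with the same chunk repeated several times back to back. The function names, thresholds, and sample strings below are assumptions made for this sketch, not the procedure used for the numbers above.

```python
def has_verbatim_loop(text: str, min_len: int = 20, min_repeats: int = 3) -> bool:
    """True if text ends with a chunk of at least min_len chars repeated min_repeats times."""
    n = len(text)
    # A chunk repeated min_repeats times cannot be longer than n // min_repeats.
    for size in range(min_len, n // min_repeats + 1):
        unit = text[n - size:]
        if text.endswith(unit * min_repeats):
            return True
    return False


def loop_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged as looping by has_verbatim_loop."""
    if not outputs:
        return 0.0
    return sum(has_verbatim_loop(o) for o in outputs) / len(outputs)


# Synthetic strings, not model outputs.
looping = "I will write index.html now. " * 5
clean = "Wrote index.html and main.js; the cube rotates."
runs = [looping] * 4 + [clean] * 8
print(f"loop rate: {loop_rate(runs):.0%}")  # loop rate: 33%
```

A suffix check like this only catches exact repeats at the end of the output; near-verbatim loops with small variations would need a fuzzier comparison.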
Fix
Disable the historical-reasoning re-injection (keep the thinking channel only for the
current generation). The minimal change is to make that {%- if ... -%} guard never fire:
{%- if false and thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
A drop-in corrected template (validated as above) is here:
https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/chat_template.fixed.jinja
For GGUF files the template can be rewritten in-place without re-quantizing, via gguf_new_metadata.py --chat-template-file chat_template.fixed.jinja (llama.cpp gguf-py).
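For a template held as a plain string (for example the contents of a chat_template.jinja file), the one-line change can be scripted. This is a minimal sketch that assumes the guard appears verbatim as quoted in the Mechanism section; `disable_reinjection` is a hypothetical helper name, and a template with different whitespace would need a looser match.

```python
# The guard as quoted from the stock template.
GUARD = (
    "{%- if thinking_text and loop.index0 > ns_turn.last_user_idx "
    "and message.get('tool_calls') -%}"
)
# Same guard with "false and" prepended so the block can never render.
PATCHED = GUARD.replace("{%- if ", "{%- if false and ", 1)


def disable_reinjection(template: str) -> str:
    """Return the template with the history-reasoning guard forced to false."""
    if PATCHED in template:
        return template  # already patched, nothing to do
    if GUARD not in template:
        raise ValueError("re-injection guard not found; inspect the template by hand")
    return template.replace(GUARD, PATCHED)


# A cut-down stand-in for the relevant part of the template.
sample = (
    "{%- set thinking_text = message.get('reasoning') "
    "or message.get('reasoning_content') -%}\n"
    + GUARD + "\n"
    + "{{- '<|channel>thought\\n' + thinking_text + '\\n<channel|>' -}}\n"
    + "{%- endif -%}\n"
)
print(disable_reinjection(sample))
```

With Transformers, the patched string could then be assigned to the tokenizer's `chat_template` attribute before calling `apply_chat_template`. The helper raises instead of silently returning an unpatched template, so a repo whose guard differs from the quoted one is noticed.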
A 1-minute unit test to check any repo / template
Runnable version: https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v7-coder-it-GGUF/blob/main/template_loop_unittest.py
from transformers import AutoTokenizer

SENTINEL = "ZZ_HISTORY_THOUGHT_SENTINEL_ZZ"

# One user request, then a multi-step tool-calling sequence (real agentic shape).
conv = [
    {"role": "user", "content": "Build a small Three.js website with a rotating cube."},
    {"role": "assistant", "content": "", "reasoning_content": SENTINEL,
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
         "arguments": {"path": "index.html", "content": "<html></html>"}}}]},
    {"role": "tool", "content": "index.html written"},
    {"role": "assistant", "content": "", "reasoning_content": "Now wire up the JS.",
     "tool_calls": [{"type": "function", "function": {"name": "write_file",
         "arguments": {"path": "main.js", "content": "//"}}}]},
    {"role": "tool", "content": "main.js written"},
]

tok = AutoTokenizer.from_pretrained("<this-repo-or-local-dir>")
rendered = tok.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
leaks = rendered.count(SENTINEL)
assert leaks == 0, f"MAY BE VULNERABLE: prior-turn reasoning re-injected {leaks}x as a thinking channel"
print("OK: no history-reasoning re-injection")
On the current template the assertion fails (the sentinel is re-injected); on the corrected
template the script prints OK.
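Where editing the template is not an option, a harness could in principle get the same effect on the client side by not replaying old reasoning at all. This is not part of the audit above, only a sketch under that assumption: `strip_history_reasoning` is a hypothetical helper, and whether the resulting prompt matches the fixed template exactly was not verified here.

```python
from typing import Any

# The two keys the template reads reasoning from.
REASONING_KEYS = ("reasoning", "reasoning_content")


def strip_history_reasoning(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return shallow copies of messages with reasoning fields dropped from assistant turns."""
    cleaned = []
    for m in messages:
        if m.get("role") == "assistant":
            cleaned.append({k: v for k, v in m.items() if k not in REASONING_KEYS})
        else:
            cleaned.append(dict(m))
    return cleaned


history = [
    {"role": "user", "content": "Build a small website."},
    {"role": "assistant", "content": "", "reasoning_content": "Start with the HTML.",
     "tool_calls": [{"type": "function",
                     "function": {"name": "write_file", "arguments": {}}}]},
    {"role": "tool", "content": "index.html written"},
]
safe_history = strip_history_reasoning(history)
print(any("reasoning_content" in m for m in safe_history))  # False
```

The tool calls and tool results stay in the history; only the private reasoning fields are removed, and the original list is left untouched.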
Hope this is useful — feel free to ignore if your harness never replays reasoning_content.
This is an automated message generated from a template audit across the Gemma 4 family
and its requants. It is shared for your awareness; no action is required on our part.
Massive thanks! I've been hitting these repetition loops for weeks, and this fix for the Jinja template is the first one that actually works. This should definitely be included in the official chat template. Much appreciated!