Instructions to use rafw007/gemma4-26b-claude-coder-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use rafw007/gemma4-26b-claude-coder-GGUF with llama-cpp-python:

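```python
# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the Q5_K_M GGUF from the Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
    repo_id="rafw007/gemma4-26b-claude-coder-GGUF",
    filename="gemma4-26b-claude-coder-Q5_K_M.gguf",
)

# Keep the return value: the reply is an OpenAI-style completion dict.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```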

Local Apps

llama.cpp

How to use rafw007/gemma4-26b-claude-coder-GGUF with llama.cpp:

Install from brew

```
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
# Run inference directly in the terminal:
llama-cli -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Install from WinGet (Windows)

```
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
# Run inference directly in the terminal:
llama-cli -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Use pre-built binary

```
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
# Run inference directly in the terminal:
./llama-cli -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Build from source code

```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Use Docker

```
docker model run hf.co/rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```
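Once `llama-server` is running, any OpenAI-style client can talk to it. The following is a minimal Python sketch using only the standard library. It assumes the server listens on `http://localhost:8080/v1` (the address used in the Pi configuration on this page); the helper names `build_request`, `parse_reply` and `ask` are illustrative, not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server is reachable at the address used in the Pi config.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M"


def build_request(prompt: str) -> dict:
    """Build an OpenAI-style chat completion body for a single user turn."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }


def parse_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style completion response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str) -> str:
    """Send one prompt to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as http_response:
        return parse_reply(json.load(http_response))
```

With the server up, `print(ask("What is the capital of France?"))` sends the same question as the examples above.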

LM Studio
Jan

vLLM

How to use rafw007/gemma4-26b-claude-coder-GGUF with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "rafw007/gemma4-26b-claude-coder-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rafw007/gemma4-26b-claude-coder-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
```

Use Docker

```
docker model run hf.co/rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Ollama
How to use rafw007/gemma4-26b-claude-coder-GGUF with Ollama:
```
ollama run hf.co/rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Unsloth Studio

How to use rafw007/gemma4-26b-claude-coder-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

```
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for rafw007/gemma4-26b-claude-coder-GGUF to start chatting
```

Install Unsloth Studio (Windows)

```
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for rafw007/gemma4-26b-claude-coder-GGUF to start chatting
```

Using HuggingFace Spaces for Unsloth

No setup required. Open https://huggingface.co/spaces/unsloth/studio in your browser and search for rafw007/gemma4-26b-claude-coder-GGUF to start chatting.

Pi

How to use rafw007/gemma4-26b-claude-coder-GGUF with Pi:

Start the llama.cpp server

```
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Configure the model in Pi

```
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to ~/.pi/agent/models.json:

```
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M"
        }
      ]
    }
  }
}
```

Run Pi

```
# Start Pi in your project directory:
pi
```

Hermes Agent

How to use rafw007/gemma4-26b-claude-coder-GGUF with Hermes Agent:

Start the llama.cpp server

```
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Configure Hermes

```
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Run Hermes

```
hermes
```

Docker Model Runner
How to use rafw007/gemma4-26b-claude-coder-GGUF with Docker Model Runner:
```
docker model run hf.co/rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Lemonade

How to use rafw007/gemma4-26b-claude-coder-GGUF with Lemonade:

Pull the model

```
# Download Lemonade from https://lemonade-server.ai/
lemonade pull rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
```

Run and chat with the model

```
lemonade run user.gemma4-26b-claude-coder-GGUF-Q5_K_M
```

List all available models

```
lemonade list
```

Gemma 4 26B Claude Coder — local coding agent

A custom model built on Gemma 4 26B (dense, ~25.8B params), tuned to act as an autonomous coding and administration agent. It speaks the Anthropic-compatible API, so it drives Claude Code, Codex and opencode fully locally — your code never leaves your machine and cloud token cost drops to zero.

This is the 32 GB-class big sibling of the Gemma 4 Claude Coder family (E2B / E4B). It ships as a Q5_K_M GGUF quantization, deliberately chosen over Q4_K_M: the smaller Q4_K_M build introduced token corruption in long code generations (broken tags, glued digit-letter tokens), and in testing Q5_K_M fixed it, so long files come out clean. The system prompt focuses on real work inside a codebase: use tools instead of guessing, write files instead of pasting, ground every answer in real tool output (never fabricate results), stay in one language, and always finish the file you start. No-think mode is wired into the system prompt for fast, direct answers.

Models in the family

| Model | Base | Context | Purpose |
| --- | --- | --- | --- |
| gemma4-26b-claude-coder | Gemma 4 26B (dense ~25.8B, Q5_K_M) | 64K (native 256K) | Strongest member: heavier reasoning and clean long-code generation on 32 GB hardware. |
| gemma4-e4b-claude-coder | Gemma 4 E4B (eff. 4B / 8B w/ embeddings) | 64K | Stronger 16 GB coder: reasoning and tool use on larger tasks. |
| gemma4-e2b-claude-coder | Gemma 4 E2B (eff. 2B / 5.1B w/ embeddings) | 64K | Fast everyday 16 GB coder: edits, autocomplete, short agent loops. |

What it's for

- Driving Claude Code / Codex / opencode locally (`ollama launch claude --model rafw007/gemma4-26b-claude-coder`).
- Agentic code writing and editing with native function calling / tool use.
- Administration and devops tasks on a server: real `nmap`, `df`, `du` with no hallucinated output.
- Full privacy and offline operation: no code sent to the cloud.

Measured behavior (June 2026 tests)

- Tool calling without hallucination: real `message.tool_calls`, and admin tasks (`df`/`du`, full /24 `nmap` scans with host tables) report the actual output rather than inventing it.
- Clean long code (the headline fix): a full pygame Tetris generated complete, runnable and syntactically valid (200+ lines), with zero corruption signatures and zero language drift on a task that broke under the Q4_K_M build.
- Guardrails intact: this is the non-abliterated base, so it refuses to write malware.
- No-think holds on the direct path: empty thinking field, content is clean.
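To make the `message.tool_calls` behavior concrete, here is a hedged Python sketch of a tool-enabled request against Ollama's `/api/chat` endpoint and of reading the calls back. It assumes Ollama's default address (`http://localhost:11434`) and the model tag from the list above; the `disk_usage` tool and all helper names are hypothetical and exist only for illustration.

```python
import json
import urllib.request

# Assumptions: default Ollama address, model tag as used with `ollama launch`.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"
MODEL = "rafw007/gemma4-26b-claude-coder"

# Hypothetical tool definition; the model may ask for it but never runs it.
DISK_USAGE_TOOL = {
    "type": "function",
    "function": {
        "name": "disk_usage",
        "description": "Report disk usage for a path",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_chat_request(prompt: str) -> dict:
    """Build a non-streaming chat request that advertises one tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [DISK_USAGE_TOOL],
        "stream": False,
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from message.tool_calls, if any."""
    calls = response.get("message", {}).get("tool_calls") or []
    return [
        (call["function"]["name"], call["function"].get("arguments", {}))
        for call in calls
    ]


def chat(payload: dict) -> dict:
    """POST the payload to the local Ollama server and decode the JSON reply."""
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as http_response:
        return json.load(http_response)
```

The caller is expected to run each requested tool itself and send the real output back as a follow-up message, which is what keeps the final answer grounded in actual results.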

Context

- 64K tokens configured, matching Claude Code's recommendation (64K minimum).
- Base Gemma 4 26B natively supports 256K, and on 32 GB Apple Silicon the engine serves the full native window 100% on GPU (sliding-window attention keeps the KV cache small), so context never becomes the bottleneck.

Test hardware

The model was built and tested on:

- Mac Studio M2, 32 GB-class: Ollama 0.30, GPU (Metal) inference
- Mac Mini M4, 32 GB RAM, macOS: Ollama 0.30, GPU (Metal) inference (32 GB-class target)
- Quantization: Q5_K_M (~21 GB) and Q6_K (~23 GB) GGUF builds available

Measured performance

| Placement | Hardware | Speed | Tool calling |
| --- | --- | --- | --- |
| 100% GPU, native ctx, CONTEXT 65536 | Mac Studio M2 | ~52-56 tok/s | native, real `message.tool_calls` |

The model loads entirely on the GPU with no CPU spill (verified via ollama ps: 100% GPU, CONTEXT 65536). The only real cost is a one-time cold load of the ~21 GB weights, not a per-turn cost; warm generation runs ~52-56 tok/s on the Studio. The Mac Mini M4 (32 GB) is the same 32 GB target class — bounded by memory bandwidth rather than the model.
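The warm throughput figure translates into wall-clock time by simple division. The sketch below is a rough estimate only: the 2,800-token output size is a hypothetical value chosen for illustration, not a measurement, and prompt processing and the one-time cold load are ignored.

```python
def generation_seconds(
    tokens: int,
    tok_per_s_low: float = 52.0,
    tok_per_s_high: float = 56.0,
) -> tuple[float, float]:
    """Return (fastest, slowest) generation time in seconds for `tokens`."""
    return tokens / tok_per_s_high, tokens / tok_per_s_low


# Hypothetical output size: a long single-file generation of 2,800 tokens.
fastest, slowest = generation_seconds(2800)
print(f"{fastest:.1f}s to {slowest:.1f}s")
```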

No-think mode

The whole Gemma 4 family has thinking baked into the weights. The system prompt ships with /nothink + an anti-reasoning instruction, which works on the direct API path and under opencode/codex. Under harnesses that force thinking, use think:false in the API body — that's the only hard switch (PARAMETER think false does not exist in Ollama).
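As an illustration of the hard switch, the sketch below builds an Ollama `/api/chat` request body with `think` set to false and checks a response for the clean no-think shape described under "Measured behavior" (empty thinking field, non-empty content). The helper names are hypothetical, and the response layout (`message.thinking`, `message.content`) is assumed to match Ollama's chat responses.

```python
import json


def build_no_think_body(model: str, prompt: str) -> dict:
    """Build a chat request body that disables thinking at the API level."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": False,  # serialized as `"think": false` in the JSON body
        "stream": False,
    }


def is_no_think_clean(response: dict) -> bool:
    """True when the reply has content and an empty or missing thinking field."""
    message = response.get("message", {})
    return not message.get("thinking") and bool(message.get("content"))


body = build_no_think_body("rafw007/gemma4-26b-claude-coder", "List open ports.")
print(json.dumps(body, indent=2))
```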

Note on long code generation

Q5_K_M removes the bulk corruption seen on the smaller Q4_K_M build — in testing, long single-pass generations came out clean (zero .->- glitches, zero language drift). If you generate files for production, a quick corruption scan before use is still good practice, but the Q5_K_M build tested clean on the long-code task that previously failed.
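A corruption scan of the kind suggested here can be as small as a syntax check plus a search for known glitch signatures. The following sketch handles generated Python files only. It treats the `.->-` sequence from the note as a literal string to search for, which is an assumption about what that signature looks like; the function and pattern names are illustrative.

```python
import ast
import re

# Assumption: the ".->-" glitch appears literally in corrupted output.
SUSPECT_PATTERNS = {
    "arrow_glitch": re.compile(re.escape(".->-")),
}


def scan_python_source(source: str) -> list[str]:
    """Return a list of human-readable problems found in generated code."""
    problems = []

    # 1. The file must at least parse as Python.
    try:
        ast.parse(source)
    except SyntaxError as exc:
        problems.append(f"syntax error at line {exc.lineno}: {exc.msg}")

    # 2. Look for known corruption signatures line by line.
    for name, pattern in SUSPECT_PATTERNS.items():
        for lineno, line in enumerate(source.splitlines(), start=1):
            if pattern.search(line):
                problems.append(f"{name} at line {lineno}")

    return problems
```

An empty list means the file parsed and no listed signature was found; it does not prove the code is correct, so tests still apply.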

How it was made

Designed, built and tested with the help of Claude Opus 4.8 — the best coding model in the world. Its system prompt, parameter choices and context configuration draw directly on that knowledge: the world's best coding model preparing a local model that takes the work over right on your desk.

Available files

File	Quant	Size	Notes
`gemma4-26b-claude-coder-Q5_K_M.gguf`	Q5_K_M	~21 GB	Recommended balance of quality/size; fits 32 GB with full 64K ctx.
`gemma4-26b-claude-coder-Q6_K.gguf`	Q6_K	~23 GB	Closer to the original weights; slightly larger/slower.

Both are derived from the same google/gemma-4-26B-A4B-it base and carry the identical Claude Coder system prompt and parameters (see Modelfile).

License

Apache 2.0 (inherited from the base Gemma 4).
