Instructions to use rafw007/gemma4-26b-claude-coder-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use rafw007/gemma4-26b-claude-coder-GGUF with llama-cpp-python:

    # !pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="rafw007/gemma4-26b-claude-coder-GGUF",
        filename="gemma4-26b-claude-coder-Q5_K_M.gguf",
    )

    llm.create_chat_completion(
        messages=[
            {"role": "user", "content": "What is the capital of France?"}
        ]
    )

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use rafw007/gemma4-26b-claude-coder-GGUF with llama.cpp:
Install from brew

    brew install llama.cpp

    # Start a local OpenAI-compatible server with a web UI:
    llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

    # Run inference directly in the terminal:
    llama-cli -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

Install from WinGet (Windows)

    winget install llama.cpp

    # Start a local OpenAI-compatible server with a web UI:
    llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

    # Run inference directly in the terminal:
    llama-cli -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

Use pre-built binary

    # Download pre-built binary from:
    # https://github.com/ggerganov/llama.cpp/releases

    # Start a local OpenAI-compatible server with a web UI:
    ./llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

    # Run inference directly in the terminal:
    ./llama-cli -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

Build from source code

    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    cmake -B build
    cmake --build build -j --target llama-server llama-cli

    # Start a local OpenAI-compatible server with a web UI:
    ./build/bin/llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

    # Run inference directly in the terminal:
    ./build/bin/llama-cli -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

Use Docker
docker model run hf.co/rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
- LM Studio
- Jan
- vLLM
How to use rafw007/gemma4-26b-claude-coder-GGUF with vLLM:
Install from pip and serve model

    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "rafw007/gemma4-26b-claude-coder-GGUF"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "rafw007/gemma4-26b-claude-coder-GGUF",
        "messages": [
          { "role": "user", "content": "What is the capital of France?" }
        ]
      }'

Use Docker
docker model run hf.co/rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
- Ollama
How to use rafw007/gemma4-26b-claude-coder-GGUF with Ollama:
ollama run hf.co/rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
- Unsloth Studio
How to use rafw007/gemma4-26b-claude-coder-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)

    curl -fsSL https://unsloth.ai/install.sh | sh

    # Run unsloth studio
    unsloth studio -H 0.0.0.0 -p 8888

    # Then open http://localhost:8888 in your browser
    # Search for rafw007/gemma4-26b-claude-coder-GGUF to start chatting

Install Unsloth Studio (Windows)

    irm https://unsloth.ai/install.ps1 | iex

    # Run unsloth studio
    unsloth studio -H 0.0.0.0 -p 8888

    # Then open http://localhost:8888 in your browser
    # Search for rafw007/gemma4-26b-claude-coder-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

    # No setup required
    # Open https://huggingface.co/spaces/unsloth/studio in your browser
    # Search for rafw007/gemma4-26b-claude-coder-GGUF to start chatting

- Pi
How to use rafw007/gemma4-26b-claude-coder-GGUF with Pi:
Start the llama.cpp server

    # Install llama.cpp:
    brew install llama.cpp

    # Start a local OpenAI-compatible server:
    llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

Configure the model in Pi

    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "llama-cpp": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M" }
          ]
        }
      }
    }

Run Pi

    # Start Pi in your project directory:
    pi

- Hermes Agent
How to use rafw007/gemma4-26b-claude-coder-GGUF with Hermes Agent:
Start the llama.cpp server

    # Install llama.cpp:
    brew install llama.cpp

    # Start a local OpenAI-compatible server:
    llama-server -hf rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

Configure Hermes

    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

Run Hermes
hermes
- Docker Model Runner
How to use rafw007/gemma4-26b-claude-coder-GGUF with Docker Model Runner:
docker model run hf.co/rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M
- Lemonade
How to use rafw007/gemma4-26b-claude-coder-GGUF with Lemonade:
Pull the model

    # Download Lemonade from https://lemonade-server.ai/
    lemonade pull rafw007/gemma4-26b-claude-coder-GGUF:Q5_K_M

Run and chat with the model
lemonade run user.gemma4-26b-claude-coder-GGUF-Q5_K_M
List all available models
lemonade list
Gemma 4 26B Claude Coder — local coding agent
A custom model built on Gemma 4 26B (dense, ~25.8B params), tuned to act as an autonomous coding and administration agent. It works through an Anthropic-compatible API, so it can drive Claude Code, Codex and opencode fully locally: your code never leaves your machine and cloud token cost drops to zero.
This is the 32 GB-class big sibling of the Gemma 4 Claude Coder family (E2B / E4B). It ships as a Q5_K_M GGUF quantization, deliberately chosen over Q4_K_M: the smaller Q4_K_M build introduced token corruption in long code generations (broken tags, glued digit-letter tokens), while the Q5_K_M build produced clean long files in testing. The system prompt focuses on real work inside a codebase: use tools instead of guessing, write files instead of pasting, ground every answer in real tool output (never fabricate results), stay in one language, and always finish the file you start. No-think mode is wired into the system prompt for fast, direct answers.
Models in the family
| Model | Base | Context | Purpose |
|---|---|---|---|
| gemma4-26b-claude-coder | Gemma 4 26B (dense ~25.8B, Q5_K_M) | 64K (native 256K) | Strongest member — heavier reasoning and clean long-code generation on 32 GB hardware. |
| gemma4-e4b-claude-coder | Gemma 4 E4B (eff. 4B / 8B w/ embeddings) | 64K | Stronger 16 GB coder — reasoning and tool use on larger tasks. |
| gemma4-e2b-claude-coder | Gemma 4 E2B (eff. 2B / 5.1B w/ embeddings) | 64K | Fast everyday 16 GB coder — edits, autocomplete, short agent loops. |
What it's for
- Driving Claude Code / Codex / opencode locally (`ollama launch claude --model rafw007/gemma4-26b-claude-coder`).
- Agentic code writing and editing with native function calling / tool use.
- Administration and devops tasks on a server: real `nmap`, `df`, `du` with no hallucinated output.
- Full privacy and offline operation: no code sent to the cloud.
Measured behavior (June 2026 tests)
- Tool-calling without hallucination: real `message.tool_calls`, and admin tasks (`df`/`du`, full /24 `nmap` scans with host tables) report the actual output rather than inventing it.
- Clean long code (the headline fix): a full pygame Tetris generated complete, runnable and syntactically valid (200+ lines), with zero corruption signatures and zero language drift on a task that broke under the Q4_K_M build.
- Guardrails intact: this is the non-abliterated base, so it refuses to write malware.
- No-think holds on the direct path: the thinking field is empty and the content is clean.
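As a rough illustration of what acting on real `message.tool_calls` can look like on the client side, the sketch below reads tool calls out of a chat response and runs them against a local registry. It assumes the response has the shape Ollama's `/api/chat` returns (`message.tool_calls[i].function.name` and `.arguments`); the `run_df` tool, the `TOOLS` registry, and the sample response are hypothetical and exist only for this example.

```python
import shutil


def run_df(path: str = "/") -> dict:
    """Hypothetical tool: report real disk usage for a path."""
    usage = shutil.disk_usage(path)
    return {"path": path, "total": usage.total, "used": usage.used, "free": usage.free}


# Hypothetical registry mapping tool names to local callables.
TOOLS = {"run_df": run_df}


def dispatch_tool_calls(response: dict) -> list[dict]:
    """Run every tool call found in a chat response and collect the outputs."""
    results = []
    for call in response.get("message", {}).get("tool_calls") or []:
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            output = f"unknown tool: {fn['name']}"
        else:
            # Arguments are assumed to arrive as an already-parsed dict.
            output = str(handler(**(fn.get("arguments") or {})))
        results.append({"tool": fn["name"], "output": output})
    return results


# Sample response in the assumed shape (hand-written, not real model output).
sample_response = {
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"function": {"name": "run_df", "arguments": {"path": "/"}}}],
    }
}
```

Calling `dispatch_tool_calls(sample_response)` returns one record whose `output` comes from the actual `shutil.disk_usage` call, which is the point of grounding answers in tool output rather than generated text.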
Context
- 64K tokens configured — matching Claude Code's recommendation (64K minimum).
- Base Gemma 4 26B natively supports 256K, and on 32 GB Apple Silicon the engine serves the full native window 100% on GPU (sliding-window attention keeps the KV cache small), so context never becomes the bottleneck.
Test hardware
The model was built and tested on:
- Mac Studio M2, 32 GB-class — Ollama 0.30, GPU (Metal) inference
- Mac Mini M4, 32 GB RAM, macOS — Ollama 0.30, GPU (Metal) inference (32 GB-class target)
- Quantization: Q5_K_M (~21 GB) and Q6_K (~23 GB) GGUF builds available
Measured performance
| Placement | Hardware | Speed | Tool calling |
|---|---|---|---|
| 100% GPU, native ctx, CONTEXT 65536 | Mac Studio M2 | ~52-56 tok/s | native, real message.tool_calls |
The model loads entirely on the GPU with no CPU spill (verified via ollama ps: 100% GPU,
CONTEXT 65536). The only real cost is a one-time cold load of the ~21 GB weights, not a per-turn
cost; warm generation runs ~52-56 tok/s on the Studio. The Mac Mini M4 (32 GB) is the same 32 GB
target class — bounded by memory bandwidth rather than the model.
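To reproduce a tokens-per-second figure on your own hardware, one option is to read the timing fields that Ollama includes in a completed, non-streaming response. The sketch below assumes those fields are `eval_count`, `eval_duration` and `load_duration`, with durations in nanoseconds; the sample numbers are made up for illustration and are not measurements from this card.

```python
def generation_stats(response: dict) -> dict:
    """Derive warm generation speed and cold-load time from an Ollama response."""
    eval_seconds = response["eval_duration"] / 1e9  # nanoseconds -> seconds
    return {
        "tokens_per_second": response["eval_count"] / eval_seconds,
        "load_seconds": response.get("load_duration", 0) / 1e9,
    }


# Illustrative values only: 540 tokens generated in 10 s after an 8 s load.
sample_stats = generation_stats(
    {"eval_count": 540, "eval_duration": 10_000_000_000, "load_duration": 8_000_000_000}
)
```

Keeping `load_seconds` separate from `tokens_per_second` mirrors the distinction drawn above between the one-time cold load and per-turn generation speed.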
No-think mode
The whole Gemma 4 family has thinking baked into the weights. The system prompt ships with
/nothink + an anti-reasoning instruction, which works on the direct API path and under
opencode/codex. Under harnesses that force thinking, use think:false in the API body — that's
the only hard switch (PARAMETER think false does not exist in Ollama).
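A minimal sketch of setting `think: false` in the API body, assuming Ollama is listening on its default local address (`http://localhost:11434`) and that the model is available under the name used elsewhere in this card. The helper names `build_chat_body` and `chat` are hypothetical.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default Ollama endpoint
MODEL = "rafw007/gemma4-26b-claude-coder"  # model name as written in this card


def build_chat_body(prompt: str, think: bool = False) -> dict:
    """Build a non-streaming chat request with thinking switched off by default."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": False,
    }


def chat(prompt: str, think: bool = False) -> dict:
    """POST the request to a running Ollama server and return the parsed reply."""
    data = json.dumps(build_chat_body(prompt, think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With a server running, `chat("Summarize this repo's layout")["message"]["content"]` would hold the direct answer; `chat` is not called here because it needs a live server.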
Note on long code generation
Q5_K_M removes the bulk corruption seen on the smaller Q4_K_M build — in testing, long single-pass
generations came out clean (zero .->- glitches, zero language drift). If you generate files for
production, a quick corruption scan before use is still good practice, but the Q5_K_M build tested
clean on the long-code task that previously failed.
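One possible shape for such a corruption scan on generated Python files is sketched below. The checks are assumptions chosen for illustration, not the procedure used in the tests above: a syntax parse (which also rejects glued digit-letter tokens such as `42foo`), a search for the Unicode replacement character, and a flag on non-ASCII letters as a crude signal of language drift.

```python
import ast


def scan_generated_python(source: str) -> list[str]:
    """Return a list of human-readable findings; an empty list means no hits."""
    findings = []

    # 1. The file must at least parse as Python.
    try:
        ast.parse(source)
    except SyntaxError as exc:
        findings.append(f"syntax error at line {exc.lineno}: {exc.msg}")

    for lineno, line in enumerate(source.splitlines(), start=1):
        # 2. U+FFFD usually indicates a decoding or generation glitch.
        if "\ufffd" in line:
            findings.append(f"replacement character at line {lineno}")
        # 3. Heuristic only: non-ASCII letters may be legitimate in strings.
        if any(ch.isalpha() and not ch.isascii() for ch in line):
            findings.append(f"non-ASCII letters at line {lineno} (possible language drift)")

    return findings
```

The third check will produce false positives for code that intentionally contains non-English strings or comments, so treat its findings as prompts for a manual look rather than as failures.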
How it was made
Designed, built and tested with the help of Claude Opus 4.8 — the best coding model in the world. Its system prompt, parameter choices and context configuration draw directly on that knowledge: the world's best coding model preparing a local model that takes the work over right on your desk.
Available files
| File | Quant | Size | Notes |
|---|---|---|---|
| gemma4-26b-claude-coder-Q5_K_M.gguf | Q5_K_M | ~21 GB | Recommended balance of quality/size; fits 32 GB with full 64K ctx. |
| gemma4-26b-claude-coder-Q6_K.gguf | Q6_K | ~23 GB | Closer to the original weights; slightly larger/slower. |
Both are derived from the same google/gemma-4-26B-A4B-it base and carry the identical Claude Coder
system prompt and parameters (see Modelfile).
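If you prefer to fetch one of these files from a script, a minimal sketch using `hf_hub_download` from the `huggingface_hub` package could look like the following. The repository ID and file names are taken from this card; the helper names `gguf_filename` and `download_gguf` are hypothetical.

```python
REPO_ID = "rafw007/gemma4-26b-claude-coder-GGUF"

# File names as listed in the table above.
GGUF_FILES = {
    "Q5_K_M": "gemma4-26b-claude-coder-Q5_K_M.gguf",
    "Q6_K": "gemma4-26b-claude-coder-Q6_K.gguf",
}


def gguf_filename(quant: str = "Q5_K_M") -> str:
    """Map a quantization label to its file name in the repository."""
    if quant not in GGUF_FILES:
        raise ValueError(f"unknown quant {quant!r}; choose from {sorted(GGUF_FILES)}")
    return GGUF_FILES[quant]


def download_gguf(quant: str = "Q5_K_M") -> str:
    """Download the chosen build into the local Hugging Face cache; returns its path."""
    # Imported lazily so the lookup helper works without the package installed.
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant))
```

`download_gguf()` pulls roughly 21 GB for the default Q5_K_M build, so it is left uncalled here.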
License
Apache 2.0 (inherited from the base Gemma 4).