gemma-4-E2B-coder-v1
The first coding fine-tune of google/gemma-4-E2B-it — 34.1% HumanEval pass@1 · matches Code Llama 7B at half the size · runs fully offline on 4 GB RAM · Apache 2.0.
| 🚀 Try the live demo → | No GPU or API key. Works in your browser. |
| 📓 Colab / Jupyter notebook → | Download and run locally in minutes. |
| `ollama run hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M` | One command, no Python required. |
| 📦 Full collection → | Model · Space · Training dataset. |
Trained on 10,000 samples from Magicoder-OSS-Instruct-75K — real open-source code instruction pairs extracted from GitHub — using QLoRA on a single RTX 3080. At ~3.2 GB in Q4_K_M (Gemma 4's 262K-token vocabulary means embedding tables alone are ~2 GB), this model runs on laptops and edge devices with 4 GB RAM.
Who is this for?
Use this model if you want a capable coding assistant that:
- Runs fully offline on a laptop or edge device (4 GB RAM minimum with Q4_K_M)
- Requires no GPU — fast CPU inference via Ollama or llama.cpp
- Is Apache 2.0 licensed for commercial use without restrictions
- Supports Python, JavaScript, TypeScript, Go, Rust, SQL, Bash, and C++
Not ideal for: very long context tasks (training max was 384 tokens), security-critical code generation, or tasks needing the base model's multimodal capabilities.
| | gemma-4-E2B-coder-v1 | Typical 7B coder |
|---|---|---|
| Size (Q4) | ~3.2 GB | ~4.5 GB |
| Min RAM | 4 GB | 6 GB |
| Runs on CPU | Yes (fast — Griffin arch) | Yes (slow) |
| License | Apache 2.0 | Varies |
| Context at training | 384 tokens | 2K–8K |
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ArnavKewalram/gemma-4-E2B-coder-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Render the conversation with the model's chat template, then tokenize it.
messages = [{"role": "user", "content": "Write a Python function that checks if a number is prime."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
4-bit quantized (runs on 4 GB VRAM)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("ArnavKewalram/gemma-4-E2B-coder-v1", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "ArnavKewalram/gemma-4-E2B-coder-v1",
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)
llama.cpp / Ollama (GGUF — no Python required)
# Q4_K_M — ~3.2 GB, runs on 4 GB RAM (laptop/desktop/edge)
ollama run hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M
# llama.cpp (Gemma 4 E-series chat format)
./llama-cli -m gemma-4-E2B-coder-v1-Q4_K_M.gguf \
  -p "<bos><|turn>user\nWrite a binary search in Python<turn|>\n<|turn>model\n" \
  --temp 0.2 -n 512
Available GGUF variants:
| File | Size | Use case |
|---|---|---|
| `gemma-4-E2B-coder-v1-Q4_K_M.gguf` | ~3.2 GB | Best compression; 4 GB RAM minimum |
| `gemma-4-E2B-coder-v1-Q5_K_M.gguf` | ~3.4 GB | Better accuracy, 4 GB RAM minimum |
| `gemma-4-E2B-coder-v1-Q8_0.gguf` | ~4.6 GB | Near-lossless, 6 GB RAM recommended |
| `gemma-4-E2B-coder-v1-F16.gguf` | ~8.6 GB | Full precision BF16, 12 GB RAM |
Note on GGUF sizes: Gemma 4 E2B has an unusually large vocabulary (262,144 tokens vs. ~32K typical). The embedding tables alone account for ~2 GB in the quantized files — larger than a standard Llama 3.2-1B model. Q4_K_M quantizes the embeddings to Q6_K to preserve quality, which explains the larger-than-expected file sizes.
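The size arithmetic in the note above can be checked with a short back-of-the-envelope calculation. The sketch below assumes the standard GGUF Q6_K layout (210 bytes per block of 256 weights, roughly 6.56 bits per weight). The helper names are illustrative and not part of any library, and the ~2 GB figure is taken from the note rather than measured from the files.

```python
# Back-of-the-envelope check of the embedding-table size quoted above.
# Assumption: Q6_K stores each block of 256 weights in 210 bytes.
Q6_K_BLOCK_WEIGHTS = 256
Q6_K_BLOCK_BYTES = 210

VOCAB_SIZE = 262_144  # vocabulary size stated in the note


def q6k_bytes(num_weights: float) -> float:
    """Approximate on-disk size of `num_weights` values stored as Q6_K."""
    return num_weights * Q6_K_BLOCK_BYTES / Q6_K_BLOCK_WEIGHTS


def weights_for_size(num_bytes: float) -> float:
    """Inverse of q6k_bytes: how many Q6_K weights fit in `num_bytes`."""
    return num_bytes * Q6_K_BLOCK_WEIGHTS / Q6_K_BLOCK_BYTES


if __name__ == "__main__":
    embedding_weights = weights_for_size(2e9)  # "~2 GB" from the note
    per_token = embedding_weights / VOCAB_SIZE
    print(f"embedding weights implied by ~2 GB: {embedding_weights / 1e9:.2f}B")
    print(f"values stored per vocabulary entry: {per_token:.0f}")

    # Same per-entry width with a ~32K vocabulary, for comparison.
    small_vocab_bytes = q6k_bytes(int(per_token) * 32_000)
    print(f"same width with a 32K vocabulary: {small_vocab_bytes / 1e9:.2f} GB")
```

The comparison line shows why vocabulary size dominates here: at a fixed per-entry width, the table grows linearly with the number of tokens.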
Why This Model?
- Real code patterns — Magicoder extracts instruction pairs from actual GitHub repositories, not synthetic textbook examples
- Griffin architecture — hybrid local-attention + linear recurrent layers gives lower latency than pure-transformer models of the same size
- First-mover — no other gemma-4-E2B coding fine-tune exists as of June 2026
- Fully merged — released as a complete BF16 checkpoint, no adapter files required
Model Details
| Property | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| Total parameters | ~3.9B |
| Architecture | Hybrid Griffin (attention + linear recurrent) |
| Fine-tuning method | QLoRA (4-bit NF4, double quant) |
| LoRA rank / alpha | 16 / 32 |
| LoRA targets | q/k/v/o (attention), gate/up/down (MLP) — full-path list targeting, excludes Gemma4ClippableLinear SSM layers |
| Trainable params | 24.2M (0.47% of total) |
| Dataset | Magicoder-OSS-Instruct-75K — 10,000 samples, ~525 steps |
| Max sequence length | 384 tokens |
| Learning rate | 2e-4 (cosine decay, 3% warmup) |
| Batch size | 8 (1 × 8 grad accum) |
| Hardware | NVIDIA RTX 3080 10 GB |
| Training time | ~6 hours |
| License | Apache 2.0 |
Training
Trained with TRL SFTTrainer + PEFT LoRA + bitsandbytes on a single consumer GPU.
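For reference, the learning-rate schedule listed in Model Details (peak 2e-4, cosine decay, 3% warmup, ~525 steps) can be sketched in plain Python. This assumes linear warmup from zero followed by a half-cosine decay to zero, which is the default shape of the cosine scheduler in Transformers. The exact warmup step count and final learning rate used in training are not stated in this card, so treat both as assumptions.

```python
import math

PEAK_LR = 2e-4       # from the Model Details table
TOTAL_STEPS = 525    # approximate step count from the table
WARMUP_RATIO = 0.03  # "3% warmup"


def lr_at(step: int, total_steps: int = TOTAL_STEPS,
          peak_lr: float = PEAK_LR, warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at `step` for linear warmup + cosine decay to zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, clamped to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    for step in (0, 16, 100, 270, 525):
        print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```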
Griffin architecture note: The Gemma 4 E-series alternates between standard local-attention layers and Griffin linear-recurrent (SSM) layers. The SSM layers use a custom Gemma4ClippableLinear wrapper that is incompatible with PEFT's default module injection. To work around this, LoRA adapters are injected into a pre-filtered list of 205 Linear4bit instances — verified by isinstance check at load time — covering attention projections (q/k/v/o_proj) and MLP layers (gate/up/down_proj) across all 26 layers, while safely skipping all SSM-layer wrappers. After training, adapters are merged into the base weights using PeftModel.merge_and_unload().
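The filtering step described above can be illustrated without loading the model. The sketch below uses plain Python stand-ins: `Module`, `Linear4bit`, and `ClippableLinear` here are hypothetical placeholder classes, not the bitsandbytes or Transformers types. It walks a module tree, keeps only quantized linear layers whose name ends in one of the targeted projections, and skips everything else. The training script is not included in this card, so this approximates the approach rather than reproducing the author's code.

```python
from typing import Dict, Iterator, List, Optional, Tuple

TARGET_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj")


class Module:
    """Hypothetical stand-in for a framework module with named children."""

    def __init__(self, children: Optional[Dict[str, "Module"]] = None) -> None:
        self.children: Dict[str, "Module"] = dict(children or {})

    def named_modules(self, prefix: str = "") -> Iterator[Tuple[str, "Module"]]:
        """Yield (dotted_path, module) for this module and all descendants."""
        yield prefix, self
        for name, child in self.children.items():
            path = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(path)


class Linear4bit(Module):
    """Stand-in for a 4-bit quantized linear layer (a safe LoRA target)."""


class ClippableLinear(Module):
    """Stand-in for the wrapper type that must be skipped."""


def select_lora_targets(model: Module) -> List[str]:
    """Return full dotted paths of Linear4bit modules with a targeted suffix."""
    targets = []
    for path, module in model.named_modules():
        if not isinstance(module, Linear4bit):
            continue  # skips ClippableLinear wrappers and container modules
        if path.rsplit(".", 1)[-1] in TARGET_SUFFIXES:
            targets.append(path)
    return targets


if __name__ == "__main__":
    # Toy two-layer tree: layer 0 is attention + MLP, layer 1 uses the wrapper.
    toy = Module({"layers": Module({
        "0": Module({
            "self_attn": Module({"q_proj": Linear4bit(), "o_proj": Linear4bit()}),
            "mlp": Module({"gate_proj": Linear4bit()}),
        }),
        "1": Module({"mixer": Module({"q_proj": ClippableLinear()})}),
    })})
    for name in select_lora_targets(toy):
        print(name)
```

With PEFT, a list of full module paths like this can be passed as `target_modules` in `LoraConfig`, which matches each entry either exactly or as a name suffix.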
Training curve (logged every 25 steps):
| Step | Loss | Token Accuracy |
|---|---|---|
| 25 | 1.696 | 70.8% |
| 50 | 0.7828 | 79.2% |
| 75 | 0.737 | 80.3% |
| 100 | 0.7311 | 80.4% |
| 125 | 0.6896 | 81.5% |
| 150 | 0.695 | 81.2% |
| 175 | 0.7103 | 81.1% |
| 200 | 0.6588 | 82.0% |
| 225 | 0.6626 | 82.1% |
| 250 | 0.6585 | 82.1% |
| 275 | 0.6768 | 81.8% |
| 300 | 0.6616 | 82.1% |
| 325 | 0.6442 | 82.3% |
| 350 | 0.6538 | 82.2% |
| 375 | 0.6645 | 82.1% |
| 400 | 0.6472 | 82.2% |
| 425 | 0.667 | 81.8% |
| 450 | 0.6735 | 82.0% |
| 475 | 0.6516 | 82.3% |
| 500 | 0.6772 | 81.6% |
| 525 | 0.6407 | 82.7% ← checkpoint saved |
Evaluation
HumanEval (pass@1)
Evaluated on the full OpenAI HumanEval benchmark (164 Python problems) using Q4_K_M GGUF via Ollama, raw completion mode (no chat template), temperature 0.2, 512 max tokens.
| Model | Size | HumanEval pass@1 |
|---|---|---|
| gemma-4-E2B-coder-v1 (Q4_K_M) | 3.9B | 34.1% |
| Code Llama | 7B | 33.5% |
| Qwen2.5-Coder | 1.5B | 37.2% |
| Llama 3.2 | 3B | 25.4% |
| Gemma 2 | 2B | 18.7% |
At 34.1% pass@1, the model is competitive with Code Llama 7B at roughly half the parameter count. This is notable given that it was fine-tuned on only 10,000 samples with a 384-token context window; longer-context problems are the primary failure mode.
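To make the metric concrete, here is a minimal sketch of the unbiased pass@k estimator introduced with HumanEval: 1 - C(n - c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that pass the tests. The card does not state how many samples were drawn per problem; the example assumes one each, in which case pass@1 is simply the fraction of problems solved. The count of 56 solved problems is inferred from 34.1% of 164 and is not reported by the author.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Assumed for illustration: one sample per problem, 56 of 164 passing.
    results = [(1, 1)] * 56 + [(1, 0)] * 108
    print(f"pass@1 = {benchmark_pass_at_k(results, k=1):.1%}")
```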
Keyword Score
Keyword-based evaluation on 8 coding prompts using Q4_K_M GGUF (CPU inference, llama.cpp b9684, temperature 0.2):
| Prompt | Keywords checked | Score |
|---|---|---|
| Miller-Rabin primality test | `miller`, `witness`, `def is_prime` | 33% |
| Binary search | `mid`, `lo`, `hi`, `def binary_search` | 75% |
| Thread-safe LRU cache | `OrderedDict`, `Lock`, `def get`, `def put` | 100% |
| Recursive list flatten | `def flatten`, `isinstance`, `list` | 100% |
| JavaScript debounce | `setTimeout`, `clearTimeout`, `function debounce` | 100% |
| FizzBuzz | `Fizz`, `Buzz`, `def` | 100% |
| Graph BFS | `deque`, `visited`, `def bfs` | 100% |
| Retry decorator | `def retry`, `wrapper`, `attempts` | 100% |
Average keyword score: 88.5% (8 prompts).
Keyword scoring checks that expected API/structural elements appear in the output — it is a proxy for code correctness, not a formal benchmark. The Miller-Rabin score (33%) is low because the model wrote a functionally correct implementation using variable names a and x rather than the keyword-matched names miller/witness.
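Keyword scoring of this kind is straightforward to reproduce. The sketch below assumes a case-sensitive substring match and an unweighted mean across prompts; the card does not specify either detail, and `keyword_score` is an illustrative name rather than the author's evaluation script.

```python
from typing import Iterable, Sequence


def keyword_score(output: str, keywords: Sequence[str]) -> float:
    """Fraction of `keywords` that occur as substrings of `output`."""
    if not keywords:
        raise ValueError("keywords must not be empty")
    return sum(kw in output for kw in keywords) / len(keywords)


def average_score(scores: Iterable[float]) -> float:
    """Unweighted mean of per-prompt scores."""
    values = list(scores)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical output written for this example only; it is a simplified
    # check, not a complete Miller-Rabin implementation.
    sample = (
        "def is_prime(n):\n"
        "    for a in (2, 3, 5, 7):\n"
        "        x = pow(a, n - 1, n)\n"
        "        if x != 1:\n"
        "            return False\n"
        "    return True\n"
    )
    score = keyword_score(sample, ["miller", "witness", "def is_prime"])
    print(f"keyword score: {score:.0%}")

    # Per-prompt scores as reported in the table above.
    reported = [0.33, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    print(f"average: {average_score(reported):.1%}")
```

The sample illustrates the limitation noted above: code that never uses the words `miller` or `witness` scores one out of three regardless of whether it is correct.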
Limitations
- 384-token training max; prompts + responses longer than this were truncated during fine-tuning — quality may degrade on very long inputs
- Not evaluated on security-sensitive code generation tasks
- Inherits biases and knowledge cutoff of google/gemma-4-E2B-it
- Text-only; multimodal capabilities of the base model are not fine-tuned here
Citation
@misc{gemma4_2026,
  title={Gemma 4: Open Models Based on Gemini Research and Technology},
  author={Google DeepMind},
  year={2026},
}

@misc{magicoder2023,
  title={Magicoder: Source Code Is All You Need},
  author={Wei, Yuxiang and Wang, Zhe and Liu, Jiawei and Ding, Yifeng and Zhang, Lingming},
  year={2023},
  eprint={2312.02120},
  archivePrefix={arXiv},
}