gemma-4-E2B-coder-v1

The first coding fine-tune of google/gemma-4-E2B-it. 34.1% HumanEval pass@1 · competitive with Code Llama 7B at roughly half the parameter count · runs fully offline on 4 GB RAM · Apache 2.0.

🚀 Try the live demo → No GPU or API key. Works in your browser.
📓 Colab / Jupyter notebook → Download and run locally in minutes.
ollama run hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M → One command, no Python required.
📦 Full collection → Model · Space · Training dataset.

Trained on 10,000 samples from Magicoder-OSS-Instruct-75K — real open-source code instruction pairs extracted from GitHub — using QLoRA on a single RTX 3080. At ~3.2 GB in Q4_K_M (Gemma 4's 262K-token vocabulary means embedding tables alone are ~2 GB), this model runs on laptops and edge devices with 4 GB RAM.


Who is this for?

Use this model if you want a capable coding assistant that:

  • Runs fully offline on a laptop or edge device (4 GB RAM minimum with Q4_K_M)
  • Requires no GPU — fast CPU inference via Ollama or llama.cpp
  • Is Apache 2.0 licensed for commercial use without restrictions
  • Covers Python, JavaScript, TypeScript, Go, Rust, SQL, Bash, and C++

Not ideal for: very long context tasks (training max was 384 tokens), security-critical code generation, or tasks needing the base model's multimodal capabilities.

| | gemma-4-E2B-coder-v1 | Typical 7B coder |
|---|---|---|
| Size (Q4) | ~3.2 GB | ~4.5 GB |
| Min RAM | 4 GB | 6 GB |
| Runs on CPU | Yes (fast; Griffin arch) | Yes (slow) |
| License | Apache 2.0 | Varies |
| Context at training | 384 tokens | 2K–8K tokens |

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ArnavKewalram/gemma-4-E2B-coder-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that checks if a number is prime."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True)

print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

4-bit quantized (runs on 4 GB VRAM)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("ArnavKewalram/gemma-4-E2B-coder-v1", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "ArnavKewalram/gemma-4-E2B-coder-v1",
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)

llama.cpp / Ollama (GGUF — no Python required)

# Q4_K_M — ~3.2 GB, runs on 4 GB RAM (laptop/desktop/edge)
ollama run hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M

# llama.cpp (Gemma 4 E-series chat format)
./llama-cli -m gemma-4-E2B-coder-v1-Q4_K_M.gguf \
  -p "<bos><|turn>user\nWrite a binary search in Python<turn|>\n<|turn>model\n" \
  --temp 0.2 -n 512
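When scripting around llama-cli, the single-turn prompt string can be assembled programmatically. The sketch below only mirrors the turn markers shown in the command above. The helper name `build_prompt` is hypothetical, and the marker strings are copied from this card rather than checked against the tokenizer's chat template.

```python
# Turn markers copied verbatim from the llama-cli example in this card.
# Assumption: they match the model's chat template; verify before relying on them.
BOS = "<bos>"
TURN_OPEN = "<|turn>"
TURN_CLOSE = "<turn|>"


def build_prompt(user_message: str) -> str:
    """Build a single-turn prompt that ends where the model's reply begins."""
    return (
        f"{BOS}{TURN_OPEN}user\n{user_message.strip()}{TURN_CLOSE}\n"
        f"{TURN_OPEN}model\n"
    )


if __name__ == "__main__":
    # Prints the same prompt that the llama-cli example passes via -p.
    print(build_prompt("Write a binary search in Python"))
```

With transformers, `tokenizer.apply_chat_template` (as in Quick Start) is the safer route, since it reads the template shipped with the model.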

Available GGUF variants:

| File | Size | Use case |
|---|---|---|
| gemma-4-E2B-coder-v1-Q4_K_M.gguf | ~3.2 GB | Best compression; 4 GB RAM minimum |
| gemma-4-E2B-coder-v1-Q5_K_M.gguf | ~3.4 GB | Better accuracy; 4 GB RAM minimum |
| gemma-4-E2B-coder-v1-Q8_0.gguf | ~4.6 GB | Near-lossless; 6 GB RAM recommended |
| gemma-4-E2B-coder-v1-F16.gguf | ~8.6 GB | Full precision BF16; 12 GB RAM |

Note on GGUF sizes: Gemma 4 E2B has an unusually large vocabulary (262,144 tokens vs. ~32K typical). The embedding tables alone account for ~2 GB in the quantized files — larger than a standard Llama 3.2-1B model. Q4_K_M quantizes the embeddings to Q6_K to preserve quality, which explains the larger-than-expected file sizes.


Why This Model?

  • Real code patterns: Magicoder extracts instruction pairs from actual GitHub repositories, not synthetic textbook examples
  • Griffin architecture: the hybrid of local-attention and linear recurrent layers gives lower latency than pure-transformer models of the same size
  • First-mover: no other gemma-4-E2B coding fine-tune exists as of June 2026
  • Fully merged: released as a complete BF16 checkpoint, no adapter files required

Model Details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| Total parameters | ~3.9B |
| Architecture | Hybrid Griffin (attention + linear recurrent) |
| Fine-tuning method | QLoRA (4-bit NF4, double quant) |
| LoRA rank / alpha | 16 / 32 |
| LoRA targets | q/k/v/o (attention), gate/up/down (MLP); full-path list targeting, excludes Gemma4ClippableLinear SSM layers |
| Trainable params | 24.2M (0.47% of total) |
| Dataset | Magicoder-OSS-Instruct-75K (10,000 samples, ~525 steps) |
| Max sequence length | 384 tokens |
| Learning rate | 2e-4 (cosine decay, 3% warmup) |
| Batch size | 8 (1 × 8 grad accum) |
| Hardware | NVIDIA RTX 3080 10 GB |
| Training time | ~6 hours |
| License | Apache 2.0 |
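To make the learning-rate row concrete, here is a minimal sketch of a cosine schedule with 3% linear warmup over roughly 525 steps. It assumes the rate ramps linearly from zero, that the warmup step count is rounded up, and that the cosine decays all the way to zero. The card does not show the actual trainer configuration, so `lr_at` is a hypothetical helper, not the training code.

```python
import math

PEAK_LR = 2e-4        # peak learning rate from the table above
TOTAL_STEPS = 525     # approximate step count from the table above
WARMUP_RATIO = 0.03   # 3% warmup


def lr_at(step: int, total: int = TOTAL_STEPS, peak: float = PEAK_LR,
          warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    # Assumption: warmup steps are rounded up (16 steps for 3% of 525).
    warmup = math.ceil(total * warmup_ratio)
    if step < warmup:
        return peak * step / max(1, warmup)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 8, 16, 100, 270, 400, 525):
        print(f"step {s:3d}: lr = {lr_at(s):.2e}")
```

Under these assumptions the rate peaks at step 16 and falls smoothly afterwards, so the later entries in the training log are produced at progressively smaller step sizes.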

Training

Trained with TRL SFTTrainer + PEFT LoRA + bitsandbytes on a single consumer GPU.

Griffin architecture note: The Gemma 4 E-series alternates between standard local-attention layers and Griffin linear-recurrent (SSM) layers. The SSM layers use a custom Gemma4ClippableLinear wrapper that is incompatible with PEFT's default module injection. To work around this, LoRA adapters are injected into a pre-filtered list of 205 Linear4bit instances — verified by isinstance check at load time — covering attention projections (q/k/v/o_proj) and MLP layers (gate/up/down_proj) across all 26 layers, while safely skipping all SSM-layer wrappers. After training, adapters are merged into the base weights using PeftModel.merge_and_unload().
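The pre-filtering step described above can be sketched as a function that keeps only modules of an allowed type whose final name component is one of the listed projections. This is an illustration under stated assumptions, not the author's script: `select_lora_targets`, the stand-in classes, and the module paths in the demo are all hypothetical, and stand-ins are used so the sketch runs without torch installed.

```python
from typing import Iterable, List, Tuple, Type

# Projection names listed under "LoRA targets" in this card.
TARGET_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj")


def select_lora_targets(named_modules: Iterable[Tuple[str, object]],
                        allowed_type: Type,
                        suffixes: Tuple[str, ...] = TARGET_SUFFIXES) -> List[str]:
    """Return full module paths whose type and leaf name both qualify."""
    targets = []
    for name, module in named_modules:
        leaf = name.rsplit(".", 1)[-1]  # last component of the dotted path
        if isinstance(module, allowed_type) and leaf in suffixes:
            targets.append(name)
    return targets


# Hypothetical stand-ins for the real layer classes.
class FakeLinear4bit: ...
class FakeClippableLinear: ...


# Hypothetical module paths, for illustration only.
demo_modules = [
    ("layers.0.self_attn.q_proj", FakeLinear4bit()),
    ("layers.0.mlp.down_proj", FakeLinear4bit()),
    ("layers.1.ssm.q_proj", FakeClippableLinear()),  # wrong type: skipped
    ("layers.1.norm", FakeLinear4bit()),             # wrong name: skipped
]

if __name__ == "__main__":
    print(select_lora_targets(demo_modules, FakeLinear4bit))
```

With a real model, one would presumably pass `model.named_modules()` and bitsandbytes' `Linear4bit` class, then hand the resulting list to `peft.LoraConfig(target_modules=...)`, which accepts a list of module names.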

Training curve (logged every 25 steps):

| Step | Loss | Token Accuracy |
|---|---|---|
| 25 | 1.696 | 70.8% |
| 50 | 0.7828 | 79.2% |
| 75 | 0.737 | 80.3% |
| 100 | 0.7311 | 80.4% |
| 125 | 0.6896 | 81.5% |
| 150 | 0.695 | 81.2% |
| 175 | 0.7103 | 81.1% |
| 200 | 0.6588 | 82.0% |
| 225 | 0.6626 | 82.1% |
| 250 | 0.6585 | 82.1% |
| 275 | 0.6768 | 81.8% |
| 300 | 0.6616 | 82.1% |
| 325 | 0.6442 | 82.3% |
| 350 | 0.6538 | 82.2% |
| 375 | 0.6645 | 82.1% |
| 400 | 0.6472 | 82.2% |
| 425 | 0.667 | 81.8% |
| 450 | 0.6735 | 82.0% |
| 475 | 0.6516 | 82.3% |
| 500 | 0.6772 | 81.6% |
| 525 | 0.6407 | 82.7% ← checkpoint saved |

Evaluation

HumanEval (pass@1)

Evaluated on the full OpenAI HumanEval benchmark (164 Python problems) using Q4_K_M GGUF via Ollama, raw completion mode (no chat template), temperature 0.2, 512 max tokens.
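As a rough illustration of that setup, the sketch below builds a request for Ollama's local `/api/generate` endpoint with `raw` enabled, so that no chat template is applied, and with the stated temperature and token limit. The evaluation harness itself is not shown in this card, so the helper names `build_raw_request` and `generate` are hypothetical and the default local port is assumed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
MODEL = "hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M"


def build_raw_request(model: str, prompt: str) -> dict:
    """Request body for a raw (template-free) completion."""
    return {
        "model": model,
        "prompt": prompt,
        "raw": True,      # send the prompt as-is, without a chat template
        "stream": False,  # return one JSON object instead of a stream
        "options": {"temperature": 0.2, "num_predict": 512},
    }


def generate(prompt: str, url: str = OLLAMA_URL) -> str:
    """Send the prompt to a running Ollama server and return the completion."""
    body = json.dumps(build_raw_request(MODEL, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate(...)` with a HumanEval function signature and docstring as the prompt would return the model's continuation, which a harness could then execute against the benchmark's unit tests.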

| Model | Size | HumanEval pass@1 |
|---|---|---|
| gemma-4-E2B-coder-v1 (Q4_K_M) | 3.9B | 34.1% |
| Code Llama | 7B | 33.5% |
| Qwen2.5-Coder | 1.5B | 37.2% |
| Llama 3.2 | 3B | 25.4% |
| Gemma 2 | 2B | 18.7% |

At 34.1% pass@1, the model is competitive with Code Llama 7B (33.5%) at roughly half the parameter count. This is notable given that it was fine-tuned on only 10,000 samples with a 384-token sequence limit; longer-context problems are the primary failure mode.
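To put the percentages in terms of problems solved, the sketch below lists which solved counts out of 164 round to a reported figure. It assumes one completion per problem, so that pass@1 is simply solved / total, and that the comparison figures refer to the same 164 problems. The card states neither, so treat the result as a back-of-the-envelope reading.

```python
TOTAL_PROBLEMS = 164  # size of the HumanEval benchmark


def consistent_solved_counts(reported_pct: float,
                             total: int = TOTAL_PROBLEMS) -> list:
    """Solved counts whose percentage, rounded to 1 decimal, equals reported_pct."""
    return [c for c in range(total + 1)
            if round(100 * c / total, 1) == reported_pct]


if __name__ == "__main__":
    print(consistent_solved_counts(34.1))  # this model
    print(consistent_solved_counts(33.5))  # Code Llama 7B
```

Under those assumptions, 34.1% corresponds to 56 solved problems and 33.5% to 55, so the gap between the two models is about one problem.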

Keyword Score

Keyword-based evaluation on 8 coding prompts using Q4_K_M GGUF (CPU inference, llama.cpp b9684, temperature 0.2):

| Prompt | Keywords checked | Score |
|---|---|---|
| Miller-Rabin primality test | miller, witness, def is_prime | 33% |
| Binary search | mid, lo, hi, def binary_search | 75% |
| Thread-safe LRU cache | OrderedDict, Lock, def get, def put | 100% |
| Recursive list flatten | def flatten, isinstance, list | 100% |
| JavaScript debounce | setTimeout, clearTimeout, function debounce | 100% |
| FizzBuzz | Fizz, Buzz, def | 100% |
| Graph BFS | deque, visited, def bfs | 100% |
| Retry decorator | def retry, wrapper, attempts | 100% |

Average keyword score: 88.5% (8 prompts).

Keyword scoring checks that expected API/structural elements appear in the output — it is a proxy for code correctness, not a formal benchmark. The Miller-Rabin score (33%) is low because the model wrote a functionally correct implementation using variable names a and x rather than the keyword-matched names miller/witness.
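A minimal sketch of this kind of scoring, assuming the score is the percentage of expected keywords found as substrings of the output. Case-insensitive matching is an assumption, `keyword_score` is a hypothetical helper, and the sample output string is made up for illustration rather than taken from the model.

```python
def keyword_score(output: str, keywords: list) -> float:
    """Percentage of keywords that occur as substrings of the output."""
    text = output.lower()  # assumption: matching ignores case
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return 100.0 * hits / len(keywords)


# Made-up output that names its variables a and x, as described above.
sample_output = "def is_prime(n):\n    a = 2\n    x = pow(a, n - 1, n)\n    return x == 1\n"
miller_rabin_keywords = ["miller", "witness", "def is_prime"]

# Per-prompt scores as reported in the table above.
REPORTED_SCORES = [33, 75, 100, 100, 100, 100, 100, 100]

if __name__ == "__main__":
    print(round(keyword_score(sample_output, miller_rabin_keywords)))  # one of three keywords found
    print(sum(REPORTED_SCORES) / len(REPORTED_SCORES))                 # average over the 8 prompts
```

The sample shows how an implementation can miss two of three keywords purely through naming choices, and the second line reproduces the 88.5% average from the eight reported scores.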


Limitations

  • 384-token training max; prompts + responses longer than this were truncated during fine-tuning — quality may degrade on very long inputs
  • Not evaluated on security-sensitive code generation tasks
  • Inherits biases and knowledge cutoff of google/gemma-4-E2B-it
  • Text-only; multimodal capabilities of the base model are not fine-tuned here

Citation

@misc{gemma4_2026,
  title={Gemma 4: Open Models Based on Gemini Research and Technology},
  author={Google DeepMind},
  year={2026},
}

@misc{magicoder2023,
  title={Magicoder: Source Code Is All You Need},
  author={Wei, Yuxiang and Wang, Zhe and Liu, Jiawei and Ding, Yifeng and Zhang, Lingming},
  year={2023},
  eprint={2312.02120},
  archivePrefix={arXiv},
}