Releasing gemma-4-E2B-coder-v1 — the first coding fine-tune of Gemma 4 E2B

#1
by ArnavKewalram - opened

I just finished training gemma-4-E2B-coder-v1, the first coding fine-tune of google/gemma-4-E2B-it, and I wanted to share some hard-won lessons about fine-tuning this unusual architecture.

What it is

A QLoRA fine-tune of Gemma 4 E2B (3.9B params) on 10,000 samples from Magicoder-OSS-Instruct-75K — real instruction pairs extracted from open-source GitHub repositories.

The Q4_K_M GGUF is ~3.2 GB — larger than a typical sub-3B model because Gemma 4 has a 262K-token vocabulary (the embedding tables alone are ~2 GB).

Try it now:

  • 🤗 Live demo Space (no GPU or API key needed)
  • 📓 Notebook (view on HF Hub — download to run in Colab/Jupyter)
  • 🖥️ Ollama: ollama run hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M

Technical lessons — the hard way

1. Griffin architecture breaks PEFT suffix-matching

Gemma 4 E-series isn't a standard transformer — it alternates between local-attention layers and Griffin linear-recurrent (SSM) layers. The SSM layers wrap their projections in Gemma4ClippableLinear, which PEFT can't inject LoRA into.

What doesn't work: standard suffix-based target_modules:

target_modules=["q_proj", "k_proj", "v_proj", "o_proj", ...]  # crashes
# ValueError: Target module Gemma4ClippableLinear(...) is not supported

Even if you exclude input_proj/output_proj, this still fails: the SSM layers also have q_proj/k_proj sub-modules wrapped in Gemma4ClippableLinear, so a bare suffix match picks them up.

What works: filter by isinstance(mod, Linear4bit) at load time, then pass full paths as a list:

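from bitsandbytes.nn import Linear4bit
from peft import LoraConfig

_SUFFIX_TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"}

# Keep only modules that are real Linear4bit layers AND end in a target suffix.
# `model` is the 4-bit base model, already loaded.
lora_target_modules = [
    name for name, mod in model.named_modules()
    if isinstance(mod, Linear4bit) and name.split(".")[-1] in _SUFFIX_TARGETS
]
# → 205 verified Linear4bit layers, 0 Gemma4ClippableLinear

# Pass the full module paths as a list, not as suffixes or a regex string.
lora_config = LoraConfig(..., target_modules=lora_target_modules)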
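To see why the type check matters, here is a minimal sketch that reproduces the selection logic on a toy module tree. The two Fake* classes and the module paths are made-up stand-ins (not the real bitsandbytes or Gemma classes), so it runs without a GPU or any model download.

```python
from typing import Iterable, List, Set, Tuple, Type

SUFFIX_TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj"}


class FakeLinear4bit:
    """Stand-in for bitsandbytes.nn.Linear4bit (hypothetical, illustration only)."""


class FakeClippableLinear:
    """Stand-in for the wrapper class that PEFT rejects (hypothetical)."""


def select_targets(
    named_modules: Iterable[Tuple[str, object]],
    allowed_type: Type,
    suffixes: Set[str],
) -> List[str]:
    """Return full paths whose last component is a target suffix
    and whose module is an instance of allowed_type."""
    return [
        name
        for name, mod in named_modules
        if isinstance(mod, allowed_type) and name.rsplit(".", 1)[-1] in suffixes
    ]


# Toy tree: layer 0 has plain 4-bit projections, layer 1 wraps its projections.
modules = [
    ("layers.0.attn.q_proj", FakeLinear4bit()),
    ("layers.0.attn.k_proj", FakeLinear4bit()),
    ("layers.1.ssm.q_proj", FakeClippableLinear()),
    ("layers.1.ssm.input_proj", FakeClippableLinear()),
]

# Suffix matching alone also selects the wrapped q_proj in layer 1.
suffix_only = [n for n, _ in modules if n.rsplit(".", 1)[-1] in SUFFIX_TARGETS]
# Adding the type check drops it.
type_filtered = select_targets(modules, FakeLinear4bit, SUFFIX_TARGETS)

print("suffix only:  ", suffix_only)
print("type filtered:", type_filtered)
```

On the real model, the same comparison (suffix-only count versus type-filtered count) is a cheap sanity check to run before building the LoraConfig.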

Why a list, not a regex? Passing a list lets PEFT do an O(1) set lookup per module. Passing a 254-path alternation regex string means re.fullmatch(14k_char_regex, key) for every module in a 3.9B model — that took 14+ minutes at 97% CPU.
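The difference between the two matching strategies can be reproduced in isolation. The sketch below is only an illustration on synthetic module paths (the naming scheme and counts are invented, and it does not call PEFT): it selects the same targets once with a set lookup and once with a long alternation regex, then prints both timings.

```python
import re
import time

# Synthetic module paths; the naming scheme is made up for illustration.
all_keys = [
    f"model.layers.{i}.block.{j}.proj_{k}"
    for i in range(40)
    for j in range(8)
    for k in range(4)
]
targets = all_keys[::5]  # pretend every fifth module is a LoRA target

# Strategy 1: exact full-path membership in a set.
target_set = set(targets)

# Strategy 2: one big alternation of escaped full paths.
pattern = re.compile("|".join(re.escape(t) for t in targets))

start = time.perf_counter()
by_set = [k for k in all_keys if k in target_set]
set_seconds = time.perf_counter() - start

start = time.perf_counter()
by_regex = [k for k in all_keys if pattern.fullmatch(k)]
regex_seconds = time.perf_counter() - start

print(f"set lookup:        {set_seconds:.4f}s")
print(f"regex alternation: {regex_seconds:.4f}s")
print("same selection:", by_set == by_regex)
```

Both strategies pick the same modules; only the cost per module differs, and that gap grows with the number of alternatives in the pattern.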

2. Dataset ordering matters more than you think

The first 10K samples of Magicoder-OSS-Instruct-75K are sorted by sequence length. Without shuffling, step times escalated from ~50s to 426s/step by step 80 (estimated total: 128 hours on an RTX 3080).

After shuffling before selection:

from datasets import load_dataset

ds = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")
ds = ds.shuffle(seed=77).select(range(10000))  # shuffle FIRST, then select

Step times stabilized at ~8s and training completed in ~6 hours.
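A quick way to catch this before committing to a long run is to check whether the slice you are about to train on is ordered by length. The following is a rough sketch on synthetic token counts (the numbers are generated, not taken from Magicoder), and the sortedness helper is a hypothetical name introduced here.

```python
import random


def sortedness(lengths):
    """Fraction of adjacent pairs that are non-decreasing (1.0 means fully sorted)."""
    if len(lengths) < 2:
        return 1.0
    ordered = sum(1 for a, b in zip(lengths, lengths[1:]) if a <= b)
    return ordered / (len(lengths) - 1)


rng = random.Random(77)

# Synthetic token counts standing in for a dataset stored in length order.
stored_lengths = sorted(rng.randint(20, 2000) for _ in range(5000))

# Selecting without shuffling keeps the stored order.
head = stored_lengths[:1000]

# Shuffling first, then selecting, breaks the ordering.
shuffled = stored_lengths[:]
rng.shuffle(shuffled)
sample = shuffled[:1000]

print("head sortedness:  ", sortedness(head))
print("sample sortedness:", round(sortedness(sample), 2))
```

A value near 1.0 on the unshuffled slice is a hint that batch lengths, and therefore step times, will drift over the run; a value near 0.5 is what a random order looks like.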

3. VRAM cliffs with gradient checkpointing

With max_seq_len=512 and gradient checkpointing, training hung after step 24 at 99% VRAM usage, with GPU utilization pinned at 100% indefinitely. Reducing to max_seq_len=384 left ~223 MB of headroom and stabilized training.
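One way to spot this kind of cliff early is to log free VRAM every few steps and flag the run when headroom gets thin. The sketch below is an assumption-laden illustration, not what the training script did: the helper names and the 2% threshold are hypothetical, and the 10 GiB total is a round number rather than what the driver reports for an RTX 3080. torch.cuda.mem_get_info() is the real PyTorch call that returns (free_bytes, total_bytes).

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def headroom_mb(free_bytes: int) -> float:
    """Free device memory in MiB."""
    return free_bytes / MIB


def is_near_vram_cliff(free_bytes: int, total_bytes: int,
                       min_free_fraction: float = 0.02) -> bool:
    """True when free memory falls below min_free_fraction of the total.

    The 2% default is an arbitrary illustrative threshold, not a tuned value.
    """
    return free_bytes / total_bytes < min_free_fraction


def query_gpu_memory():
    """Return (free_bytes, total_bytes) for the current CUDA device."""
    import torch  # imported lazily so the helpers above work without a GPU

    return torch.cuda.mem_get_info()


# Worked example with the figures from the post, assuming a 10 GiB card.
total = 10 * GIB
free_at_99_percent = int(0.01 * total)  # the hung max_seq_len=512 run
free_with_headroom = 223 * MIB          # the stable max_seq_len=384 run

print(is_near_vram_cliff(free_at_99_percent, total))
print(is_near_vram_cliff(free_with_headroom, total))
print(headroom_mb(free_with_headroom))
```

In a real training loop you would call query_gpu_memory() from a logging callback and feed the result into is_near_vram_cliff.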


Training stats

| Parameter | Value |
|---|---|
| Base model | google/gemma-4-E2B-it (3.9B) |
| Dataset | Magicoder-OSS-Instruct-75K, 10K samples, 1 epoch |
| LoRA rank / alpha | 16 / 32 |
| Trainable parameters | 24.2M (0.47%) |
| Max sequence length | 384 tokens |
| Hardware | NVIDIA RTX 3080 10 GB |
| Training time | ~6 hours |

GGUF variants

| File | Size | Use case |
|---|---|---|
| Q4_K_M | ~3.2 GB | Best compression; 4 GB RAM |
| Q5_K_M | ~3.4 GB | Better accuracy |
| Q8_0 | ~4.6 GB | Near-lossless, desktop |

Larger than typical sub-3B quants because Gemma 4's 262K vocabulary means embedding tables are ~2 GB even at Q4.

Full details and training curve: ArnavKewalram/gemma-4-E2B-coder-v1

Would love feedback — especially if you run it on edge hardware or compare against other sub-3B coders!

Update: quantitative eval results added (88.5% avg)

Just ran an 8-prompt keyword eval on the Q4_K_M GGUF (CPU, llama.cpp b9684, temp 0.2):

| Prompt | Score |
|---|---|
| Miller-Rabin primality test | 33% (correct implementation, different var names) |
| Binary search | 75% |
| Thread-safe LRU cache | 100% |
| Recursive list flatten | 100% |
| JavaScript debounce | 100% |
| FizzBuzz | 100% |
| Graph BFS | 100% |
| Retry decorator | 100% |

Average: 88.5% -- details in the updated model card.
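For readers unfamiliar with keyword evals, here is a minimal sketch of how such a score can be computed and why a correct answer can still score low. The keyword_score helper, the sample output, and the keyword list are all hypothetical; the actual harness and keyword lists are not shown in this post. Only the per-prompt scores are taken from the table above.

```python
from statistics import mean


def keyword_score(output: str, keywords: list) -> float:
    """Percentage of expected keywords found in the output (case-insensitive)."""
    text = output.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return 100.0 * hits / len(keywords)


# Hypothetical case: the logic may be fine, but two of the three expected
# keywords are missing because the model chose different names.
generated = "def is_prime(n, rounds=8):\n    # write n - 1 as 2**s * d\n    return True"
expected = ["witness", "pow(", "is_prime"]
print(round(keyword_score(generated, expected)))

# Per-prompt scores from the table above.
scores = [33, 75, 100, 100, 100, 100, 100, 100]
print(mean(scores))
```

This is why the Miller-Rabin row is annotated: keyword matching measures overlap with an expected vocabulary, not functional correctness.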

Also fixed today:

  • tokenizer_config.json: extra_special_tokens format (fixes load error in transformers 4.57+)
  • Modelfile: corrected chat template to Gemma 4 E-series format (<|turn> / <turn|>)
  • inference.py + notebook: added trust_remote_code=True

Would love to hear results if you test on edge hardware or compare against other sub-4B coders!

Supports Python, JS, TS, Go, Rust, SQL, Bash, and C++ with streaming output. The model is pre-cached at build time, so first generation takes ~30-60s (previously 2-4 min).

Run locally with just 4 GB RAM: ollama run hf.co/ArnavKewalram/gemma-4-E2B-coder-v1:Q4_K_M
