Gemma 4 Gem E4B

Format: 4-bit NF4 | Base: google/gemma-4-E4B-it | 7.5B params

Gemma 4 Gem E4B Multimodal

A fine-tuned version of google/gemma-4-E4B-it optimized for local coding, tool use, instruction following, and structured output. Trained on an H100 via a 5-stage distillation pipeline (SFT + DPO + HF curriculum augmentation).

Hard fails drop by 76% versus stock E4B (17 → 4 on the full40 benchmark).

Model Details

This model includes the original vision and audio encoders from google/gemma-4-E4B-it, so it accepts image, audio, and text inputs. Only the text backbone has been fine-tuned; the vision and audio encoders are unchanged from the base model.

Property | Value
Base Model | google/gemma-4-E4B-it (7.5B)
Training | 5-stage CUDA pipeline (SFT → DPO → HF Curriculum)
Quantization | 4-bit NF4 (bitsandbytes); full BF16 base on request
Context | 2048 tokens
Format | ChatML-style with `<

Training Datasets

  • stage_elite_blend (1,920 rows): OpenThoughts, Hermes, and xLAM gold-standard reasoning
  • Agentic CoT Coding SFT (429 rows): multi-step coding agent tasks
  • Glaive Function Calling v2 (1,000 rows): tool use and JSON schema compliance

Benchmark Results

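Comparison against stock google/gemma-4-E4B-it (4-bit). Each cell gives total score / maximum, then the mean score per item and the number of hard fails (HF) in parentheses. Only full40 lists a stock baseline.

Benchmark | Stock E4B | Gemma 4 Gem E4B | Improvement
full40 | 253/400 (6.33, 17 HF) | 259/400 (6.47, 4 HF) | +6 pts, -13 HF
code_smoke |  | 89/120 (7.42, 1 HF) | beats gate
json_hard |  | 30/30 (10.0, 0 HF) | perfect
false_premise_smoke |  | 87/110 (7.91, 0 HF) | clean
math_smoke |  | 41/60 (6.83, 0 HF) | clean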
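
The bracketed decimals look like per-item means, and both they and the headline hard-fail reduction can be re-derived from the totals. The sketch below assumes every benchmark item is scored out of 10 (inferred from full40 totalling 400 points over 40 items, not stated in the card); the function names are illustrative only.

```python
POINTS_PER_ITEM = 10  # assumption: each item is graded on a 0-10 scale


def mean_item_score(total: int, maximum: int) -> float:
    """Mean score per item, given the total and the maximum achievable total."""
    n_items = maximum // POINTS_PER_ITEM
    return total / n_items


def relative_reduction(before: int, after: int) -> float:
    """Fractional drop from `before` to `after` (0.5 means halved)."""
    return (before - after) / before


# Totals for the fine-tuned model, copied from the benchmark table: (total, maximum)
gem_totals = {
    "full40": (259, 400),
    "code_smoke": (89, 120),
    "json_hard": (30, 30),
    "false_premise_smoke": (87, 110),
    "math_smoke": (41, 60),
}

for name, (total, maximum) in gem_totals.items():
    print(f"{name}: {mean_item_score(total, maximum):.3f}")

# Hard fails on full40: 17 for stock E4B, 4 for the fine-tune
print(f"hard-fail reduction: {relative_reduction(17, 4):.1%}")
```

The reduction comes out as 13/17, which is where the rounded 76% figure comes from.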

Leaderboard Context

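The 🏆 marks the lowest hard-fail count; two of the listed models score higher on the full40 total.

Model | full40 | Hard Fails
Gemma 4 Gem E4B | 259 | 4 🏆
Gemma 4 31B (cloud) | 261 | 15
Chimera v4 (Gemma E2B) | 258 | 14
Stock E4B (4-bit) | 253 | 17
Granite 4.1 8B | 276 | 14
Phi-4 Mini | 223 | 20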

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "stamsam/Gemma_4_Gem_e4b_multimodal_4-bit-NF4"

# The checkpoint is stored in 4-bit NF4, so the bitsandbytes package must be installed.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a function to deduplicate a list"}]
# Tokenize through the chat template directly so special tokens are not added a second time.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
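
Because the model is tuned for structured output and tool use, a caller will often want to parse JSON out of the decoded reply. The following is a minimal sketch using only the standard library; `extract_first_json`, the sample reply, and the `get_weather` tool name are hypothetical and not part of this model's API.

```python
import json


def extract_first_json(text: str):
    """Return the first decodable JSON object or array in `text`, or None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; try the next candidate
        return value
    return None


# Hypothetical model reply that wraps a tool call in ordinary prose
reply = 'Sure, here is the call:\n{"name": "get_weather", "arguments": {"city": "Oslo"}}'
call = extract_first_json(reply)
print(call["name"] if call else "no JSON found")
```

Scanning for the first decodable value tolerates prose before or after the JSON; validating the result against a tool schema would be a separate step.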

Training Details

  • Hardware: NVIDIA H100 80GB SXM
  • Framework: PyTorch 2.11 + PEFT + bitsandbytes 4-bit QLoRA
  • LoRA config: r=8, alpha=16, target_modules=q/k/v/o/gate/up/down_proj
  • Training time: ~30 minutes total across all stages
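
To give a sense of what r=8 and alpha=16 imply: standard LoRA adds two low-rank matrices per adapted weight, A (r × d_in) and B (d_out × r), and scales their product by alpha / r. The sketch below counts the extra trainable parameters for one transformer layer. The layer dimensions are placeholders chosen for illustration, not the real Gemma shapes.

```python
R, ALPHA = 8, 16  # values from the LoRA config above


def lora_params(d_in: int, d_out: int, r: int = R) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


scaling = ALPHA / R  # factor applied to the low-rank update B @ A

# Hypothetical (d_in, d_out) shapes for the seven targeted projections
hidden, intermediate = 2048, 8192  # placeholder sizes, not taken from the model
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in shapes.values())
print(f"scaling = {scaling}, LoRA params per layer = {per_layer:,}")
```

With a rank this small the adapter is a tiny fraction of the base weights, which is consistent with the short training time reported above.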