Format: 4-bit NF4 | Base: google/gemma-4-E4B-it | 7.5B params
Gemma 4 Gem E4B Multimodal
A fine-tuned version of google/gemma-4-E4B-it optimized for local coding, tool use, instruction following, and structured output. It was trained on an H100 via a 5-stage distillation pipeline (SFT + DPO + HF curriculum augmentation).
Hard fails reduced by 76% vs stock E4B (17 → 4 on full40 benchmark).
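The 76% figure follows directly from the two hard-fail counts. A quick check, using only the numbers quoted above:

```python
# Hard-fail counts on the full40 benchmark, as quoted above.
stock_hard_fails = 17
tuned_hard_fails = 4

# Relative reduction = (before - after) / before.
reduction = (stock_hard_fails - tuned_hard_fails) / stock_hard_fails
print(f"{reduction:.1%}")  # 13/17, which rounds to the quoted 76%
```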
Model Details
This model includes the original vision and audio encoders from google/gemma-4-E4B-it, allowing it to process images, audio, and text inputs. The text backbone has been fine-tuned; vision/audio encoders are from the base model.
| Property | Value |
|---|---|
| Base Model | google/gemma-4-E4B-it (7.5B) |
| Training | 5-stage CUDA pipeline (SFT → DPO → HF Curriculum) |
| Quantization | 4-bit NF4 (bitsandbytes) — full BF16 base on request |
| Context | 2048 tokens |
| Format | ChatML-style with `< |
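With a 2048-token context, the prompt and the generated tokens share one budget. The sketch below shows that bookkeeping, assuming the limit covers prompt plus output (the usual convention for decoder-only models, though this card does not state it explicitly). `prompt_budget` is a hypothetical helper name, not part of any library.

```python
CONTEXT_TOKENS = 2048  # context length from the table above


def prompt_budget(max_new_tokens: int, context: int = CONTEXT_TOKENS) -> int:
    """Return how many prompt tokens fit when max_new_tokens are reserved for output."""
    if max_new_tokens < 0 or max_new_tokens > context:
        raise ValueError("max_new_tokens must be between 0 and the context length")
    return context - max_new_tokens


# Reserving 512 tokens for the reply leaves 1536 tokens for the prompt.
print(prompt_budget(512))
```

Longer prompts would need to be truncated or summarized before generation under this assumption.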
Training Datasets
- stage_elite_blend (1,920 rows) — Openthoughts/Hermes/XLam gold-standard reasoning
- Agentic CoT Coding SFT (429 rows) — Multi-step coding agent tasks
- Glaive Function Calling v2 (1,000 rows) — Tool-use and JSON schema compliance
Benchmark Results
Comparison against stock google/gemma-4-E4B-it (4-bit):
| Benchmark | Stock E4B | Gemma 4 Gem E4B | Improvement |
|---|---|---|---|
| full40 | 253/400 (6.33, 17 HF) | 259/400 (6.47, 4 HF) | +6 pts, -13 HF |
| code_smoke | — | 89/120 (7.42, 1 HF) | beats gate |
| json_hard | — | 30/30 (10.0, 0 HF) | perfect |
| false_premise_smoke | — | 87/110 (7.91, 0 HF) | clean |
| math_smoke | — | 41/60 (6.83, 0 HF) | clean |
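The parenthesised figures appear to be mean per-task scores on a 0 to 10 scale, that is, total points divided by the number of tasks (maximum points / 10). That reading is an assumption inferred from the numbers, not something the card states. The sketch below checks it against the rows of the table.

```python
# (points, max_points, reported_mean) taken from the table above.
rows = {
    "full40 (stock)": (253, 400, 6.33),
    "full40 (Gem)": (259, 400, 6.47),
    "code_smoke": (89, 120, 7.42),
    "json_hard": (30, 30, 10.0),
    "false_premise_smoke": (87, 110, 7.91),
    "math_smoke": (41, 60, 6.83),
}


def mean_score(points: int, max_points: int, per_task_max: int = 10) -> float:
    """Mean score per task, assuming every task is scored out of per_task_max."""
    tasks = max_points / per_task_max
    return points / tasks


for name, (points, max_points, reported) in rows.items():
    computed = mean_score(points, max_points)
    # Allow for rounding in the reported two-decimal value.
    assert abs(computed - reported) < 0.01, name
    print(f"{name}: {computed:.3f} (reported {reported})")
```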
Leaderboard Context
| Model | full40 | Hard Fails |
|---|---|---|
| Gemma 4 Gem E4B | 259 | 4 🏆 |
| Gemma 4 31B (cloud) | 261 | 15 |
| Chimera v4 (Gemma E2B) | 258 | 14 |
| Stock E4B (4-bit) | 253 | 17 |
| Granite 4.1 8B | 276 | 14 |
| Phi-4 Mini | 223 | 20 |
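The table is not sorted by either column, and the two metrics rank the models differently. A short sketch that orders the same rows both ways, using only the values above:

```python
# (model, full40 score, hard fails) from the leaderboard table above.
leaderboard = [
    ("Gemma 4 Gem E4B", 259, 4),
    ("Gemma 4 31B (cloud)", 261, 15),
    ("Chimera v4 (Gemma E2B)", 258, 14),
    ("Stock E4B (4-bit)", 253, 17),
    ("Granite 4.1 8B", 276, 14),
    ("Phi-4 Mini", 223, 20),
]

# Higher full40 score is better; fewer hard fails is better.
by_score = sorted(leaderboard, key=lambda row: row[1], reverse=True)
by_hard_fails = sorted(leaderboard, key=lambda row: row[2])

print(by_score[0][0])       # Granite 4.1 8B
print(by_hard_fails[0][0])  # Gemma 4 Gem E4B
```

Gemma 4 Gem E4B leads on hard fails, while Granite 4.1 8B has the highest raw full40 score.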
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id as published on the Hub.
model_path = "stamsam/Gemma_4_Gem_e4b_multimodal-NF4"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # place weights on the available device(s)
    torch_dtype="auto",  # use the dtype stored in the checkpoint
    trust_remote_code=True,
)

# Build a chat-formatted prompt for a text-only request.
messages = [{"role": "user", "content": "Write a function to deduplicate a list"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate and print the full sequence (prompt plus reply).
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
- Hardware: NVIDIA H100 80GB SXM
- Framework: PyTorch 2.11 + PEFT + bitsandbytes 4-bit QLoRA
- LoRA config: r=8, alpha=16, target_modules=q/k/v/o/gate/up/down_proj
- Training time: ~30 minutes total across all stages
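The LoRA settings listed under Training Details (r=8, alpha=16) determine both the scaling of the adapter update and the number of trainable parameters per adapted layer. The sketch below works through that arithmetic for standard LoRA. The layer shape is hypothetical and chosen only for illustration; it is not taken from the Gemma configuration, and `lora_params` is an illustrative helper, not a PEFT function.

```python
r, alpha = 8, 16  # LoRA rank and alpha from the list above

# Standard LoRA scales the low-rank update B @ A by alpha / r.
scaling = alpha / r


def lora_params(in_features: int, out_features: int, rank: int = r) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (rank, in_features) and B has shape (out_features, rank).
    """
    return rank * (in_features + out_features)


# Hypothetical 2048 x 2048 projection, for illustration only.
print(scaling)                  # 2.0
print(lora_params(2048, 2048))  # 32768
```

With these settings each adapted projection adds a small fraction of the parameters of the frozen weight matrix it wraps (32,768 versus about 4.2 million in the hypothetical case above).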