Hugging Face | GitHub | Launch Blog | Documentation
License: Gemma | Authors: Google DeepMind
Gemma 4 E4B IT — FP8 Optimized for Energy Efficiency
Resilient AI Challenge 2026 — Image-to-Text Category (Round 2 Submission)
Team: MPS AI Resilience Challenge
Base Model
| Property | Value |
|---|---|
| Original model | google/gemma-4-E4B-it |
| Architecture | Gemma4ForConditionalGeneration — Dense transformer with sliding + full attention |
| Effective parameters | ~4.5B active during inference (8B total with embeddings) |
| Hidden size | 2560 |
| Layers | 42 |
| Sliding Window | 512 tokens |
| Context window | 128K tokens (served at 4096 for L4 energy constraints) |
| Vocabulary Size | 262K |
| Modalities | Text + Image (vision encoder with 280 soft tokens per image) |
| Vision Encoder Parameters | ~150M |
Model Capabilities
Gemma 4 E4B is a dense multimodal model from the Gemma 4 family. Key capabilities include:
- Thinking – Built-in reasoning mode with step-by-step thinking before answering
- Image Understanding – Object detection, document/PDF parsing, screen/UI understanding, chart comprehension, OCR (multilingual), handwriting recognition, and pointing
- Interleaved Multimodal Input – Mix text and images in any order within a single prompt
- Function Calling – Native support for structured tool use, enabling agentic workflows
- Coding – Code generation, completion, and correction
- Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages
- Long Context – Native 128K token context window
Compression Techniques Applied
1. FP8 Weight Quantization (compressed-tensors format, text-decoder only)
- Method: FP8 E4M3FN per-tensor symmetric weight quantization (no calibration forward pass needed)
- Format: compressed-tensors `float-quantized`; vLLM auto-detects it from the `quantization_config` in `config.json`
- Precision: W8 floating-point (FP8 weights, bf16 activations and compute)
- Quantized layers: `Linear` layers inside the text decoder only (`language_model.layers.*`)
- Preserved in bf16 (listed in `quantization_config.ignore`):
  - Vision encoder (`vision_tower.*`): required so vLLM's `Gemma4ForConditionalGeneration` can bind the multimodal towers (which it instantiates as plain `nn.Linear`, not as quantized linears)
  - Audio encoder (`audio_tower.*`): same reason; the image-to-text category doesn't use audio, but the towers ship with the architecture
  - Multimodal projector (`multi_modal_projector.*`)
  - Output head (`lm_head`) and input embeddings (`embed_tokens`), tied per `tie_word_embeddings: true`
  - Gemma 4-specific `per_layer_input_gate` / `per_layer_projection`
  - All normalization layers
- Quality impact: small (gated by competition's >=80% threshold)
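The per-tensor symmetric scheme named in the Method bullet can be illustrated in plain Python. This is a minimal sketch for intuition only, not the compressed-tensors implementation: the function names (`round_to_e4m3`, `quantize_per_tensor`, `dequantize`) are hypothetical, and a real pipeline would cast tensors to a hardware FP8 dtype instead of rounding floats one by one. It relies only on the FP8 E4M3FN format itself: 3 mantissa bits, a smallest normal exponent of -6, and a largest finite value of 448.

```python
import math

E4M3_MAX = 448.0          # largest finite value in FP8 E4M3FN
E4M3_MIN_NORMAL_EXP = -6  # smallest normal exponent
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3FN value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)                       # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)      # subnormals share the smallest exponent
    step = 2.0 ** (exp - MANTISSA_BITS)        # spacing of representable values near a
    q = min(round(a / step) * step, E4M3_MAX)  # round() is round-half-to-even
    return math.copysign(q, x)


def quantize_per_tensor(weights):
    """Symmetric per-tensor quantization: a single scale, no zero point."""
    amax = max(abs(w) for w in weights)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Map FP8 values back to the original range."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 2.0, 0.3]   # toy tensor, illustrative values
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
```

Because the scale comes only from the largest absolute weight, no calibration data is needed. With 3 mantissa bits, the relative rounding error for values in the normal range stays within about 6.25% (2^-4).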
Why text-decoder only?
vLLM's Gemma 4 model code instantiates the multimodal-tower linears as standard nn.Linear, not as quantized linears. If those weights are pre-packed on disk (as .weight_packed / .weight_scale), vLLM's parameter loader cannot bind them and crashes at load time. Restricting quantization to the text decoder — where the 42 decoder layers dominate both the parameter count and the energy budget — preserves vLLM compatibility while still capturing the bulk of the FP8 energy savings.
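The selection rule described above can be sketched as a simple name filter. This is an illustrative simplification: the real `quantization_config.ignore` list uses the compressed-tensors matching rules, the substring matching here is an assumption, and the module names in the usage lines are hypothetical examples rather than names read from the checkpoint.

```python
# Name fragments kept in bf16, taken from the "Preserved in bf16" list above.
IGNORED_SUBSTRINGS = (
    "vision_tower.",
    "audio_tower.",
    "multi_modal_projector.",
    "lm_head",
    "embed_tokens",
    "per_layer_input_gate",
    "per_layer_projection",
    "norm",
)


def should_quantize(module_name: str) -> bool:
    """Return True only for text-decoder modules that are not on the ignore list."""
    if "language_model.layers." not in module_name:
        return False
    return not any(fragment in module_name for fragment in IGNORED_SUBSTRINGS)


# Hypothetical module names, for illustration only.
examples = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.per_layer_input_gate",
    "model.language_model.layers.0.input_layernorm",
    "model.vision_tower.encoder.layers.0.mlp.fc1",
    "lm_head",
]
decisions = {name: should_quantize(name) for name in examples}
```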
2. FP8 KV Cache
- Setting: `kv_cache_dtype: fp8`
- Effect: Reduces KV cache memory by ~50%, freeing GPU memory for computation
- Quality impact: Negligible
- Energy reduction: ~15% due to reduced memory bandwidth pressure
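The ~50% figure follows from element size alone: bf16 stores 2 bytes per cached value, FP8 stores 1. A rough sketch of the arithmetic is below. Only the layer count (42) and the served context length (4096) come from this card; `KV_HEADS` and `HEAD_DIM` are placeholder values chosen for illustration, and the formula ignores the smaller footprint of sliding-window layers.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int) -> int:
    """Upper-bound KV cache size for one sequence (keys + values in every layer)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


LAYERS = 42      # from the Base Model table
TOKENS = 4096    # max_model_len used for serving
KV_HEADS = 2     # placeholder, not taken from the model config
HEAD_DIM = 256   # placeholder, not taken from the model config

bf16_bytes = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
saving = 1 - fp8_bytes / bf16_bytes   # 0.5 whatever the placeholder values are
```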
3. Reduced Context Window
- Setting: `max_model_len: 4096` (vs. the model's native 128K, i.e. 131,072 tokens)
- Rationale: Image-to-text tasks use <2K tokens. Reducing to 4096 minimizes pre-allocated KV cache, improving GPU utilization.
4. CUDA Graphs (enabled by default)
- `enforce_eager` is NOT set, so CUDA graphs stay enabled (the vLLM default)
- Eliminates Python scheduling overhead in decode, 15-30% faster inference
5. Chunked Prefill + Prefix Caching
- Chunked prefill: Better GPU utilization during image+text prefill
- Prefix caching: Avoids redundant computation for shared prompts
Serving
vllm serve MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized --config vllm_config.yaml
vLLM Configuration
model: MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized
tokenizer: MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized
dtype: bfloat16
max_model_len: 4096
gpu_memory_utilization: 0.90
kv_cache_dtype: fp8
limit_mm_per_prompt:
  image: 1
enable_chunked_prefill: true
enable_prefix_caching: true
max_num_seqs: 32
disable_log_requests: true
Docker Deployment (Lightning AI — Tested Command)
This is the exact Docker command used to load and test this checkpoint on Lightning AI (1x NVIDIA L4).
Step 1: Initialize MODEL_DIR
First, set the path to a local directory containing this checkpoint's files (or clone/download this repo):
export MODEL_DIR=/path/to/gemma4-e4b-it-mps-optimized
Example: if you cloned this repo to ~/models/, use:
export MODEL_DIR=~/models/gemma4-e4b-it-mps-optimized
Step 2: Run the Docker container
docker run --rm --gpus all --ipc=host -p 8000:8000 \
-e VLLM_TEST_FORCE_FP8_MARLIN=1 \
-v "$MODEL_DIR:/model" \
vllm/vllm-openai:v0.23.0-cu129 \
/model \
--tokenizer /model \
--dtype bfloat16 \
--max-model-len 4096 \
--kv-cache-dtype fp8 \
--limit-mm-per-prompt '{"image":1}' \
--enable-chunked-prefill \
--enable-prefix-caching \
--served-model-name gemma4-mps
Flag reference:
- `-e VLLM_TEST_FORCE_FP8_MARLIN=1`: Force FP8 Marlin kernel selection (required for this checkpoint on L4)
- `-v "$MODEL_DIR:/model"`: Mount the local model directory to `/model` inside the container (must be an absolute path)
- `--dtype bfloat16`: Activations and compute in bfloat16 (`quantization_config` in `config.json` handles FP8 weight loading automatically)
- `--kv-cache-dtype fp8`: Keeps the KV cache in FP8 for memory efficiency
- `--max-model-len 4096`: Matches the `vllm_config.yaml` setting
- `--enable-chunked-prefill` / `--enable-prefix-caching`: Same performance optimizations as the config file
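This mirrors `vllm serve ... --config vllm_config.yaml` above: the Docker form passes flags directly on the CLI instead of via a config file, and points to a local model directory instead of the HF repo ID. The match is not exact. The Docker command does not pass `gpu_memory_utilization`, `max_num_seqs`, or `disable_log_requests`, so vLLM's defaults apply to those, and it adds `--served-model-name gemma4-mps`, so requests to this container must use `gemma4-mps` as the model name.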
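The correspondence between the config keys and the CLI flags can be sketched with a small helper. This is an illustration of the naming pattern visible in this card (underscores become dashes, `true` booleans become bare flags, nested mappings become JSON strings), not vLLM's actual argument parser; `config_to_flags` is a hypothetical name.

```python
import json


def config_to_flags(config: dict) -> list[str]:
    """Translate config-file keys into CLI-style flags (simplified mapping)."""
    flags = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:                       # true booleans become bare flags
                flags.append(flag)
        elif isinstance(value, dict):       # nested mappings are passed as JSON
            flags += [flag, json.dumps(value, separators=(",", ":"))]
        else:
            flags += [flag, str(value)]
    return flags


# Subset of vllm_config.yaml that also appears in the Docker command.
config = {
    "dtype": "bfloat16",
    "max_model_len": 4096,
    "kv_cache_dtype": "fp8",
    "limit_mm_per_prompt": {"image": 1},
    "enable_chunked_prefill": True,
    "enable_prefix_caching": True,
}
flags = config_to_flags(config)
```

Running the helper over these keys yields the same flag names and values that appear in the Docker command, which is a quick way to check that the two launch paths have not drifted apart.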
Competition Sampling Parameters
Applied per-request by the evaluation harness:
- temperature: 1.0
- top_p: 0.95
- top_k: 64
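For local testing, the same parameters can be sent with each request. The sketch below builds an OpenAI-style chat completion payload using only the standard library. It assumes the Docker deployment from this card (port 8000, served model name `gemma4-mps`); the image URL is a placeholder, and `top_k` is sent as an extra top-level field, which is how we understand vLLM's OpenAI-compatible server to accept it. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request(image_url: str, question: str, model: str = "gemma4-mps") -> dict:
    """Build a chat completion payload with the competition sampling parameters."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image first, then text
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # assumed to be accepted as an extra sampling field by vLLM
    }


def send_request(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder URL; replace with a real image before calling send_request().
payload = build_request("https://example.com/sample.jpg", "Describe this image.")
```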
Expected Performance
| Metric | Baseline (bf16) | FP8 Optimized | Change |
|---|---|---|---|
| Model size on disk | ~15.3 GB | ~11 GB (text decoder FP8) | ~-25% |
| Inference speed | Reference | ~1.5-2x faster | FP8 tensor cores + CUDA graphs |
| Energy | Reference | ~30-45% less | Significant reduction |
| Quality | Reference | Passes 80% quality gate | Validated on calibration |
Energy Optimization Strategy
The competition ranks by total energy consumed over the benchmark suite:
- FP8 text-decoder weights (compressed-tensors) → FP8 tensor cores on L4 give large GEMM throughput gains where it matters most (the 42 decoder layers dominate the FLOPs budget) = faster = less energy
- FP8 KV cache → Halves cache memory traffic = less energy for attention
- CUDA graphs → Eliminates Python overhead = faster decode = less time on GPU
- Chunked prefill → Better GPU utilization during image processing
- Prefix caching → Avoids redundant computation for repeated prompts
- Reduced max_model_len (4096) → Less pre-allocated memory = more efficient GPU utilization
- Disabled request logging → Reduces I/O overhead during evaluation
Best Practices
For optimal performance, use these configurations:
Sampling Parameters
Use the standardized sampling configuration (applied by the evaluation harness):
- temperature=1.0
- top_p=0.95
- top_k=64
Thinking Mode
- Trigger Thinking: Include the `<|think|>` token at the start of the system prompt
- Disable Thinking: Remove the token; the model will generate empty thought blocks
- Multi-Turn: In multi-turn conversations, do NOT include thinking content from previous turns
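A minimal sketch of the toggle, assuming the token is simply prepended to the system prompt text with no separator (the card does not specify one). `system_prompt` is a hypothetical helper. Removing thinking content from earlier turns is not shown, because the format of the thought blocks is not described here.

```python
THINK_TOKEN = "<|think|>"


def system_prompt(base: str, thinking: bool) -> str:
    """Return the system prompt with the thinking token added or removed."""
    stripped = base.replace(THINK_TOKEN, "").lstrip()
    return THINK_TOKEN + stripped if thinking else stripped


with_thinking = system_prompt("You are a careful image analyst.", thinking=True)
without_thinking = system_prompt(with_thinking, thinking=False)
```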
Multimodal Input Order
For optimal performance:
- Place image content before the text in your prompt
- Audio content (if applicable) goes after the text
Variable Image Resolution
Gemma 4 supports variable image resolution through a configurable visual token budget:
- Supported budgets: 70, 140, 280, 560, 1120
- Lower budgets for classification/captioning (faster inference)
- Higher budgets for OCR, document parsing, reading small text
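When serving at `max_model_len: 4096`, the chosen budget competes with prompt and output tokens for the same context. The sketch below picks the smallest supported budget that covers a requested token count and checks that a request still fits. Both helpers are hypothetical, and how the budget is actually passed to the processor or server is not covered in this card, so that step is left out.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def pick_budget(min_tokens: int) -> int:
    """Smallest supported budget >= min_tokens, capped at the largest budget."""
    for budget in SUPPORTED_BUDGETS:
        if budget >= min_tokens:
            return budget
    return SUPPORTED_BUDGETS[-1]


def fits_context(budget: int, prompt_tokens: int, max_new_tokens: int,
                 max_model_len: int = 4096) -> bool:
    """Check that image tokens, prompt, and generated tokens fit the served context."""
    return budget + prompt_tokens + max_new_tokens <= max_model_len


captioning_budget = pick_budget(100)   # lower budget, faster inference
ocr_budget = pick_budget(1000)         # higher budget for small text
ocr_fits = fits_context(ocr_budget, prompt_tokens=500, max_new_tokens=512)
```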
Limitations
- Models generate responses based on training data patterns — they may produce incorrect or outdated factual statements
- Open-ended or highly complex tasks might be challenging
- Natural language ambiguity (sarcasm, figurative language) can be difficult
- Performance influenced by amount of context provided
Who We Are
Two engineers from Bucharest, Romania — not a typical ML research team. We're enterprise engineers who work with large, complex systems for a living and decided to take on an AI compression challenge.
Team: Mihai Peti & Sonia Frumuseanu
HuggingFace: mihaipeti2009 & frumuseanus
- Mihai Peti: AI Engineer, RAG/LLM systems, 18 years in enterprise software (mihaipeti.vercel.app · linkedin.com/in/mihaipeti)
- Sonia Frumuseanu: Senior SAP ABAP Consultant (linkedin.com/in/sonia-frumuseanu)
Development Environment
All development and testing was done on Lightning AI:
| Component | Spec |
|---|---|
| GPU | NVIDIA L4 Tensor Core |
| VRAM | 24 GB |
| vCPUs | 8 |
| RAM | 32 GB |
| TFLOPs (BF16/FP16) | 121 |
| TOPS (INT8) | 242.5 |
| TOPS (INT4) | 485 |
This matches the competition's evaluation hardware (1x NVIDIA L4).
License
This model is distributed under the Gemma Terms of Use, consistent with the original google/gemma-4-E4B-it model license.
Acknowledgments
- Google DeepMind for the Gemma 4 model family
- The Resilient AI Challenge organizers (France, India, UNESCO, Sustainable AI Coalition)
- Lightning AI for GPU compute resources