
Libraries

How to use MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized")
model = AutoModelForMultimodalLM.from_pretrained("MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized

SGLang

How to use MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized with Docker Model Runner:
```
docker model run hf.co/MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized
```


License: Gemma | Authors: Google DeepMind

Gemma 4 E4B IT — FP8 Optimized for Energy Efficiency

Resilient AI Challenge 2026 — Image-to-Text Category (Round 2 Submission)
Team: MPS AI Resilience Challenge

Base Model

| Property | Value |
|---|---|
| Original model | `google/gemma-4-E4B-it` |
| Architecture | `Gemma4ForConditionalGeneration`: dense transformer with sliding + full attention |
| Effective parameters | ~4.5B active during inference (8B total with embeddings) |
| Hidden size | 2560 |
| Layers | 42 |
| Sliding window | 512 tokens |
| Context window | 128K tokens (served at 4096 for L4 energy constraints) |
| Vocabulary size | 262K |
| Modalities | Text + Image (vision encoder with 280 soft tokens per image) |
| Vision encoder parameters | ~150M |

Model Capabilities

Gemma 4 E4B is a dense multimodal model from the Gemma 4 family. Key capabilities include:

Thinking – Built-in reasoning mode with step-by-step thinking before answering
Image Understanding – Object detection, document/PDF parsing, screen/UI understanding, chart comprehension, OCR (multilingual), handwriting recognition, and pointing
Interleaved Multimodal Input – Mix text and images in any order within a single prompt
Function Calling – Native support for structured tool use, enabling agentic workflows
Coding – Code generation, completion, and correction
Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages
Long Context – Native 128K token context window

Compression Techniques Applied

1. FP8 Weight Quantization (compressed-tensors format, text-decoder only)

Method: FP8 E4M3FN per-tensor symmetric weight quantization (no calibration forward pass needed)
Format: compressed-tensors float-quantized — vLLM auto-detects from config.json quantization_config
Precision: W8 floating-point (FP8 weights, bf16 activations and compute)
Quantized layers: Linear layers inside the text decoder only (language_model.layers.*)
Preserved in bf16 (listed in quantization_config.ignore):
- Vision encoder (vision_tower.*) — required so vLLM's Gemma4ForConditionalGeneration can bind the multimodal towers (which it instantiates as plain nn.Linear, not as quantized linears)
- Audio encoder (audio_tower.*) — same reason; image-to-text category doesn't use audio but the towers ship with the architecture
- Multimodal projector (multi_modal_projector.*)
- Output head (lm_head) and input embeddings (embed_tokens) — tied per tie_word_embeddings: true
- Gemma 4-specific per_layer_input_gate / per_layer_projection
- All normalization layers
Quality impact: small (gated by competition's >=80% threshold)
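
To make the per-tensor symmetric scheme more concrete, here is a minimal pure-Python sketch. It assumes the common convention of a single scale per tensor equal to max|w| / 448 (448 being the largest finite E4M3FN value) and simulates FP8 rounding in software. The function names are illustrative, and this is not the tooling used to produce this checkpoint.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite value in E4M3FN
MIN_NORMAL_EXP = -6    # exponent of the smallest normal E4M3FN value (2**-6)
MANTISSA_BITS = 3


def fp8_e4m3_round(x: float) -> float:
    """Round x to the nearest E4M3FN-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)          # saturate instead of overflowing
    _, e = math.frexp(mag)                   # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)         # below 2**-6 the grid is subnormal
    step = 2.0 ** (exp - MANTISSA_BITS)      # spacing between neighbouring values
    return sign * round(mag / step) * step   # round() resolves ties to even


def quantize_per_tensor(weights):
    """Return (scale, fp8_values) using one symmetric scale for the whole tensor."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return scale, [fp8_e4m3_round(w / scale) for w in weights]


def dequantize(scale, q_values):
    """Recover approximate weights from FP8 values and the tensor scale."""
    return [q * scale for q in q_values]


scale, q = quantize_per_tensor([0.12, -0.5, 0.031, 0.27])
print(scale, q, dequantize(scale, q))
```

Because the scale is derived from the weights alone, no calibration data or forward pass is involved, which matches the "no calibration forward pass needed" note above.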

Why text-decoder only?

vLLM's Gemma 4 model code instantiates the multimodal-tower linears as standard nn.Linear, not as quantized linears. If those weights are pre-packed on disk (as .weight_packed / .weight_scale), vLLM's parameter loader cannot bind them and crashes at load time. Restricting quantization to the text decoder — where the 42 decoder layers dominate both the parameter count and the energy budget — preserves vLLM compatibility while still capturing the bulk of the FP8 energy savings.

2. FP8 KV Cache

Setting: kv_cache_dtype: fp8
Effect: Reduces KV cache memory by ~50% relative to a bf16 cache, freeing GPU memory for computation
Quality impact: Negligible
Energy reduction: ~15% due to reduced memory bandwidth pressure

3. Reduced Context Window

Setting: max_model_len: 4096 (vs. the model's native 128K, i.e. 131,072 tokens)
Rationale: Image-to-text tasks use fewer than 2K tokens, so 4096 leaves headroom while minimizing pre-allocated KV cache and improving GPU utilization.
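
The effect of sections 2 and 3 on cache size can be sketched with simple arithmetic. The snippet below is an upper-bound illustration only: it treats every layer as full attention (sliding-window layers cache at most 512 tokens), ignores any per-tensor scale storage, and uses placeholder values for the number of KV heads and the head dimension, which are not taken from the model config. The ratios it prints do not depend on those placeholders.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per layer, head and token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


LAYERS = 42      # from the Base Model table
KV_HEADS = 2     # placeholder, NOT taken from the model config
HEAD_DIM = 256   # placeholder, NOT taken from the model config

bf16_4k = kv_cache_bytes(4096, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8_4k = kv_cache_bytes(4096, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
fp8_native = kv_cache_bytes(131072, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

print(fp8_4k / bf16_4k)      # 0.5  (FP8 vs bf16 at the same length)
print(fp8_native / fp8_4k)   # 32.0 (native 131,072 tokens vs 4096)
```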

4. CUDA Graphs (enabled by default)

enforce_eager is not set, so CUDA graphs stay enabled (the vLLM default)
Eliminates Python scheduling overhead in decode, giving 15-30% faster inference

5. Chunked Prefill + Prefix Caching

Chunked prefill: Better GPU utilization during image+text prefill
Prefix caching: Avoids redundant computation for shared prompts

Serving

vllm serve MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized --config vllm_config.yaml

vLLM Configuration

model: MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized
tokenizer: MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized
dtype: bfloat16
max_model_len: 4096
gpu_memory_utilization: 0.90
kv_cache_dtype: fp8
limit_mm_per_prompt:
  image: 1
enable_chunked_prefill: true
enable_prefix_caching: true
max_num_seqs: 32
disable_log_requests: true

Docker Deployment (Lightning AI — Tested Command)

This is the exact Docker command used to load and test this checkpoint on Lightning AI (1x NVIDIA L4).

Step 1: Initialize MODEL_DIR

First, set the path to a local directory containing this checkpoint's files (or clone/download this repo):

export MODEL_DIR=/path/to/gemma4-e4b-it-mps-optimized

Example: if you cloned this repo to ~/models/, use:

export MODEL_DIR=~/models/gemma4-e4b-it-mps-optimized

Step 2: Run the Docker container

docker run --rm --gpus all --ipc=host -p 8000:8000 \
    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
    -v "$MODEL_DIR:/model" \
    vllm/vllm-openai:v0.23.0-cu129 \
    /model \
    --tokenizer /model \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --kv-cache-dtype fp8 \
    --limit-mm-per-prompt '{"image":1}' \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --served-model-name gemma4-mps

Flag reference:

-e VLLM_TEST_FORCE_FP8_MARLIN=1 = Force FP8 Marlin kernel selection (required for this checkpoint on L4)
-v "$MODEL_DIR:/model" = Mount local model directory to /model inside container (must be absolute path)
--dtype bfloat16 = Activations and compute in bfloat16 (quantization_config in config.json handles FP8 weight loading automatically)
--kv-cache-dtype fp8 = Keeps KV cache in FP8 for memory efficiency
--max-model-len 4096 = Matches the vllm_config.yaml setting
--enable-chunked-prefill / --enable-prefix-caching = Same performance optimizations as config file

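This mirrors the vllm serve ... --config vllm_config.yaml command above: the Docker form passes flags directly on the CLI instead of via a config file, and points to a local model directory instead of the HF repo ID. It is not an exact match, though: the Docker command adds --served-model-name gemma4-mps and does not pass gpu_memory_utilization, max_num_seqs, or disable_log_requests, so vLLM's defaults apply to those three settings.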
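
The relationship between the config file and the CLI form can be sketched as a small helper. It assumes the convention that each YAML key corresponds to the CLI flag of the same name with underscores replaced by hyphens; `config_to_flags` is a hypothetical helper written for this illustration, not part of vLLM, and it simply skips boolean settings that are false.

```python
import json


def config_to_flags(config: dict) -> list:
    """Turn a vllm_config.yaml-style mapping into a list of CLI arguments."""
    flags = []
    for key, value in config.items():
        if key == "model":
            continue  # the model is passed as a positional argument instead
        flag = "--" + key.replace("_", "-")
        if value is True:
            flags.append(flag)            # boolean switches carry no value
        elif value is False:
            continue                      # simplification: omit disabled switches
        elif isinstance(value, dict):
            flags += [flag, json.dumps(value, separators=(",", ":"))]
        else:
            flags += [flag, str(value)]
    return flags


config = {
    "model": "/model",
    "tokenizer": "/model",
    "dtype": "bfloat16",
    "max_model_len": 4096,
    "kv_cache_dtype": "fp8",
    "limit_mm_per_prompt": {"image": 1},
    "enable_chunked_prefill": True,
    "enable_prefix_caching": True,
}
print(" ".join(config_to_flags(config)))
```

The printed flags line up with the arguments after `/model` in the Docker command, apart from `--served-model-name`, which has no counterpart in the config file.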

Competition Sampling Parameters

Applied per-request by the evaluation harness:

temperature: 1.0
top_p: 0.95
top_k: 64
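
For local testing outside the harness, a request with the same sampling parameters might look like the sketch below. It uses only the standard library and targets the OpenAI-compatible endpoint on port 8000. Two assumptions to check against your setup: the model name `gemma4-mps` is the `--served-model-name` from the Docker command (with plain `vllm serve` it would be the repo ID), and `top_k` is passed as an extra top-level field, which is not part of the OpenAI schema.

```python
import json
import urllib.request


def build_payload(image_url, question, model="gemma4-mps"):
    """Chat-completions body with the competition sampling parameters."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Image first, then text, per the input-order guidance.
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # assumption: accepted by the server as an extra field
    }


def ask(payload, base_url="http://localhost:8000"):
    """POST the payload and return the generated text (needs a running server)."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_payload(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    "What animal is on the candy?",
)
# print(ask(payload))  # uncomment once the server is up
```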

Expected Performance

| Metric | Baseline (bf16) | FP8 Optimized | Change / Notes |
|---|---|---|---|
| Model size on disk | ~15.3 GB | ~11 GB (text decoder FP8) | ~25% smaller |
| Inference speed | Reference | ~1.5-2x faster | FP8 weights + CUDA graphs |
| Energy | Reference | ~30-45% less | Significant reduction |
| Quality | Reference | Passes 80% quality gate | Validated on calibration |

Energy Optimization Strategy

The competition ranks by total energy consumed over the benchmark suite:

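FP8 text-decoder weights (compressed-tensors) → Halves the weight bytes read in the 42 decoder layers, which dominate the workload. Because activations and compute stay in bf16 (see Precision above) and the Marlin kernel is forced, the gain comes from reduced weight memory traffic rather than FP8 tensor-core math = faster = less energy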
FP8 KV cache → Halves cache memory traffic = less energy for attention
CUDA graphs → Eliminates Python overhead = faster decode = less time on GPU
Chunked prefill → Better GPU utilization during image processing
Prefix caching → Avoids redundant computation for repeated prompts
Reduced max_model_len (4096) → Less pre-allocated memory = more efficient GPU utilization
Disabled request logging → Reduces I/O overhead during evaluation
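
The "faster = less energy" reasoning in this list rests on energy being average power multiplied by time. A rough sketch of that arithmetic follows, assuming average power draw stays the same after optimization unless a different `power_ratio` is given; the helper name and that parameter are illustrative only.

```python
def energy_saving(speedup: float, power_ratio: float = 1.0) -> float:
    """Fraction of energy saved when the same work runs `speedup` times faster.

    Energy = average power x time, so the optimized run uses
    power_ratio / speedup of the baseline energy.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - power_ratio / speedup


for s in (1.5, 2.0):
    print(f"{s}x faster -> {energy_saving(s):.0%} less energy")
# 1.5x faster -> 33% less energy
# 2.0x faster -> 50% less energy
```

Under the constant-power assumption, the 1.5-2x speed range in the table maps to 33-50% savings; the actual figure depends on how average power draw changes with the optimizations.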

Best Practices

For optimal performance, use these configurations:

Sampling Parameters

Use the standardized sampling configuration (applied by the evaluation harness):

temperature=1.0
top_p=0.95
top_k=64

Thinking Mode

Trigger Thinking: Include <|think|> token at the start of the system prompt
Disable Thinking: Remove the token; the model will generate empty thought blocks
Multi-Turn: In multi-turn conversations, do NOT include thinking content from previous turns
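
A small sketch of how these rules could be applied when building chat messages. The helper names are hypothetical, and placing the token directly before the prompt text with no separator is an assumption. The helpers do not parse thought blocks: for multi-turn use, the caller is expected to pass in only the final answer text.

```python
THINK_TOKEN = "<|think|>"


def build_messages(system_prompt, user_content, thinking):
    """Return system + user messages with the thinking token added or removed."""
    text = system_prompt.strip()
    if text.startswith(THINK_TOKEN):
        text = text[len(THINK_TOKEN):].lstrip()   # normalise before deciding
    if thinking:
        text = THINK_TOKEN + text                 # token goes at the very start
    return [
        {"role": "system", "content": text},
        {"role": "user", "content": user_content},
    ]


def append_turn(messages, final_answer, next_user_content):
    """Extend the history with the final answer only, never the thinking text."""
    return messages + [
        {"role": "assistant", "content": final_answer},
        {"role": "user", "content": next_user_content},
    ]


history = build_messages("You are a helpful assistant.", "Describe the image.", thinking=True)
history = append_turn(history, "A candy with a turtle on it.", "What colour is it?")
```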

Multimodal Input Order

For optimal performance:

Place image content before the text in your prompt
Audio content (if applicable) goes after the text

Variable Image Resolution

Gemma 4 supports variable image resolution through a configurable visual token budget:

Supported budgets: 70, 140, 280, 560, 1120
Lower budgets for classification/captioning (faster inference)
Higher budgets for OCR, document parsing, reading small text
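
When serving with max_model_len: 4096, the chosen budget also determines how much of the context is left for text. The sketch below shows that arithmetic under the simplifying assumption that an image consumes exactly its budget in context tokens (special tokens are ignored); the helper is illustrative and does not show how the budget is configured.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def text_tokens_left(image_budget, max_model_len=4096, images=1):
    """Context tokens remaining for prompt text and generated output."""
    if image_budget not in SUPPORTED_BUDGETS:
        raise ValueError(f"unsupported budget {image_budget}; choose from {SUPPORTED_BUDGETS}")
    return max_model_len - images * image_budget


for budget in SUPPORTED_BUDGETS:
    print(budget, text_tokens_left(budget))
```

Even the largest budget (1120) leaves 2976 tokens at a 4096-token limit, which is above the <2K tokens that image-to-text tasks are stated to need.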

Limitations

Models generate responses based on training data patterns — they may produce incorrect or outdated factual statements
Open-ended or highly complex tasks might be challenging
Natural language ambiguity (sarcasm, figurative language) can be difficult
Performance influenced by amount of context provided

Who We Are

Two engineers from Bucharest, Romania — not a typical ML research team. We're enterprise engineers who work with large, complex systems for a living and decided to take on an AI compression challenge.

Team: Mihai Peti & Sonia Frumuseanu
HuggingFace: mihaipeti2009 & frumuseanus

Mihai Peti — AI Engineer, RAG/LLM systems, 18 years in enterprise software
mihaipeti.vercel.app · linkedin.com/in/mihaipeti
Sonia Frumuseanu — Senior SAP ABAP Consultant
linkedin.com/in/sonia-frumuseanu

Development Environment

All development and testing was done on Lightning AI:

| Component | Spec |
|---|---|
| GPU | NVIDIA L4 Tensor Core |
| VRAM | 24 GB |
| vCPUs | 8 |
| RAM | 32 GB |
| TFLOPs (BF16/FP16) | 121 |
| TOPS (INT8) | 242.5 |
| TOPS (INT4) | 485 |

This matches the competition's evaluation hardware (1x NVIDIA L4).

License

This model is distributed under the Gemma Terms of Use, consistent with the original google/gemma-4-E4B-it model license.

Acknowledgments

Google DeepMind for the Gemma 4 model family
The Resilient AI Challenge organizers (France, India, UNESCO, Sustainable AI Coalition)
Lightning AI for GPU compute resources

Downloads last month: 74

Safetensors
Model size: 8B params
Tensor type: F32, BF16, F8_E4M3

Model tree for MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized
Base model: google/gemma-4-E4B
Finetuned: google/gemma-4-E4B-it
Quantized (251): this model