You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Hugging Face | GitHub | Launch Blog | Documentation
License: Gemma | Authors: Google DeepMind

Gemma 4 E4B IT — FP8 Optimized for Energy Efficiency

Resilient AI Challenge 2026 — Image-to-Text Category (Round 2 Submission)
Team: MPS AI Resilience Challenge

Base Model

Property Value
Original model google/gemma-4-E4B-it
Architecture Gemma4ForConditionalGeneration — Dense transformer with sliding + full attention
Effective parameters ~4.5B active during inference (8B total with embeddings)
Hidden size 2560
Layers 42
Sliding Window 512 tokens
Context window 128K tokens (served at 4096 for L4 energy constraints)
Vocabulary Size 262K
Modalities Text + Image (vision encoder with 280 soft tokens per image)
Vision Encoder Parameters ~150M

Model Capabilities

Gemma 4 E4B is a dense multimodal model from the Gemma 4 family. Key capabilities include:

  • Thinking – Built-in reasoning mode with step-by-step thinking before answering
  • Image Understanding – Object detection, document/PDF parsing, screen/UI understanding, chart comprehension, OCR (multilingual), handwriting recognition, and pointing
  • Interleaved Multimodal Input – Mix text and images in any order within a single prompt
  • Function Calling – Native support for structured tool use, enabling agentic workflows
  • Coding – Code generation, completion, and correction
  • Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages
  • Long Context – Native 128K token context window

Compression Techniques Applied

1. FP8 Weight Quantization (compressed-tensors format, text-decoder only)

  • Method: FP8 E4M3FN per-tensor symmetric weight quantization (no calibration forward pass needed)
  • Format: compressed-tensors float-quantized — vLLM auto-detects from config.json quantization_config
  • Precision: W8 floating-point (FP8 weights, bf16 activations and compute)
  • Quantized layers: Linear layers inside the text decoder only (language_model.layers.*)
  • Preserved in bf16 (listed in quantization_config.ignore):
    • Vision encoder (vision_tower.*) — required so vLLM's Gemma4ForConditionalGeneration can bind the multimodal towers (which it instantiates as plain nn.Linear, not as quantized linears)
    • Audio encoder (audio_tower.*) — same reason; image-to-text category doesn't use audio but the towers ship with the architecture
    • Multimodal projector (multi_modal_projector.*)
    • Output head (lm_head) and input embeddings (embed_tokens) — tied per tie_word_embeddings: true
    • Gemma 4-specific per_layer_input_gate / per_layer_projection
    • All normalization layers
  • Quality impact: small (gated by competition's >=80% threshold)

Why text-decoder only?

vLLM's Gemma 4 model code instantiates the multimodal-tower linears as standard nn.Linear, not as quantized linears. If those weights are pre-packed on disk (as .weight_packed / .weight_scale), vLLM's parameter loader cannot bind them and crashes at load time. Restricting quantization to the text decoder — where the 42 decoder layers dominate both the parameter count and the energy budget — preserves vLLM compatibility while still capturing the bulk of the FP8 energy savings.

2. FP8 KV Cache

  • Setting: kv_cache_dtype: fp8
  • Effect: Reduces KV cache memory by ~50%, freeing GPU memory for computation
  • Quality impact: Negligible
  • Energy reduction: ~15% due to reduced memory bandwidth pressure

3. Reduced Context Window

  • Setting: max_model_len: 4096 (vs. model's native 131K)
  • Rationale: Image-to-text tasks use <2K tokens. Reducing to 4096 minimizes pre-allocated KV cache, improving GPU utilization.

4. CUDA Graphs (enabled by default)

  • enforce_eager NOT set — CUDA graphs enabled by default
  • Eliminates Python scheduling overhead in decode, 15-30% faster inference

5. Chunked Prefill + Prefix Caching

  • Chunked prefill: Better GPU utilization during image+text prefill
  • Prefix caching: Avoids redundant computation for shared prompts

Serving

vllm serve MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized --config vllm_config.yaml

vLLM Configuration

model: MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized
tokenizer: MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized
dtype: bfloat16
max_model_len: 4096
gpu_memory_utilization: 0.90
kv_cache_dtype: fp8
limit_mm_per_prompt:
  image: 1
enable_chunked_prefill: true
enable_prefix_caching: true
max_num_seqs: 32
disable_log_requests: true

Docker Deployment (Lightning AI — Tested Command)

This is the exact Docker command used to load and test this checkpoint on Lightning AI (1x NVIDIA L4).

Step 1: Initialize MODEL_DIR

First, set the path to a local directory containing this checkpoint's files (or clone/download this repo):

export MODEL_DIR=/path/to/gemma4-e4b-it-mps-optimized

Example: if you cloned this repo to ~/models/, use:

export MODEL_DIR=~/models/gemma4-e4b-it-mps-optimized

Step 2: Run the Docker container

docker run --rm --gpus all --ipc=host -p 8000:8000 \
    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
    -v "$MODEL_DIR:/model" \
    vllm/vllm-openai:v0.23.0-cu129 \
    /model \
    --tokenizer /model \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --kv-cache-dtype fp8 \
    --limit-mm-per-prompt '{"image":1}' \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --served-model-name gemma4-mps

Flag reference:

  • -e VLLM_TEST_FORCE_FP8_MARLIN=1 = Force FP8 Marlin kernel selection (required for this checkpoint on L4)
  • -v "$MODEL_DIR:/model" = Mount local model directory to /model inside container (must be absolute path)
  • --dtype bfloat16 = Activations and compute in bfloat16 (quantization_config in config.json handles FP8 weight loading automatically)
  • --kv-cache-dtype fp8 = Keeps KV cache in FP8 for memory efficiency
  • --max-model-len 4096 = Matches the vllm_config.yaml setting
  • --enable-chunked-prefill / --enable-prefix-caching = Same performance optimizations as config file

This is equivalent to vllm serve ... --config vllm_config.yaml above — the Docker form passes flags directly on the CLI instead of via config file, and points to a local model directory instead of the HF repo ID.

Competition Sampling Parameters

Applied per-request by the evaluation harness:

  • temperature: 1.0
  • top_p: 0.95
  • top_k: 64

Expected Performance

Metric Baseline (bf16) FP8 Optimized Change
Model size on disk ~15.3 GB ~11 GB (text decoder FP8) ~-25%
Inference speed Reference ~1.5-2x faster FP8 tensor cores + CUDA graphs
Energy Reference ~30-45% less Significant reduction
Quality Reference Passes 80% quality gate Validated on calibration

Energy Optimization Strategy

The competition ranks by total energy consumed over the benchmark suite:

  1. FP8 text-decoder weights (compressed-tensors) → FP8 tensor cores on L4 give large GEMM throughput gains where it matters most (the 42 decoder layers dominate the FLOPs budget) = faster = less energy
  2. FP8 KV cache → Halves cache memory traffic = less energy for attention
  3. CUDA graphs → Eliminates Python overhead = faster decode = less time on GPU
  4. Chunked prefill → Better GPU utilization during image processing
  5. Prefix caching → Avoids redundant computation for repeated prompts
  6. Reduced max_model_len (4096) → Less pre-allocated memory = more efficient GPU utilization
  7. Disabled request logging → Reduces I/O overhead during evaluation

Best Practices

For optimal performance, use these configurations:

Sampling Parameters

Use the standardized sampling configuration (applied by the evaluation harness):

  • temperature=1.0
  • top_p=0.95
  • top_k=64

Thinking Mode

  • Trigger Thinking: Include <|think|> token at the start of the system prompt
  • Disable Thinking: Remove the token; the model will generate empty thought blocks
  • Multi-Turn: In multi-turn conversations, do NOT include thinking content from previous turns

Multimodal Input Order

For optimal performance:

  • Place image content before the text in your prompt
  • Audio content (if applicable) goes after the text

Variable Image Resolution

Gemma 4 supports variable image resolution through a configurable visual token budget:

  • Supported budgets: 70, 140, 280, 560, 1120
  • Lower budgets for classification/captioning (faster inference)
  • Higher budgets for OCR, document parsing, reading small text

Limitations

  • Models generate responses based on training data patterns — they may produce incorrect or outdated factual statements
  • Open-ended or highly complex tasks might be challenging
  • Natural language ambiguity (sarcasm, figurative language) can be difficult
  • Performance influenced by amount of context provided

Who We Are

Two engineers from Bucharest, Romania — not a typical ML research team. We're enterprise engineers who work with large, complex systems for a living and decided to take on an AI compression challenge.

Team: Mihai Peti & Sonia Frumuseanu
HuggingFace: mihaipeti2009 & frumuseanus

Development Environment

All development and testing was done on Lightning AI:

Component Spec
GPU NVIDIA L4 Tensor Core
VRAM 24 GB
vCPUs 8
RAM 32 GB
TFLOPs (BF16/FP16) 121
TOPS (INT8) 242.5
TOPS (INT4) 485

This matches the competition's evaluation hardware (1x NVIDIA L4).

License

This model is distributed under the Gemma Terms of Use, consistent with the original google/gemma-4-E4B-it model license.

Acknowledgments

  • Google DeepMind for the Gemma 4 model family
  • The Resilient AI Challenge organizers (France, India, UNESCO, Sustainable AI Coalition)
  • Lightning AI for GPU compute resources
Downloads last month
74
Safetensors
Model size
8B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for MPSAIResilienceChallenge/gemma4-e4b-it-mps-optimized

Quantized
(251)
this model