Access to Mathos34400/resilient-challenge-image-to-text

This repository contains a compressed Gemma-4-E4B-it model submitted for the Resilient AI Challenge 2026 (image-to-text category). Access is granted manually. The original Gemma license applies.

By requesting access you confirm that you accept the Gemma license terms (https://ai.google.dev/gemma/terms) and that you will use this model in compliance with it.

Log in or Sign Up to review the conditions and access this model content.

Gemma-4-E4B-it — Q4_K_M GGUF + Q8_0 mmproj

Submission for the Resilient AI Challenge 2026 — image-to-text category.

Quantized multimodal version of Google's Gemma-4-E4B-it, packaged for inference on a single NVIDIA L4 (24 GB).

Runtime: llama.cpp (llama-server). Image-to-text inference requires llama.cpp with the included multimodal projector (mmproj). vLLM's GGUF backend does not currently support multimodal Gemma 4 inference, so vLLM cannot be used for image input on this submission.

Files

File Role
gemma-4-q4_k_m.gguf Language model — Q4_K_M K-quant (4-bit, imatrix-calibrated)
mmproj-gemma-4-E4B-it-Q8_0.gguf Multimodal projector (vision encoder + projection) — Q8_0
vllm_config.yaml vLLM config (text-only path; kept for completeness, not viable for image input)
config.json, processor_config.json, generation_config.json, tokenizer*, chat_template.jinja HF configs / tokenizer / chat template

Compression

The compression is fully llama.cpp-based and uses importance-matrix-guided 4-bit quantization to preserve quality at low bit-width:

  1. F16 GGUF conversion. The original Gemma-4-E4B-it checkpoint is converted to a full-precision GGUF in F16 (convert_hf_to_gguf.py).
  2. Importance-matrix (imatrix) computation. An imatrix is computed from a calibration dataset with llama.cpp's imatrix tool.
  3. Imatrix-guided Q4_K_M quantization. The F16 GGUF is quantized to Q4_K_M with llama-quantize, passing the imatrix file so the K-quant mix uses the importance information.

The vision projector is shipped as Q8_0 — quantizing the small projector to 8-bit instead of carrying the BF16 file (~990 MB) saves bandwidth and VRAM without measurable quality loss.

Inference — llama-server (required for image input)

llama-server \
  -m gemma-4-q4_k_m.gguf \
  --mmproj mmproj-gemma-4-E4B-it-Q8_0.gguf

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, uses the included chat_template.jinja automatically, and accepts images via the standard image_url content blocks.

The equivalent CLI for local testing is llama-mtmd-cli:

llama-mtmd-cli \
  -m gemma-4-q4_k_m.gguf \
  --mmproj mmproj-gemma-4-E4B-it-Q8_0.gguf \
  --image /path/to/image.jpg \
  -p "Describe the image."

vLLM (text-only — not used for evaluation)

A vllm_config.yaml is provided at the repo root and would be picked up by vllm serve …. However, vLLM's GGUF backend does not support multimodal Gemma 4: it can load the text model but cannot consume images. The image-to-text task therefore runs only under llama.cpp for this submission.

Hardware target

  • GPU: NVIDIA L4 (24 GB), single GPU.
  • Runtime: latest llama.cpp / llama-server (with multimodal --mmproj support, available since the Gemma 4 vision PR in llama.cpp).

License

Released under the Gemma license (https://ai.google.dev/gemma/terms), the same license as the base model.

Downloads last month
39
GGUF
Model size
8B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Mathos34400/resilient-challenge-image-to-text

Quantized
(196)
this model