You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Gemma-4 E4B-it · runtime BitsAndBytes NF4 (Round 1 artifact — retrospective)

Team Godspeed AI's Round 1 submission to the Resilient AI Challenge 2026 (image-to-text). Preserved unchanged as a historical artifact and a documented negative result. Do not use this approach if your goal is inference energy efficiency — read on.

What this is

google/gemma-4-E4B-it with weights stored in bf16 and quantized to 4-bit NF4 at load time by vLLM's BitsAndBytes integration (load-format: bitsandbytes, quantization: bitsandbytes). No weights were modified; compression happens entirely at runtime.

Official Round 1 results (organizer-measured, NVIDIA L4)

Model Energy (J) Doc analysis Image understanding Mean recovery
BF16 base 99.71 0.7608 0.68 100%
This artifact 113.41 (+13.7%) 0.7576 0.56 (−17.6%) 91.45%

The compressed model used more energy than the uncompressed base.

Why — the lesson this repo exists to teach

  1. Runtime dequantization is an energy trap. BnB dequantizes 4-bit tiles to higher precision on every attention and MLP forward. The compute spent unpacking exceeds the bandwidth saved by smaller weights. Stored-weight formats with fused int4 kernels (GPTQ-Marlin / AWQ-Marlin) do the matmul directly on packed weights and actually save energy (−52% in our Round 2 artifacts on identical hardware).
  2. NF4 hurts multimodal composition. The vision tower stays bf16, but the LM layers that compose vision-token embeddings are NF4-quantized; document OCR (mostly text decoding) survived, visual reasoning dropped 17.6%.

Our Round 2 artifacts fix both: g4e4-it-r2-awq-smoke-v0 (primary — AWQ-Marlin full decoder + response-economy chat template, ~4–5× less energy than this repo at higher recovery) and g4e4-it-r2-w4a16-mlpo-v0 (GPTQ-Marlin over MLP and attention-output projections — the conservative alternative).

Usage (reproduction only)

vllm serve Shankara-A-S/g4e4-it-v0 --config vllm_config.yaml

Tested on vLLM 0.20.2. Sampling: temperature=1.0, top_p=0.95, top_k=64 (also in generation_config.json).

License

Apache 2.0, inherited from google/gemma-4-E4B-it.

Downloads last month
38
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Shankara-A-S/g4e4-it-v0

Finetuned
(221)
this model