You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

gemma-4-E4B-it · W4A16 (llmcompressor, observer-only)

Int4 weight-only quantization of google/gemma-4-E4B-it, produced offline with llmcompressor oneshot + QuantizationModifier.

Only LM-stack Linear weights are packed to int4. Vision tower, audio tower, the Per-Layer Embedding (PLE) plumbing, vision/audio projectors, and lm_head are kept at bf16.

Saved in compressed-tensors pack-quantized format. Loads in vLLM via the CompressedTensorsWNA16 loader bound to MarlinLinearKernel.

Weights footprint at load: ≈9.5 GiB (vs ~16 GiB bf16 baseline, −41%).

Why observer-only

Gemma-4 E-variants combine Per-Layer Embeddings (per_layer_input_gate, per_layer_projection) with KV-sharing across decoder layers. Both GPTQModifier and AWQModifier rely on the llmcompressor sequential pipeline, which calls torch.fx.symbolic_trace on the model. The PLE + KV-sharing topology trips torch.fx.proxy.TraceError and the run aborts with no clean recovery.

QuantizationModifier skips the sequential pipeline entirely: it computes observer-only scales from calibration activations and quantizes weights in place. No Hessian, no AWQ smoothing — just statistics from the calibration forwards. This is a working PTQ path on Gemma-4 E.

What is quantized

259 Linear modules across the language stack:

  • 42 × (q_proj, o_proj, gate_proj, up_proj, down_proj) = 210
  • 24 × (k_proj, v_proj) (KV-sharing collapses 18/42 layers) = 48
  • 1 × model.language_model.per_layer_model_projection = 1

Everything in the ignore list above stays bf16.

Calibration

Field Value
Dataset garage-bAInd/Open-Platypus
Split train
Samples 256
Max seq length 2048 tokens
Chat template applied (single-turn user message per row)
Modality text only
Seed 42

Inference

vLLM

No quantization= argument needed — vLLM's compressed-tensors loader auto-detects from config.json and binds to MarlinLinearKernel:

from vllm import LLM
llm = LLM(model="<this repo>", dtype="bfloat16")

Reproduce

A quant_recipe.json is written alongside the safetensors with the git SHA, full scheme dict, ignore patterns, and calibration block — useful for reproducibility audits.

License

Inherits the Gemma License from the base model. By using this checkpoint you agree to the Gemma Terms of Use.

Acknowledgements

Downloads last month
62
Safetensors
Model size
8B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for terra-cognita-ai/ResAI_Image-to-Text_Round-1

Quantized
(201)
this model