You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

gemma-4-E4B-it · pruned 20% → distilled → W4A16

A compute-and-energy-optimized google/gemma-4-E4B-it built by a three-stage pipeline:

  1. Structural MLP prune (−20%) — the LM-stack gate_proj/up_proj/down_proj intermediate dimension is reduced 20% with a calibrated importance criterion. Vision/audio towers and attention are untouched.
  2. Knowledge distillation — the pruned LM is recovered toward the unpruned bf16 teacher (Phase 1 forward-KL + hidden-state matching, Phase 2 on-policy GKD/JSD for brevity). Vision/audio towers frozen. This restores both capability and output brevity to approximate teacher level.
  3. W4A16 quantization — int4 weight-only quantization via llmcompressor oneshot + QuantizationModifier (observer-only; no Hessian/AWQ). Activations stay bf16.

Saved in compressed-tensors pack-quantized format — loads in HF Transformers (Marlin / GPTQ-Marlin kernels, run_compressed=True) and in vLLM via the CompressedTensorsWNA16 loader.

The checkpoint's quant_recipe.json carries the full base → prune → distill → quant source_lineage together with the calibration datasets used in each step.

Quantization recipe

QuantizationModifier(
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
                "observer": "minmax",
                "actorder": None,
                "dynamic": False,
            },
            "input_activations": None,
            "output_activations": None,
        }
    },
    ignore=[
        "re:.*vision_tower.*",          # ViT encoder + patch embedder
        "re:.*audio_tower.*",           # audio layers + subsample + output_proj
        "re:.*per_layer_input_gate.*",  # PLE input gates
        "re:.*per_layer_projection.*",  # PLE projections
        "re:.*embed_vision.*",          # vision embedding_projection
        "re:.*embed_audio.*",           # audio embedding_projection
        "lm_head",
    ],
)

Only LM-stack Linear weights are packed to int4. The vision tower, audio tower, Per-Layer Embedding (PLE) plumbing, vision/audio projectors, and lm_head stay bf16.

Inference

No quantization= argument — vLLM auto-detects compressed-tensors from config.json and binds to MarlinLinearKernel.

To serve:

vllm serve terra-cognita-ai/ResAI_Image-to-Text_final --config vllm_config.yaml

The vllm_config.yaml is included in the root directory of the model.

License

Inherits the Gemma License from the base model. By using this checkpoint you agree to the Gemma Terms of Use.

Acknowledgements

Downloads last month
-
Safetensors
Model size
7B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for terra-cognita-ai/ResAI_Image-to-Text_final

Quantized
(233)
this model