gemma-4-E2B-it — Selective 4-bit NF4 (v2)

Selectively quantized version of google/gemma-4-E2B-it.

Only the LLM backbone is quantized. Audio and vision encoders remain at full bfloat16 precision.

Precision per component

Component	Precision
`audio_tower`	bfloat16 (full precision)
`vision_tower`	bfloat16 (full precision)
`language_model`	4-bit NF4

Why selective quantization?

Loading with a standard BitsAndBytesConfig and device_map="auto" quantizes nn.Linear layers across the entire model graph by default, including those in the audio and vision encoders. This v2 takes a targeted approach: only the linear layers under model.language_model are replaced with Linear4bit, so the encoders keep their full bfloat16 quality.
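The targeted replacement described above can be sketched roughly as follows. This is an illustrative sketch, not the script used to build this checkpoint: the helper names `linear_targets` and `replace_with_4bit` are hypothetical, and the prefix `model.language_model` is assumed to be the dotted module path as reported by `named_modules()`. In bitsandbytes the actual 4-bit packing happens when the module is moved to a CUDA device, so the model would still need a `.to("cuda")` afterwards.

```python
from typing import Any, Callable, Iterable, List, Tuple


def linear_targets(
    named_modules: Iterable[Tuple[str, Any]],
    is_linear: Callable[[Any], bool],
    prefix: str = "model.language_model",  # assumed module path
) -> List[str]:
    """Return names of linear layers that live strictly under `prefix`."""
    return [
        name
        for name, module in named_modules
        if name.startswith(prefix + ".") and is_linear(module)
    ]


def replace_with_4bit(model: Any, prefix: str = "model.language_model") -> List[str]:
    """Swap nn.Linear layers under `prefix` for NF4 Linear4bit layers (sketch)."""
    # Heavy imports are kept local so the selection helper works without them.
    import torch
    import bitsandbytes as bnb

    targets = linear_targets(
        model.named_modules(),
        lambda m: isinstance(m, torch.nn.Linear),
        prefix,
    )
    for name in targets:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name)
        old = getattr(parent, child_name)
        new = bnb.nn.Linear4bit(
            old.in_features,
            old.out_features,
            bias=old.bias is not None,
            compute_dtype=torch.bfloat16,
            quant_type="nf4",
        )
        # Wrap the original weights; they are quantized on transfer to CUDA.
        new.weight = bnb.nn.Params4bit(
            old.weight.data, requires_grad=False, quant_type="nf4"
        )
        if old.bias is not None:
            new.bias = old.bias
        setattr(parent, child_name, new)
    return targets
```

Because the selection is a plain prefix match, modules under `audio_tower` and `vision_tower` are never visited and stay in bfloat16.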

Quantization details

quant_type: nf4
compute_dtype: bfloat16
Method: direct bitsandbytes.nn.Linear4bit replacement on language_model only
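For intuition about what nf4 means, here is a minimal pure-Python sketch of blockwise NF4 quantization: each block of weights is scaled by its absolute maximum and every value is mapped to the nearest of 16 fixed levels. The level table is the NF4 code as I understand it from the QLoRA/bitsandbytes implementation; the function names are hypothetical. The real library does this in CUDA kernels, works on fixed-size blocks, and packs two 4-bit indices per byte, none of which is modeled here.

```python
from typing import List, Tuple

# The 16 NF4 levels (normalized to [-1, 1]).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_block(weights: List[float]) -> Tuple[List[int], float]:
    """Map one block of weights to 4-bit level indices plus an absmax scale."""
    absmax = max(abs(w) for w in weights)
    if absmax == 0.0:
        # All-zero block: every value maps to the exact-zero level.
        return [NF4_LEVELS.index(0.0)] * len(weights), 0.0
    indices = []
    for w in weights:
        x = w / absmax  # normalize into [-1, 1]
        nearest = min(range(len(NF4_LEVELS)), key=lambda i: abs(NF4_LEVELS[i] - x))
        indices.append(nearest)
    return indices, absmax


def dequantize_block(indices: List[int], absmax: float) -> List[float]:
    """Reconstruct approximate weights from level indices and the block scale."""
    return [NF4_LEVELS[i] * absmax for i in indices]


if __name__ == "__main__":
    block = [0.5, -2.0, 0.0, 1.0]
    idx, scale = quantize_block(block)
    print(idx, scale, dequantize_block(idx, scale))
```

The largest-magnitude value in a block and exact zeros are reproduced exactly; everything else is rounded to the nearest level, which is where the quantization error comes from.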

Usage

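import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "derkar00/gemma-4-E2B-it-4bit-nf4-v2"

# Load the checkpoint; device_map="auto" places the weights on the available device(s).
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # dtype for the non-quantized weights
)
# The processor handles tokenization and image/audio preprocessing.
processor = AutoProcessor.from_pretrained(MODEL_ID)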
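After loading, it can be useful to check that the precision split matches the table above. The helper below is a hypothetical sketch that tallies linear-layer class names per component. It assumes the component names `audio_tower`, `vision_tower`, and `language_model` appear as segments in the dotted module names, and it only inspects class names, so it does not verify the stored dtype.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Sequence, Tuple


def precision_report(
    named_modules: Iterable[Tuple[str, Any]],
    components: Sequence[str] = ("audio_tower", "vision_tower", "language_model"),
) -> Dict[str, Counter]:
    """Count linear-layer class names (e.g. Linear, Linear4bit) per component."""
    report: Dict[str, Counter] = {c: Counter() for c in components}
    for name, module in named_modules:
        cls_name = type(module).__name__
        if "Linear" not in cls_name:
            continue  # skip containers, norms, embeddings, etc.
        parts = name.split(".")
        for component in components:
            if component in parts:
                report[component][cls_name] += 1
    return report
```

With the model from the usage snippet, `precision_report(model.named_modules())` would be expected to show `Linear4bit` entries only under `language_model` if the checkpoint loads as described.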

Comparison to v1

See derkar00/gemma-4-E2B-it-4bit-nf4 for v1 where all components including encoders were quantized.

Safetensors: model size 4B params; tensor types F32, BF16
