gemma-4-E2B-it — Selective 4-bit NF4 (v2)
Selectively quantized version of google/gemma-4-E2B-it.
Only the LLM backbone is quantized. Audio and vision encoders remain at full bfloat16 precision.
Precision per component
| Component | Precision |
|---|---|
| audio_tower | bfloat16 (full precision) |
| vision_tower | bfloat16 (full precision) |
| language_model | 4-bit NF4 |
Why selective quantization?
Loading with a standard BitsAndBytesConfig and device_map="auto" quantizes nn.Linear layers throughout the model graph by default, including those in the audio and vision encoders. This v2 takes a targeted approach instead: only the nn.Linear layers under model.language_model are replaced with Linear4bit, so both encoders keep their full bfloat16 quality.
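The selection rule behind this targeted approach can be sketched in plain Python. This is a minimal illustration, not the script used to build this checkpoint: the function name `select_quantization_targets`, the stand-in classes, and the module names in the list are all hypothetical, and no real model is loaded.

```python
from typing import Callable, Iterable, List, Tuple


def select_quantization_targets(
    named_modules: Iterable[Tuple[str, object]],
    is_linear: Callable[[object], bool],
    prefix: str = "model.language_model",
) -> List[str]:
    """Return the dotted names of linear layers that live under `prefix`."""
    targets = []
    for name, module in named_modules:
        # Match on a whole path component so that a sibling such as
        # "model.language_model_extra" is not picked up by accident.
        under_prefix = name == prefix or name.startswith(prefix + ".")
        if under_prefix and is_linear(module):
            targets.append(name)
    return targets


# Stand-in layer types (hypothetical; a real run would use torch modules).
class FakeLinear:
    pass


class FakeNorm:
    pass


# Illustrative module names only, not read from the actual checkpoint.
modules = [
    ("model.audio_tower.proj", FakeLinear()),
    ("model.vision_tower.encoder.fc1", FakeLinear()),
    ("model.language_model.layers.0.self_attn.q_proj", FakeLinear()),
    ("model.language_model.layers.0.input_layernorm", FakeNorm()),
    ("model.language_model.layers.0.mlp.down_proj", FakeLinear()),
]

targets = select_quantization_targets(
    modules, lambda m: isinstance(m, FakeLinear)
)
print(targets)
```

With a real model, the pairs would presumably come from `model.named_modules()` and the predicate would be `lambda m: isinstance(m, torch.nn.Linear)`; each selected layer would then be swapped for a `bitsandbytes.nn.Linear4bit`, while the encoder layers are left untouched.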
Quantization details
- quant_type: nf4
- compute_dtype: bfloat16
- Method: direct bitsandbytes.nn.Linear4bit replacement on language_model only
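For a rough sense of what 4-bit storage saves on a single linear layer, the arithmetic can be sketched as below. The block size of 64 and one 32-bit scale per block are assumptions (this card does not state them), the layer shape is hypothetical, and the estimate ignores biases and any further compression of the scales.

```python
def bits_per_weight(
    weight_bits: int = 4, block_size: int = 64, scale_bits: int = 32
) -> float:
    """Effective bits per weight for blockwise quantization
    with one scale value stored per block (assumed layout)."""
    return weight_bits + scale_bits / block_size


def linear_weight_bytes(out_features: int, in_features: int, bits: float) -> float:
    """Storage for one weight matrix at the given bits per weight."""
    return out_features * in_features * bits / 8


# Hypothetical layer shape, chosen only to make the arithmetic concrete.
out_features, in_features = 2048, 2048

bf16_bytes = linear_weight_bytes(out_features, in_features, 16)
nf4_bytes = linear_weight_bytes(out_features, in_features, bits_per_weight())

print(f"bf16:  {bf16_bytes / 2**20:.2f} MiB")    # bf16:  8.00 MiB
print(f"nf4:   {nf4_bytes / 2**20:.2f} MiB")     # nf4:   2.25 MiB
print(f"ratio: {bf16_bytes / nf4_bytes:.2f}x")   # ratio: 3.56x
```

Under these assumptions the per-block scales add 0.5 bits per weight, which is why the reduction is closer to 3.6x than to a flat 4x. The encoders contribute no savings because they stay in bfloat16.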
Usage

    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    # Loading the 4-bit language_model weights requires bitsandbytes to be installed.
    model = AutoModelForImageTextToText.from_pretrained(
        "derkar00/gemma-4-E2B-it-4bit-nf4-v2",
        device_map="auto",
        torch_dtype=torch.bfloat16,  # dtype for the non-quantized weights
    )
    processor = AutoProcessor.from_pretrained("derkar00/gemma-4-E2B-it-4bit-nf4-v2")

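After loading, one way to check the per-component precision is to group parameter dtypes by component. The helper below is a hypothetical sketch: `dtype_summary` is not part of any library, and it runs here on stand-in parameters with illustrative names rather than on the real model. It also assumes that packed 4-bit weights are stored as `torch.uint8`, which is the usual bitsandbytes storage type, so quantized layers would be reported as uint8 rather than as a 4-bit dtype.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace

COMPONENTS = ("audio_tower", "vision_tower", "language_model")


def dtype_summary(named_parameters):
    """Count parameter tensors per (component, dtype)."""
    summary = defaultdict(Counter)
    for name, param in named_parameters:
        parts = name.split(".")
        # Attribute the parameter to the first known component in its path.
        component = next((c for c in COMPONENTS if c in parts), "other")
        summary[component][str(param.dtype)] += 1
    return {component: dict(counts) for component, counts in summary.items()}


# Stand-in parameters; names and dtypes are illustrative only.
fake_params = [
    ("model.audio_tower.proj.weight", SimpleNamespace(dtype="torch.bfloat16")),
    ("model.vision_tower.encoder.fc1.weight", SimpleNamespace(dtype="torch.bfloat16")),
    ("model.language_model.layers.0.self_attn.q_proj.weight", SimpleNamespace(dtype="torch.uint8")),
    ("model.language_model.layers.0.input_layernorm.weight", SimpleNamespace(dtype="torch.bfloat16")),
]

summary = dtype_summary(fake_params)
for component, counts in summary.items():
    print(component, counts)
```

On the loaded model, the same function could be called as `dtype_summary(model.named_parameters())`. Non-linear parameters inside language_model, such as normalization weights, would be expected to remain in bfloat16, since only linear layers are replaced.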
Comparison to v1
See derkar00/gemma-4-E2B-it-4bit-nf4 for v1, in which all components, including the encoders, were quantized.