Gemma 4 E2B - TurboQuant AWQ 4-bit

4-bit AWQ-quantized version of google/gemma-4-E2B with TurboQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to identify the small fraction of salient weights that matter most to model outputs, protects them, and quantizes the remaining weights aggressively, which makes it well suited to GPU inference. Designed for efficient deployment via AutoAWQ and vLLM on CUDA GPUs.

Approximate model size: ~1.5 GB

Model Specifications

| Property | Value |
| --- | --- |
| Base Model | google/gemma-4-E2B |
| Parameters | ~2 billion |
| Architecture | Dense transformer |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 4-bit (~1.5 GB) |
| Group Size | 128 |
| KV-Cache Quantization | TurboQuant |
| Framework | transformers + AutoAWQ / vLLM |
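
To make the "Group Size 128" entry concrete, here is a minimal sketch of group-wise 4-bit quantization in plain Python. It uses simple round-to-nearest with one scale and one minimum per group of 128 weights. It is not AWQ itself: the activation-aware scaling step is omitted, and the function names are illustrative rather than part of AutoAWQ.

```python
import random

def quantize_group(weights, bits=4):
    """Quantize one group of floats to unsigned `bits`-bit codes.

    Returns (codes, scale, minimum). Real AWQ kernels store a scale and an
    integer zero point instead of the float minimum used in this toy version.
    """
    qmax = (1 << bits) - 1                 # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0        # fall back to 1.0 for a constant group
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [lo + c * scale for c in codes]

def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each independently."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

# Toy data: 256 random weights, so two groups of 128.
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_error:.6f}")
```

Smaller groups track the local weight range more closely at the cost of storing more per-group metadata; 128 is a common middle ground.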

Quickstart

AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-TurboQuant-AWQ-4bit")

prompt = "The history of artificial intelligence began"
# Move inputs to the GPU explicitly; the AutoAWQ wrapper may not expose a .device attribute.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))

vLLM

vllm serve majentik/gemma-4-E2B-TurboQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
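
Once the server is running, `vllm serve` exposes an OpenAI-compatible HTTP API. The sketch below queries the completions endpoint using only the standard library. It assumes the default address `http://localhost:8000`; the helper names `build_completion_request` and `complete` are illustrative, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: default vllm serve host and port
MODEL = "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit"

def build_completion_request(prompt, max_tokens=128, temperature=0.7):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `complete("Explain AWQ quantization.")` returns the generated continuation as a string.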

vLLM offline inference (Python)

from vllm import LLM, SamplingParams

llm = LLM(
    model="majentik/gemma-4-E2B-TurboQuant-AWQ-4bit",
    quantization="awq_marlin",
)
params = SamplingParams(temperature=0.7, max_tokens=512)
print(llm.generate(["Explain AWQ quantization."], params)[0].outputs[0].text)

What is TurboQuant?

TurboQuant (arXiv: 2504.19874) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 4-bit AWQ weight quantization, this provides a dual compression strategy: smaller model weights for reduced VRAM footprint, plus compressed KV cache for efficient long-context generation on GPU.
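
To see why KV-cache compression matters for long contexts, the following back-of-the-envelope sketch computes the raw size of the key and value tensors at different bit widths. The layer count, head count, and head dimension are hypothetical placeholders, not the actual Gemma 4 E2B configuration, and the 4-bit figure is an illustrative width rather than TurboQuant's exact storage format (quantization metadata is ignored).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Raw size of the K and V tensors, ignoring quantization metadata."""
    # Factor 2: one key tensor and one value tensor per layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8

# Hypothetical dimensions for illustration only, not the Gemma 4 E2B config.
dims = dict(num_layers=30, num_kv_heads=1, head_dim=256, seq_len=8192)
fp16 = kv_cache_bytes(bits_per_value=16, **dims)
q4 = kv_cache_bytes(bits_per_value=4, **dims)
print(f"16-bit: {fp16 / 2**20:.0f} MiB, 4-bit: {q4 / 2**20:.0f} MiB")
# prints "16-bit: 240 MiB, 4-bit: 60 MiB"
```

The cache grows linearly with sequence length and batch size, so the saving scales with the workload while the weight footprint stays fixed.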

KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
| --- | --- | --- | --- | --- |
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |

AWQ vs GGUF vs MLX

| Format | Target Hardware | Runtime | Best For |
| --- | --- | --- | --- |
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |

This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.

Memory Estimates (Gemma 4 E2B)

| Precision | Approximate Size | VRAM Tier |
| --- | --- | --- |
| FP16 (original) | ~4 GB | 8 GB+ |
| AWQ 8-bit | ~2 GB | 4 GB+ |
| AWQ 4-bit | ~1.5 GB | 4 GB+ |

Fits comfortably on entry-level GPUs (RTX 3050 / 4060 / A2000 and up).
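
As a rough cross-check of the size estimates, the sketch below computes weight storage from parameter count and bit width, adding per-group overhead for a 16-bit scale and a zero point at group size 128. This is a simplified model under stated assumptions: it treats every parameter as quantized, so it is a lower bound. Layers kept at higher precision (embeddings, for example) would push a real checkpoint above it. The function name is illustrative.

```python
def quantized_weight_bytes(num_params, bits, group_size=None, scale_bits=16):
    """Estimate weight storage in bytes.

    With group_size set, each group also stores one scale (scale_bits)
    and one zero point (assumed to use `bits` bits).
    """
    if group_size is None:
        effective_bits = bits
    else:
        effective_bits = bits + (scale_bits + bits) / group_size
    return num_params * effective_bits / 8

PARAMS = 2_000_000_000  # ~2 billion, from the specification table
for label, bits, group in [("FP16", 16, None),
                           ("AWQ 8-bit", 8, 128),
                           ("AWQ 4-bit", 4, 128)]:
    size_gb = quantized_weight_bytes(PARAMS, bits, group) / 1e9
    print(f"{label}: {size_gb:.2f} GB")
```

At group size 128 the metadata adds only about 0.16 bits per weight for 4-bit quantization, so the bit width dominates the footprint.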

Hardware Requirements

  • NVIDIA GPU with >=4 GB VRAM (RTX 3050, 3060, 4060, A2000, T4)
  • CUDA 12.x recommended
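  • For vLLM: compute capability >= 7.5 (Turing or newer) for the standard AWQ kernels (--quantization awq); the Marlin kernels (awq_marlin) have generally required compute capability >= 8.0 (Ampere or newer), so verify against your vLLM version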

See Also

Quant trade-off (AWQ lane)

| Bits | Approx size | Use case | Recommendation |
| --- | --- | --- | --- |
| **4-bit** | ~860 MB | Activation-aware 4-bit weight quant | GPU inference (vLLM, transformers, AutoAWQ) |
| 8-bit | ~1.5 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |

(The current variant, 4-bit, is bolded.)

Variants in this family

(Showing 18 sibling variants under majentik/gemma4-e2b-*. The current variant, TurboQuant-AWQ-4bit, is bolded.)

| Variant | Runtime | Approx size | Use case |
| --- | --- | --- | --- |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-AWQ-4bit | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ) |
| RotorQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~1.7 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~1.2 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~1.6 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~2.2 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~2.6 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~4.2 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **TurboQuant-AWQ-4bit** | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ) |
| TurboQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| TurboQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference |