Instructions to use majentik/gemma-4-E2B-TurboQuant-AWQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use majentik/gemma-4-E2B-TurboQuant-AWQ-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="majentik/gemma-4-E2B-TurboQuant-AWQ-4bit")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("majentik/gemma-4-E2B-TurboQuant-AWQ-4bit", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use majentik/gemma-4-E2B-TurboQuant-AWQ-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/majentik/gemma-4-E2B-TurboQuant-AWQ-4bit

SGLang

How to use majentik/gemma-4-E2B-TurboQuant-AWQ-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use majentik/gemma-4-E2B-TurboQuant-AWQ-4bit with Docker Model Runner:
```
docker model run hf.co/majentik/gemma-4-E2B-TurboQuant-AWQ-4bit
```

Gemma 4 E2B - TurboQuant AWQ 4-bit

4-bit AWQ-quantized version of google/gemma-4-E2B with TurboQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) is an activation-aware method optimal for GPU inference, preserving the salient weights most important to model outputs while aggressively quantizing the rest. Designed for efficient deployment via AutoAWQ and vLLM on CUDA GPUs.

Approximate model size: ~1.5 GB

Model Specifications

Property	Value
Base Model	google/gemma-4-E2B
Parameters	~2 billion
Architecture	Dense transformer
Modality	Multimodal: image + text input, text output
License	Apache 2.0
Weight Quantization	AWQ 4-bit (~1.5 GB)
Group Size	128
KV-Cache Quantization	TurboQuant
Framework	transformers + AutoAWQ / vLLM

Quickstart

AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-TurboQuant-AWQ-4bit")

prompt = "The history of artificial intelligence began"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))

vLLM

vllm serve majentik/gemma-4-E2B-TurboQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192

Python vLLM client

from vllm import LLM, SamplingParams

llm = LLM(
    model="majentik/gemma-4-E2B-TurboQuant-AWQ-4bit",
    quantization="awq_marlin",
)
params = SamplingParams(temperature=0.7, max_tokens=512)
print(llm.generate(["Explain AWQ quantization."], params)[0].outputs[0].text)

What is TurboQuant?

TurboQuant (arXiv: 2504.19874) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 4-bit AWQ weight quantization, this provides a dual compression strategy: smaller model weights for reduced VRAM footprint, plus compressed KV cache for efficient long-context generation on GPU.

KV-Cache Quantization Comparison

Method	Prefill Speed	Decode Speed	Memory Savings	Reference
TurboQuant	1x (baseline)	1x (baseline)	High	arXiv: 2504.19874
RotorQuant	5.3x faster	28% faster	High	GitHub

AWQ vs GGUF vs MLX

Format	Target Hardware	Runtime	Best For
AWQ	NVIDIA / AMD GPU (CUDA/ROCm)	AutoAWQ, vLLM, TGI	GPU-native inference, production serving
GGUF	CPU + GPU (cross-platform)	llama.cpp, Ollama, LM Studio	Laptops, CPU-only boxes, mixed offload
MLX	Apple Silicon	MLX, mlx-lm, mlx-vlm	Macs with unified memory

This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.

Memory Estimates (Gemma 4 E2B)

Precision	Approximate Size	VRAM Tier
FP16 (original)	~4 GB	8 GB+
AWQ 8-bit	~2 GB	4 GB+
AWQ 4-bit	~1.5 GB	4 GB+

Fits comfortably on entry-level GPUs (RTX 3050 / 4060 / A2000 and up).

Hardware Requirements

NVIDIA GPU with >=4 GB VRAM (RTX 3050, 3060, 4060, A2000, T4)
CUDA 12.x recommended
For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels

Quant trade-off (AWQ lane)

Bits	Approx size	Use case	Recommendation
4-bit	~860 MB	Activation-aware 4-bit weight quant	GPU inference (vLLM, transformers, AutoAWQ)
8-bit	~1.5 GB	Activation-aware 8-bit weight quant	Quality-sensitive GPU inference

(Current variant — 4bit — is bolded.)

Variants in this family

(Showing 18 sibling variants under majentik/gemma4-e2b-*. The current variant — TurboQuant-AWQ-4bit — is bolded.)

Variant	Runtime	Approx size	Use case
RotorQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
RotorQuant-AWQ-4bit	transformers	~1.2 GB	GPU 4-bit (AutoAWQ)
RotorQuant-AWQ-8bit	transformers	~2.2 GB	GPU 8-bit (AutoAWQ)
RotorQuant-GGUF-IQ4_XS	llama.cpp	~1.7 GB	Lossy 4-bit, low-RAM CPU/edge
RotorQuant-GGUF-Q2_K	llama.cpp	~1.2 GB	Lossy, low-RAM CPU/edge
RotorQuant-GGUF-Q3_K_M	llama.cpp	~1.6 GB	Smaller 3-bit, CPU-friendly
RotorQuant-GGUF-Q4_K_M	llama.cpp	~2.2 GB	Balanced default
RotorQuant-GGUF-Q5_K_M	llama.cpp	~2.6 GB	Higher fidelity, more RAM
RotorQuant-GGUF-Q8_0	llama.cpp	~4.2 GB	Near-lossless reference
RotorQuant-MLX-2bit	mlx-lm	~655 MB	Apple Silicon, smallest
RotorQuant-MLX-4bit	mlx-lm	~1.2 GB	Apple Silicon balanced
RotorQuant-MLX-8bit	mlx-lm	~2.4 GB	Apple Silicon reference
TurboQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
TurboQuant-AWQ-4bit	transformers	~1.2 GB	GPU 4-bit (AutoAWQ)
TurboQuant-AWQ-8bit	transformers	~2.2 GB	GPU 8-bit (AutoAWQ)
TurboQuant-MLX-2bit	mlx-lm	~655 MB	Apple Silicon, smallest
TurboQuant-MLX-4bit	mlx-lm	~1.2 GB	Apple Silicon balanced
TurboQuant-MLX-8bit	mlx-lm	~2.4 GB	Apple Silicon reference

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for majentik/gemma-4-E2B-TurboQuant-AWQ-4bit

Base model

google/gemma-4-E2B

Finetuned

(62)

this model

Paper for majentik/gemma-4-E2B-TurboQuant-AWQ-4bit

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Paper • 2504.19874 • Published Apr 28, 2025 • 34