Gemma 4 E2B - TurboQuant AWQ 4-bit
A 4-bit AWQ-quantized version of google/gemma-4-E2B with TurboQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to find the salient weights that matter most to model outputs and protects them while aggressively quantizing the rest. This build is designed for efficient GPU deployment via AutoAWQ and vLLM on CUDA GPUs.
Model size: ~1.5 GB
Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-E2B |
| Parameters | ~2 billion |
| Architecture | Dense transformer |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 4-bit (~1.5 GB) |
| Group Size | 128 |
| KV-Cache Quantization | TurboQuant |
| Framework | transformers + AutoAWQ / vLLM |
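To make the "Group Size" row concrete, here is a minimal sketch of group-wise asymmetric 4-bit quantization, where every group of 128 weights shares one scale and one minimum. It illustrates the storage idea only and is not the AWQ algorithm itself, which additionally derives per-channel scales from calibration activations before rounding. All function names are hypothetical.

```python
import math

# Sketch of group-wise asymmetric 4-bit quantization (storage idea only).
# Not the AWQ algorithm: AWQ also searches activation-aware scales first.

GROUP_SIZE = 128  # matches the "Group Size" row in the table
LEVELS = 15       # 4 bits give integer codes 0..15


def quantize_group(weights):
    """Map one group of floats to 4-bit codes plus (scale, minimum)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for flat groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


def quantize_row(row):
    """Quantize a weight row one group at a time."""
    return [quantize_group(row[i:i + GROUP_SIZE])
            for i in range(0, len(row), GROUP_SIZE)]


if __name__ == "__main__":
    row = [math.sin(0.05 * i) for i in range(256)]  # toy "weights"
    groups = quantize_row(row)
    restored = [w for g in groups for w in dequantize_group(*g)]
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(f"groups={len(groups)} max_abs_error={max_err:.4f}")
```

Smaller groups track local weight ranges more closely but store more scales, which is the trade-off a group size of 128 balances.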
Quickstart
AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit"

# Load the pre-quantized AWQ checkpoint onto the available GPU(s).
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "The history of artificial intelligence began"
# Move the inputs to the GPU explicitly; the AutoAWQ wrapper may not expose `.device`.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
vLLM
```bash
vllm serve majentik/gemma-4-E2B-TurboQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
```
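Once the server is running it exposes an OpenAI-compatible HTTP API. The sketch below queries the `/v1/completions` route using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000`; the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumed address of a local `vllm serve` instance (vLLM's default port).
BASE_URL = "http://localhost:8000"
MODEL_ID = "majentik/gemma-4-E2B-TurboQuant-AWQ-4bit"


def build_completion_request(prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server up, `print(complete("Once upon a time,"))` would print the continuation. Adjust `BASE_URL` if the server was started with a different host or port.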
vLLM Python API (offline inference)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="majentik/gemma-4-E2B-TurboQuant-AWQ-4bit",
    quantization="awq_marlin",
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain AWQ quantization."], params)
print(outputs[0].outputs[0].text)
```
What is TurboQuant?
TurboQuant (arXiv: 2504.19874) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 4-bit AWQ weight quantization, this provides a dual compression strategy: smaller model weights for reduced VRAM footprint, plus compressed KV cache for efficient long-context generation on GPU.
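To see why the KV cache matters for long contexts, the following back-of-the-envelope sketch estimates its size for one sequence. The layer count, KV-head count, and head dimension are hypothetical placeholders, not the real Gemma 4 E2B configuration, and the 4-bit cache width is an assumption chosen only for comparison. Quantization metadata is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Size of the K and V tensors for one sequence, ignoring quantization metadata."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8


# Hypothetical shape, NOT the real Gemma 4 E2B configuration.
shape = dict(num_layers=32, num_kv_heads=4, head_dim=128, seq_len=8192)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q4 = kv_cache_bytes(**shape, bits_per_value=4)  # assumed cache bit width
print(f"fp16: {fp16 / 2**20:.0f} MiB, 4-bit: {q4 / 2**20:.0f} MiB")
```

The cache grows linearly with sequence length, so at long contexts it can rival the size of the 4-bit weights themselves. That is the motivation for compressing both.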
KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
AWQ vs GGUF vs MLX
| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |
This repo ships AWQ. See the "See Also" section for the MLX sibling and the "Variants in this family" table for GGUF builds.
Memory Estimates (Gemma 4 E2B)
| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~4 GB | 8 GB+ |
| AWQ 8-bit | ~2 GB | 4 GB+ |
| AWQ 4-bit | ~1.5 GB | 4 GB+ |
The 4-bit build fits comfortably on entry-level GPUs (RTX 3050 / 4060 / A2000 and up).
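The arithmetic behind estimates like these can be sketched as parameter count times bits per weight, plus per-group metadata for grouped quantization. The metadata layout below (one FP16 scale and one 4-bit zero point per group of 128) is an assumption for illustration, and `weight_bytes` is a hypothetical helper. The result is a lower bound: it does not account for any tensors kept in higher precision, so a published checkpoint can be larger.

```python
def weight_bytes(num_params, bits_per_weight, group_size=None, meta_bits_per_group=0):
    """Rough weight storage: payload bits plus optional per-group metadata."""
    bits = num_params * bits_per_weight
    if group_size:
        bits += (num_params / group_size) * meta_bits_per_group
    return bits / 8


PARAMS = 2e9  # "~2 billion" from the specifications table

fp16 = weight_bytes(PARAMS, 16)
# Assumption: one fp16 scale and one 4-bit zero point per group of 128 weights.
awq4 = weight_bytes(PARAMS, 4, group_size=128, meta_bits_per_group=16 + 4)
print(f"FP16 ~{fp16 / 1e9:.2f} GB, 4-bit lower bound ~{awq4 / 1e9:.2f} GB")
```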
Hardware Requirements
- NVIDIA GPU with >=4 GB VRAM (RTX 3050, 3060, 4060, A2000, T4)
- CUDA 12.x recommended
- For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
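The requirements above can be expressed as a small check. This is a sketch: `check_gpu` is a hypothetical helper, and the thresholds are copied from the list rather than queried from vLLM. On a machine with PyTorch installed, the inputs could come from `torch.cuda.get_device_capability()` and `torch.cuda.get_device_properties(0).total_memory`.

```python
MIN_VRAM_GB = 4                  # from the hardware list
MIN_MARLIN_CAPABILITY = (7, 5)   # as stated in the list for vLLM's Marlin kernels


def check_gpu(vram_gb, capability):
    """Return which requirements a GPU meets; `capability` is (major, minor)."""
    return {
        "enough_vram": vram_gb >= MIN_VRAM_GB,
        "marlin_ok": tuple(capability) >= MIN_MARLIN_CAPABILITY,
    }


# Example: a T4 has 16 GB of VRAM and compute capability 7.5.
print(check_gpu(16, (7, 5)))
```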
See Also
- google/gemma-4-E2B -- Base model
- majentik/gemma-4-E2B-TurboQuant -- TurboQuant KV-cache only (transformers)
- majentik/gemma-4-E2B-TurboQuant-AWQ-8bit -- AWQ 8-bit variant
- majentik/gemma-4-E2B-RotorQuant-AWQ-4bit -- RotorQuant AWQ 4-bit variant
- majentik/gemma-4-E2B-TurboQuant-MLX-4bit -- MLX variant (Apple Silicon)
- TurboQuant Paper (arXiv: 2504.19874)
- AutoAWQ
- vLLM
Quant trade-off (AWQ lane)
| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| **4-bit** | **~860 MB** | **Activation-aware 4-bit weight quant** | **GPU inference (vLLM, transformers, AutoAWQ)** |
| 8-bit | ~1.5 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |
(The current variant, 4-bit, is shown in bold.)
Variants in this family
(Showing 18 sibling variants under majentik/gemma-4-E2B-*. The current variant, TurboQuant-AWQ-4bit, is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-AWQ-4bit | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ) |
| RotorQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~1.7 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~1.2 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~1.6 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~2.2 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~2.6 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~4.2 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **TurboQuant-AWQ-4bit** | **transformers** | **~1.2 GB** | **GPU 4-bit (AutoAWQ)** |
| TurboQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| TurboQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference |