sub1quant mixed-budget Gemma 4 E2B artifacts
This repository contains the mixed-budget sub-4-bit artifact from sub1quant.
The base model is not mirrored here; download google/gemma-4-E2B separately.
Current artifact
| File | Method | Avg BPW | Size |
|---|---|---|---|
| `quantized/gemma_mixed_budget_full_g128_target4p0.pt` | mixed budget, g128, target 4.0 BPW | 3.9990 | 948 MB |
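As a rough illustration of what an average bits-per-weight (BPW) figure means, the sketch below computes total stored bits divided by total weight count across tensors that use different formats. The tensor sizes and per-format costs are hypothetical values chosen for the example; they are not taken from this checkpoint, and the repository's own accounting may differ.

```python
def average_bpw(tensors):
    """Return total stored bits divided by total weight count.

    `tensors` is a list of (num_weights, bits_per_weight) pairs.
    """
    total_weights = sum(n for n, _ in tensors)
    total_bits = sum(n * bpw for n, bpw in tensors)
    return total_bits / total_weights


# Hypothetical sizes and costs, for illustration only.
example = [
    (3_000_000, 4.125),  # e.g. 4-bit codes plus one 16-bit scale per 128 weights
    (1_000_000, 3.0),    # e.g. a cheaper format on another tensor
]
print(average_bpw(example))
```

The point of the weighted mean is that a few tensors stored below 4 bits can offset the per-group overhead of the 4-bit tensors, which is how a mixed budget can land near a 4.0 BPW target.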
The checkpoint contains 316 language-model weight tensors:
| Format | Count |
|---|---|
| Groupwise INT4 | 301 |
| INT2 + binary residual | 14 |
| INT2 + error-budget k4 side channel | 1 |
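For readers unfamiliar with the most common format in the table, here is a minimal NumPy sketch of groupwise 4-bit quantization with a group size of 128. It assumes an asymmetric scheme with one scale and one offset per group; the function names are made up for this example, and the actual sub1quant encoding (including the INT2 residual and side-channel formats) is not shown and may work differently.

```python
import numpy as np


def quantize_groupwise_int4(w, group_size=128):
    """Asymmetric 4-bit quantization with one scale and offset per group.

    Assumes the number of weights is divisible by group_size.
    """
    groups = w.reshape(-1, group_size).astype(np.float32)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                    # 16 levels: codes 0..15
    scale = np.where(scale == 0.0, 1.0, scale)  # guard constant groups
    codes = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo


def dequantize_groupwise_int4(codes, scale, lo, shape):
    """Map codes back to floats and restore the original tensor shape."""
    return (codes.astype(np.float32) * scale + lo).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
codes, scale, lo = quantize_groupwise_int4(w)
w_hat = dequantize_groupwise_int4(codes, scale, lo, w.shape)
print(codes.shape, float(np.abs(w - w_hat).max()))
```

With round-to-nearest, the reconstruction error inside each group is bounded by half that group's scale, which is why smaller groups generally reduce error at the cost of more scale and offset metadata.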
Live Colab evaluation
Run date: 2026-06-29
Hardware/runtime: NVIDIA L4, CUDA, dense BF16 evaluation after applying the quantized weights.
| Run | Runtime dtype | WikiText tokens | Chunks | PPL |
|---|---|---|---|---|
| Unquantized google/gemma-4-E2B base | BF16 | 292,282 | 571 | 108.4542 |
| Mixed budget full g128 target 4.0 | BF16 dense eval after applying quantized weights | 292,282 | 571 | 107.5656 |
This supports a narrow claim: at about 4.00 BPW, perplexity is on par with the BF16 baseline (107.5656 vs. 108.4542) on this exact Gemma 4 / WikiText / Colab runner. It is not an FP16 result, not an FP8 comparison, and not a throughput result. The current evaluator reconstructs the weights and applies them to a normal dense model, so it checks correctness only.
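The numbers in the table can be sanity-checked with a little arithmetic. The sketch below assumes the evaluator walks the token stream in windows of `max_length` tokens stepped by `stride`, and that perplexity is the exponential of the token-weighted mean negative log-likelihood; both are common conventions, not details confirmed from `scripts/limited_ppl_bench.py`. The helper names are hypothetical.

```python
import math


def num_chunks(total_tokens, max_length, stride):
    """Number of windows needed to cover total_tokens when stepping by stride."""
    if total_tokens <= max_length:
        return 1
    return 1 + math.ceil((total_tokens - max_length) / stride)


def perplexity(chunk_nll_sums, chunk_token_counts):
    """exp of the token-weighted mean negative log-likelihood (natural log)."""
    return math.exp(sum(chunk_nll_sums) / sum(chunk_token_counts))


# 292,282 tokens with max length 512 and stride 512, as in the table above.
print(num_chunks(292_282, 512, 512))

# Relative perplexity difference of the quantized run versus the base run.
rel = (107.5656 - 108.4542) / 108.4542
print(f"{rel:.2%}")
```

Under these assumptions the window count comes out to 571, matching the Chunks column, and the quantized run differs from the base run by under one percent in perplexity.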
Result files:
- eval_results/mixed_budget_full_g128_target4p0_ppl_live.json
- eval_results/base_full_ppl_live.json
- eval_results/mixed_budget_live_colab_comparison.json
- eval_results/mixed_budget_scan_full_g128_target4p0.json
Reproduce
pip install "transformers>=5.5.0" torch accelerate safetensors huggingface_hub
python -c "from huggingface_hub import snapshot_download; snapshot_download('google/gemma-4-E2B', local_dir='./models/gemma-4-E2B')"
python scripts/limited_ppl_bench.py \
--label mixed_budget_full_g128_target4p0 \
--model-dir models/gemma-4-E2B \
--wikitext data/wiki.test.txt \
--quantized-pt quantized/gemma_mixed_budget_full_g128_target4p0.pt \
--tokens 1000000000 \
--max-length 512 \
--stride 512 \
--device cuda \
--output eval_results/mixed_budget_full_g128_target4p0_ppl_live.json
License
The quantization code and metadata in this repository are Apache-2.0. The base model remains governed by Google's Gemma license.