gemma-4-E2B-it-NVFP4 (draft, text path validated)
Why
I've been trying to get as much performance as possible out of a single DGX Spark with Gemma 4 models. In my local setup, I can run LilaRest/gemma-4-31B-it-NVFP4-turbo at about 8-9 tok/sec on an OpenClaw benchmark suite I built for the Spark.
My hope was that if the Gemma 4 E2B model was accurate enough, I could use it as a draft model and materially improve generation speed when paired with the 31B model. After getting the E2B draft + 31B target setup running in memory on the Spark with a 128k context window, I got a perfect score on my benchmark suite and about 18-20 tok/sec. That was better than I expected, so I'm sharing it here for others to test and build on.
This release is focused on the text serving path used in that validated setup.
What this is
This repository contains an NVFP4-quantized draft/speculative model derived directly from:
- Base model: google/gemma-4-E2B-it
It was prepared to be used as a speculative decoding draft model with:
- Validated target model: LilaRest/gemma-4-31B-it-NVFP4-turbo
Important distinction:
- google/gemma-4-E2B-it is the source/base model
- LilaRest/gemma-4-31B-it-NVFP4-turbo is the validated pairing/target model
- This repository is not derived from the LilaRest 31B model
Status
- ✅ Validated for the text serving path with the 31B NVFP4 target model above
- ✅ Validated on DGX Spark in the known-good 128k setup
- ⚠️ Vision support is not included in this release path yet
- ⚠️ This has not been broadly validated across many hardware/software stacks
Intended pairing (important)
Use this draft model specifically with:
LilaRest/gemma-4-31B-it-NVFP4-turbo
Using a different target model is untested and may fail or regress.
Quantization
This model uses ModelOpt NVFP4 quantization.
Authoritative quantization metadata is in config.json:
- quant_algo: NVFP4
- quant_method: modelopt
About the Hugging Face Safetensors "Tensor type" display
If the Hugging Face UI shows tensor storage types such as:
- BF16
- F8_E4M3
- U8
that does not mean this is a BF16 model instead of an NVFP4 model.
That display reflects the underlying stored tensor dtypes and auxiliary quantization data present in the checkpoint files. The high-level quantization scheme for this repository is still NVFP4, as defined in the quantization_config section of config.json.
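One way to confirm the scheme is to read the metadata directly instead of relying on the tensor-type display. The following is a minimal sketch, assuming config.json has been downloaded locally and that both fields sit under a top-level quantization_config key as described above. The helper names is_nvfp4_modelopt and load_config are illustrative and not part of any library.

```python
import json
from pathlib import Path


def is_nvfp4_modelopt(config: dict) -> bool:
    """Return True if the config declares ModelOpt NVFP4 quantization.

    Assumption: the fields live under a top-level "quantization_config" key.
    """
    quant = config.get("quantization_config") or {}
    return (
        quant.get("quant_algo") == "NVFP4"
        and quant.get("quant_method") == "modelopt"
    )


def load_config(path: str) -> dict:
    """Load a locally downloaded config.json (the path is up to the caller)."""
    return json.loads(Path(path).read_text())


# Inline sample so the sketch runs without downloading anything.
sample = {"quantization_config": {"quant_algo": "NVFP4", "quant_method": "modelopt"}}
print(is_nvfp4_modelopt(sample))
```

With a real checkout, is_nvfp4_modelopt(load_config("config.json")) performs the same check against the downloaded file.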
Benchmark summary (validated setup)
From local openclaw-bench runs:
Speculative draft setup - 31B target + this E2B draft, 128k context, single-sequence bring-up:
- Aggregate score: 1.00
- Checks: 66/66 passed
- Avg throughput: 18.05 tok/s
- Avg latency: 3465 ms
Non-speculative baseline runs of the 31B model alone:
- Earlier plain baseline run: 8.79 tok/s, 19871 ms
- Upstream Gemma 4 tool template + Gemma 4 reasoning parser, thinking disabled: 9.02 tok/s, 19359 ms
The stronger 31B-only baseline above used:
- LilaRest/gemma-4-31B-it-NVFP4-turbo
- max-model-len 131072
- max-num-seqs 4
- gpu-memory-utilization 0.80
- kv-cache-dtype fp8
- prefix caching enabled
- auto tool choice enabled
- tool-call-parser gemma4
- reasoning-parser gemma4
- upstream Gemma 4 chat template with thinking disabled
Observed result in this environment: roughly 2x throughput vs the straight 31B baseline.
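The "roughly 2x" figure follows directly from the throughput numbers listed above. A small sketch of the arithmetic, using only the values reported in this summary (the variable names are illustrative):

```python
# Throughput figures copied from the benchmark summary above (tok/s).
spec_tps = 18.05
baselines = {
    "earlier plain baseline": 8.79,
    "upstream template, thinking disabled": 9.02,
}


def speedup(spec: float, base: float) -> float:
    """Ratio of speculative throughput to baseline throughput."""
    return spec / base


speedups = {name: speedup(spec_tps, tps) for name, tps in baselines.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

This works out to about 2.05x against the earlier plain baseline and about 2.00x against the stronger baseline.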
Exact vLLM launch approach
This is the exact pattern I used to get the E2B + 31B combo up and running.
First, download the Gemma 4 tool/chat template used in validation:
mkdir -p templates
curl -L https://huggingface.co/mikeytag/gemma-4-E2B-it-NVFP4/resolve/main/templates/gemma4-tool-upstream.jinja \
-o templates/gemma4-tool-upstream.jinja
Then launch vLLM:
docker run -d --gpus all \
--name lilarest-nvfp4-vllm \
-p 8000:8000 \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-v "$(pwd)/templates:/templates:ro" \
--entrypoint /bin/bash \
vllm/vllm-openai:cu130-nightly -lc '
set -euo pipefail
pip install -q "transformers>=5.5.0"
python3 - <<"PY"
from pathlib import Path
p = Path("/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py")
text = p.read_text()
old = "except AssertionError:\n return False"
new = "except (AssertionError, ValueError):\n return False"
if old in text and new not in text:
    p.write_text(text.replace(old, new, 1))
p = Path("/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py")
text = p.read_text()
needle = " tgt_kv_cache_group.layer_names.append(layer_name)\n\n if runner_only_attn_layers is not None:\n"
insert = """ tgt_kv_cache_group.layer_names.append(layer_name)\n\n if isinstance(tgt_kv_cache_group.kv_cache_spec, UniformTypeKVCacheSpecs):\n tgt_kv_cache_group.kv_cache_spec.kv_cache_specs[layer_name] = (\n tgt_kv_cache_group.kv_cache_spec.kv_cache_specs[target_layer_name]\n )\n\n if runner_only_attn_layers is not None:\n"""
if needle in text and "kv_cache_specs[layer_name]" not in text:
    p.write_text(text.replace(needle, insert, 1))
PY
SPEC_JSON=$(python3 - <<"PY"
import json
print(json.dumps({
"model": "mikeytag/gemma-4-E2B-it-NVFP4",
"num_speculative_tokens": 4,
"draft_tensor_parallel_size": 1,
"quantization": "modelopt"
}))
PY
)
EXTRA_ARGS=()
if [[ -f /templates/gemma4-tool-upstream.jinja ]]; then
EXTRA_ARGS+=(
--chat-template-content-format openai
--default-chat-template-kwargs "{\"enable_thinking\": false}"
--chat-template /templates/gemma4-tool-upstream.jinja
)
fi
exec vllm serve "LilaRest/gemma-4-31B-it-NVFP4-turbo" \
--host 0.0.0.0 \
--port 8000 \
--quantization modelopt \
--max-model-len 131072 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.80 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--disable-hybrid-kv-cache-manager \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
"${EXTRA_ARGS[@]}" \
--speculative-config "$SPEC_JSON"
'
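Once the container is up, requests go to the 31B target model; the draft model is used by the server internally and is not addressed by clients. The following is a hedged sketch of a client for vLLM's OpenAI-compatible chat endpoint, assuming the server is reachable on localhost port 8000 as mapped in the launch command and that the served model name defaults to the target model path. The helper names build_chat_request and ask, and the max_tokens default, are illustrative choices.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches -p 8000:8000 in the launch command
TARGET_MODEL = "LilaRest/gemma-4-31B-it-NVFP4-turbo"  # assumed served model name


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": TARGET_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the assistant text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling ask("What is the capital of France?") after the server finishes loading should return a plain text answer. If raw channel markers appear in the reply, the chat template arguments were probably not picked up.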
Caveats
- Runtime currently depends on two hotfixes applied to vLLM at startup:
  - catch ValueError in vllm/v1/core/kv_cache_utils.py
  - copy KV spec entries for KV-sharing layers in vllm/v1/worker/utils.py
- Requires --quantization modelopt and a matching draft speculative config
- draft_tensor_parallel_size is set to 1 in the validated configuration
- This release is focused on the text inference path
- If you omit the Gemma 4 chat template arguments above, you may see raw markers like <|channel|>thought leak into responses
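Because the startup hotfixes are applied by string replacement, they can silently do nothing if the installed vLLM source no longer matches. A minimal sketch of a check, assuming the marker strings below (taken from the replacement text in the launch script) are enough to tell a patched file from an unpatched one. The function name hotfix_status is hypothetical, and the demo runs against a temporary directory rather than a real vLLM install.

```python
import tempfile
from pathlib import Path

# Marker strings taken from the replacement text in the launch script.
HOTFIX_MARKERS = {
    "vllm/v1/core/kv_cache_utils.py": "except (AssertionError, ValueError):",
    "vllm/v1/worker/utils.py": "kv_cache_specs[layer_name]",
}


def hotfix_status(site_packages: Path) -> dict[str, bool]:
    """Map each patched file to whether its marker string is present."""
    status = {}
    for rel_path, marker in HOTFIX_MARKERS.items():
        target = site_packages / rel_path
        status[rel_path] = target.is_file() and marker in target.read_text()
    return status


# Demo against a throwaway directory: one file patched, the other missing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    patched = root / "vllm/v1/core/kv_cache_utils.py"
    patched.parent.mkdir(parents=True)
    patched.write_text("except (AssertionError, ValueError):\n    return False\n")
    demo_status = hotfix_status(root)

print(demo_status)
```

Inside the container, the same function could be pointed at /usr/local/lib/python3.12/dist-packages, the path used by the launch script, before starting the server.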
Limitations
- Not evaluated as a standalone chat model
- Pairing is intentionally narrow and validated only with the 31B target model listed above
- Validation evidence is from local DGX Spark runs
- Portability to other hardware/software stacks is not guaranteed
Roadmap
- Publish a vision-capable draft path once validated
- Remove or replace the runtime hotfix dependency with upstream vLLM support
- Expand compatibility testing across additional vLLM/container versions
- Add more public benchmark slices by prompt shape, latency, and throughput