gemma-4-E2B-it-NVFP4 (draft, text path validated)

Why

I've been trying to get as much performance as possible out of a single DGX Spark with Gemma 4 models. In my local setup, I can run LilaRest/gemma-4-31B-it-NVFP4-turbo at about 8-9 tok/sec on an OpenClaw benchmark suite I built for the Spark.

My hope was that if the Gemma 4 E2B model was accurate enough, I could use it as a draft model and materially improve generation speed when paired with the 31B model. After getting the E2B draft + 31B target setup running in memory on the Spark with a 128k context window, I got a perfect score on my benchmark suite and about 18-20 tok/sec. That was better than I expected, so I'm sharing it here for others to test and build on.

This release is focused on the text serving path used in that validated setup.

What this is

This repository contains an NVFP4-quantized draft/speculative model derived directly from:

Base model: google/gemma-4-E2B-it

It was prepared to be used as a speculative decoding draft model with:

Validated target model: LilaRest/gemma-4-31B-it-NVFP4-turbo

Important distinction:

google/gemma-4-E2B-it is the source/base model
LilaRest/gemma-4-31B-it-NVFP4-turbo is the validated pairing/target model
This repository is not derived from the LilaRest 31B model

Status

✅ Validated for the text serving path with the 31B NVFP4 target model above
✅ Validated on DGX Spark in the known-good 128k setup
⚠️ Vision support is not included in this release path yet
⚠️ This has not been broadly validated across many hardware/software stacks

Intended pairing (important)

Use this draft model specifically with:

LilaRest/gemma-4-31B-it-NVFP4-turbo

Using a different target model is untested and may fail or regress.

Quantization

This model uses ModelOpt NVFP4 quantization.

Authoritative quantization metadata is in config.json, under quantization_config:

quant_algo: NVFP4
quant_method: modelopt

About the Hugging Face Safetensors "Tensor type" display

If the Hugging Face UI shows tensor storage types such as:

BF16
F8_E4M3
U8

that does not mean this is a BF16 model instead of an NVFP4 model.

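That display reflects the dtypes of the tensors stored in the checkpoint files, including auxiliary quantization data. In NVFP4 checkpoints, packed 4-bit weights are typically stored as U8 and block scale factors as F8_E4M3, while unquantized tensors stay in BF16. The high-level quantization scheme for this repository is still NVFP4, as defined in quantization_config.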
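As a quick sanity check, the quantization scheme can be read from the checkpoint's config instead of being inferred from tensor dtypes. The sketch below assumes the two fields sit under a `quantization_config` key in config.json. The helper name `describe_quantization` and the sample dictionary are illustrative and not taken from the repository.

```python
import json
from typing import Optional, Tuple


def describe_quantization(config: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (quant_method, quant_algo) from a parsed config.json dict."""
    # Assumption: both fields live under the "quantization_config" key.
    qcfg = config.get("quantization_config") or {}
    return qcfg.get("quant_method"), qcfg.get("quant_algo")


# Illustrative stand-in for the relevant part of config.json.
sample_config = json.loads(
    '{"quantization_config": {"quant_algo": "NVFP4", "quant_method": "modelopt"}}'
)

method, algo = describe_quantization(sample_config)
print(f"quant_method={method}, quant_algo={algo}")
```

To check a downloaded checkpoint, pass `json.loads(Path("config.json").read_text())` instead of the sample dictionary.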

Benchmark summary (validated setup)

From local openclaw-bench runs:

Speculative draft setup (31B target + this E2B draft, 128k context, single-sequence bring-up):
- Aggregate score: 1.00
- Checks: 66/66 passed
- Avg throughput: 18.05 tok/s
- Avg latency: 3465 ms

Non-speculative baseline runs (31B alone):
- Earlier plain baseline run: 8.79 tok/s, 19871 ms
- Upstream Gemma 4 tool template + Gemma 4 reasoning parser, thinking disabled: 9.02 tok/s, 19359 ms

The stronger 31B-only baseline above used:

LilaRest/gemma-4-31B-it-NVFP4-turbo
max-model-len 131072
max-num-seqs 4
gpu-memory-utilization 0.80
kv-cache-dtype fp8
prefix caching enabled
auto tool choice enabled
tool-call-parser gemma4
reasoning-parser gemma4
upstream Gemma 4 chat template with thinking disabled

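Observed result in this environment: roughly 2x throughput over the 31B-only baselines (18.05 tok/s vs 8.79 to 9.02 tok/s). Note that the stronger baseline ran with max-num-seqs 4, while the speculative setup was launched with max-num-seqs 1.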
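The ratio can be recomputed from the figures in the benchmark summary. This is only arithmetic on the numbers reported above. The dictionary layout and the `ratios` helper are illustrative.

```python
# Figures copied from the benchmark summary above.
spec = {"tok_s": 18.05, "latency_ms": 3465}
baselines = {
    "earlier plain baseline": {"tok_s": 8.79, "latency_ms": 19871},
    "upstream template, thinking disabled": {"tok_s": 9.02, "latency_ms": 19359},
}


def ratios(spec_run: dict, base_run: dict) -> tuple:
    """Return (throughput gain, latency reduction factor) of spec over baseline."""
    throughput_gain = spec_run["tok_s"] / base_run["tok_s"]
    latency_reduction = base_run["latency_ms"] / spec_run["latency_ms"]
    return throughput_gain, latency_reduction


results = {name: ratios(spec, base) for name, base in baselines.items()}

for name, (tp, lat) in results.items():
    print(f"{name}: {tp:.2f}x throughput, {lat:.2f}x lower average latency")
```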

Exact vLLM launch approach

This is the exact pattern I used to get the E2B + 31B combo up and running.

First, download the Gemma 4 tool/chat template used in validation:

mkdir -p templates
curl -L https://huggingface.co/mikeytag/gemma-4-E2B-it-NVFP4/resolve/main/templates/gemma4-tool-upstream.jinja \
  -o templates/gemma4-tool-upstream.jinja

Then launch vLLM:

docker run -d --gpus all \
  --name lilarest-nvfp4-vllm \
  -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd)/templates:/templates:ro" \
  --entrypoint /bin/bash \
  vllm/vllm-openai:cu130-nightly -lc '
    set -euo pipefail
    pip install -q "transformers>=5.5.0"

    python3 - <<"PY"
from pathlib import Path

p = Path("/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py")
text = p.read_text()
old = "except AssertionError:\n        return False"
new = "except (AssertionError, ValueError):\n        return False"
if old in text and new not in text:
    p.write_text(text.replace(old, new, 1))

p = Path("/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py")
text = p.read_text()
needle = "        tgt_kv_cache_group.layer_names.append(layer_name)\n\n        if runner_only_attn_layers is not None:\n"
insert = """        tgt_kv_cache_group.layer_names.append(layer_name)\n\n        if isinstance(tgt_kv_cache_group.kv_cache_spec, UniformTypeKVCacheSpecs):\n            tgt_kv_cache_group.kv_cache_spec.kv_cache_specs[layer_name] = (\n                tgt_kv_cache_group.kv_cache_spec.kv_cache_specs[target_layer_name]\n            )\n\n        if runner_only_attn_layers is not None:\n"""
if needle in text and "kv_cache_specs[layer_name]" not in text:
    p.write_text(text.replace(needle, insert, 1))
PY

    SPEC_JSON=$(python3 - <<"PY"
import json
print(json.dumps({
  "model": "mikeytag/gemma-4-E2B-it-NVFP4",
  "num_speculative_tokens": 4,
  "draft_tensor_parallel_size": 1,
  "quantization": "modelopt"
}))
PY
)

    EXTRA_ARGS=()
    if [[ -f /templates/gemma4-tool-upstream.jinja ]]; then
      EXTRA_ARGS+=(
        --chat-template-content-format openai
        --default-chat-template-kwargs "{\"enable_thinking\": false}"
        --chat-template /templates/gemma4-tool-upstream.jinja
      )
    fi

    exec vllm serve "LilaRest/gemma-4-31B-it-NVFP4-turbo" \
      --host 0.0.0.0 \
      --port 8000 \
      --quantization modelopt \
      --max-model-len 131072 \
      --max-num-seqs 1 \
      --gpu-memory-utilization 0.80 \
      --kv-cache-dtype fp8 \
      --enable-prefix-caching \
      --disable-hybrid-kv-cache-manager \
      --trust-remote-code \
      --enable-auto-tool-choice \
      --tool-call-parser gemma4 \
      "${EXTRA_ARGS[@]}" \
      --speculative-config "$SPEC_JSON"
  '
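Once the container is up, requests go to the 31B target and the draft model is used internally by vLLM. A minimal client sketch using only the Python standard library follows. It assumes the server listens on localhost:8000 as in the command above, and that vLLM exposes the model under the name passed to `vllm serve`. The helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port taken from the launch command above
# Assumption: the served model name defaults to the path given to `vllm serve`.
MODEL = "LilaRest/gemma-4-31B-it-NVFP4-turbo"


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should return a plain answer. The first request may be slow while the models finish loading.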

Caveats

Runtime currently depends on two hotfixes applied to vLLM at startup:
- catch ValueError in vllm/v1/core/kv_cache_utils.py
- copy KV spec entries for KV-sharing layers in vllm/v1/worker/utils.py

Both hotfixes are string replacements against the vllm/vllm-openai:cu130-nightly image and are skipped silently if the expected source text is not found
Requires --quantization modelopt and a matching draft speculative config
draft_tensor_parallel_size is set to 1 in the validated configuration
This release is focused on the text inference path
If you omit the Gemma 4 chat template arguments above, you may see raw markers like <|channel|>thought leak into responses
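The startup hotfixes listed above follow a guarded string-replacement pattern. The sketch below reproduces that pattern on an in-memory string to show two properties: applying it twice changes nothing the second time, and it does nothing when the expected text is absent. The `apply_patch_once` helper and the `sample` source text are illustrative stand-ins, not the actual vLLM code.

```python
def apply_patch_once(text: str, old: str, new: str):
    """Replace the first occurrence of `old` with `new` unless already patched.

    Returns (resulting_text, changed).
    """
    if old in text and new not in text:
        return text.replace(old, new, 1), True
    return text, False


# Same old/new strings as the first hotfix in the launch script.
OLD = "except AssertionError:\n        return False"
NEW = "except (AssertionError, ValueError):\n        return False"

# Hypothetical stand-in for the file being patched.
sample = (
    "    try:\n"
    "        check()\n"
    "    except AssertionError:\n"
    "        return False\n"
)

patched, changed = apply_patch_once(sample, OLD, NEW)
again, changed_again = apply_patch_once(patched, OLD, NEW)
unrelated, changed_unrelated = apply_patch_once("def f():\n    pass\n", OLD, NEW)

print(changed, changed_again, changed_unrelated)
```

Because a missing match is not an error, it can be worth logging the `changed` flag when adapting this pattern to a different vLLM image.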

Limitations

Not evaluated as a standalone chat model
Pairing is intentionally narrow and validated only with the 31B target model listed above
Validation evidence is from local DGX Spark runs
Portability to other hardware/software stacks is not guaranteed

Roadmap

Publish a vision-capable draft path once validated
Remove or replace the runtime hotfix dependency with upstream vLLM support
Expand compatibility testing across additional vLLM/container versions
Add more public benchmark slices by prompt shape, latency, and throughput
