gemma-4-E2B-it-NVFP4 (draft, text path validated)

Why

I've been trying to get as much performance as possible out of a single DGX Spark with Gemma 4 models. In my local setup, I can run LilaRest/gemma-4-31B-it-NVFP4-turbo at about 8-9 tok/sec on an OpenClaw benchmark suite I built for the Spark.

My hope was that if the Gemma 4 E2B model was accurate enough, I could use it as a draft model and materially improve generation speed when paired with the 31B model. After getting the E2B draft + 31B target setup running in memory on the Spark with a 128k context window, I got a perfect score on my benchmark suite and about 18-20 tok/sec. That was better than I expected, so I'm sharing it here for others to test and build on.

This release is focused on the text serving path used in that validated setup.

What this is

This repository contains an NVFP4-quantized draft/speculative model derived directly from:

It was prepared to be used as a speculative decoding draft model with:

Important distinction:

  • google/gemma-4-E2B-it is the source/base model
  • LilaRest/gemma-4-31B-it-NVFP4-turbo is the validated pairing/target model
  • This repository is not derived from the LilaRest 31B model

Status

  • ✅ Validated for the text serving path with the 31B NVFP4 target model above
  • ✅ Validated on DGX Spark in the known-good 128k setup
  • ⚠️ Vision support is not included in this release path yet
  • ⚠️ This has not been broadly validated across many hardware/software stacks

Intended pairing (important)

Use this draft model specifically with:

  • LilaRest/gemma-4-31B-it-NVFP4-turbo

Using a different target model is untested and may fail or regress.

Quantization

This model uses ModelOpt NVFP4 quantization.

Authoritative quantization metadata is in config.json:

  • quant_algo: NVFP4
  • quant_method: modelopt

About the Hugging Face Safetensors "Tensor type" display

If the Hugging Face UI shows tensor storage types such as:

  • BF16
  • F8_E4M3
  • U8

that does not mean this is a BF16 model instead of an NVFP4 model.

That display reflects the underlying stored tensor dtypes and auxiliary quantization data present in the checkpoint files. The high-level quantization scheme for this repository is still NVFP4, as defined in quantization_config.

Benchmark summary (validated setup)

From local openclaw-bench runs:

  • Speculative draft setup - 31B target + this E2B draft, 128k context, single-sequence bring-up:

    • Aggregate score: 1.00
    • Checks: 66/66 passed
    • Avg throughput: 18.05 tok/s
    • Avg latency: 3465 ms
  • Non-spec baseline runs of 31B alone:

    • Earlier plain baseline run: 8.79 tok/s, 19871 ms
    • Upstream Gemma 4 tool template + Gemma 4 reasoning parser, thinking disabled: 9.02 tok/s, 19359 ms

The stronger 31B-only baseline above used:

  • LilaRest/gemma-4-31B-it-NVFP4-turbo
  • max-model-len 131072
  • max-num-seqs 4
  • gpu-memory-utilization 0.80
  • kv-cache-dtype fp8
  • prefix caching enabled
  • auto tool choice enabled
  • tool-call-parser gemma4
  • reasoning-parser gemma4
  • upstream Gemma 4 chat template with thinking disabled

Observed result in this environment: roughly 2x throughput vs the straight 31B baseline.

Exact vLLM launch approach

This is the exact pattern I used to get the E2B + 31B combo up and running.

First, download the Gemma 4 tool/chat template used in validation:

mkdir -p templates
curl -L https://huggingface.co/mikeytag/gemma-4-E2B-it-NVFP4/resolve/main/templates/gemma4-tool-upstream.jinja \
  -o templates/gemma4-tool-upstream.jinja

Then launch vLLM:

docker run -d --gpus all \
  --name lilarest-nvfp4-vllm \
  -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd)/templates:/templates:ro" \
  --entrypoint /bin/bash \
  vllm/vllm-openai:cu130-nightly -lc '
    set -euo pipefail
    pip install -q "transformers>=5.5.0"

    python3 - <<"PY"
from pathlib import Path

p = Path("/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py")
text = p.read_text()
old = "except AssertionError:\n        return False"
new = "except (AssertionError, ValueError):\n        return False"
if old in text and new not in text:
    p.write_text(text.replace(old, new, 1))

p = Path("/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py")
text = p.read_text()
needle = "        tgt_kv_cache_group.layer_names.append(layer_name)\n\n        if runner_only_attn_layers is not None:\n"
insert = """        tgt_kv_cache_group.layer_names.append(layer_name)\n\n        if isinstance(tgt_kv_cache_group.kv_cache_spec, UniformTypeKVCacheSpecs):\n            tgt_kv_cache_group.kv_cache_spec.kv_cache_specs[layer_name] = (\n                tgt_kv_cache_group.kv_cache_spec.kv_cache_specs[target_layer_name]\n            )\n\n        if runner_only_attn_layers is not None:\n"""
if needle in text and "kv_cache_specs[layer_name]" not in text:
    p.write_text(text.replace(needle, insert, 1))
PY

    SPEC_JSON=$(python3 - <<"PY"
import json
print(json.dumps({
  "model": "mikeytag/gemma-4-E2B-it-NVFP4",
  "num_speculative_tokens": 4,
  "draft_tensor_parallel_size": 1,
  "quantization": "modelopt"
}))
PY
)

    EXTRA_ARGS=()
    if [[ -f /templates/gemma4-tool-upstream.jinja ]]; then
      EXTRA_ARGS+=(
        --chat-template-content-format openai
        --default-chat-template-kwargs "{\"enable_thinking\": false}"
        --chat-template /templates/gemma4-tool-upstream.jinja
      )
    fi

    exec vllm serve "LilaRest/gemma-4-31B-it-NVFP4-turbo" \
      --host 0.0.0.0 \
      --port 8000 \
      --quantization modelopt \
      --max-model-len 131072 \
      --max-num-seqs 1 \
      --gpu-memory-utilization 0.80 \
      --kv-cache-dtype fp8 \
      --enable-prefix-caching \
      --disable-hybrid-kv-cache-manager \
      --trust-remote-code \
      --enable-auto-tool-choice \
      --tool-call-parser gemma4 \
      "${EXTRA_ARGS[@]}" \
      --speculative-config "$SPEC_JSON"
  '

Caveats

  • Runtime currently depends on two hotfixes applied to vLLM at startup:
    • catch ValueError in vllm/v1/core/kv_cache_utils.py
    • copy KV spec entries for KV-sharing layers in vllm/v1/worker/utils.py
  • Requires --quantization modelopt and matching draft speculative config
  • draft_tensor_parallel_size is set to 1 in the validated configuration
  • This release is focused on the text inference path
  • If you omit the Gemma 4 chat template arguments above, you may see raw markers like <|channel|>thought leak into responses

Limitations

  • Not evaluated as a standalone chat model
  • Pairing is intentionally narrow and validated only with the 31B target model listed above
  • Validation evidence is from local DGX Spark runs
  • Portability to other hardware/software stacks is not guaranteed

Roadmap

  • Publish a vision-capable draft path once validated
  • Remove or replace the runtime hotfix dependency with upstream vLLM support
  • Expand compatibility testing across additional vLLM/container versions
  • Add more public benchmark slices by prompt shape, latency, and throughput
Downloads last month
608
Safetensors
Model size
4B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mikeytag/gemma-4-E2B-it-NVFP4

Quantized
(199)
this model