Instructions to use SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic with libraries and local apps.

Libraries

How to use SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below include an image, so the multimodal chat task is used;
# "text-generation" does not accept image content or the text= keyword.
pipe = pipeline("image-text-to-text", model="SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic")
model = AutoModelForMultimodalLM.from_pretrained("SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic

SGLang

How to use SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic with Docker Model Runner:
```
docker model run hf.co/SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic
```

Gemma 4 26B-A4B Instruct — FP8 Dynamic

FP8 (E4M3) dynamic quantization of google/gemma-4-26B-A4B-it, stored in the compressed-tensors format. Produced as an in-house build for full checkpoint provenance (supplier-assurance / audit), as an alternative to third-party prebuilt checkpoints.

Weights: static per-channel FP8 (E4M3).
Activations: per-token dynamic FP8 — no calibration data required.
Kept at original precision (BF16): MoE router/gate, token embeddings, lm_head, all norms, and the vision tower (this is a text-only serving checkpoint).
MoE experts: quantized per-expert (experts.{i}.{gate,up,down}_proj), the standard compressed-tensors MoE layout.
Size: ~26 GB (vs ~49 GB BF16).
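
The difference between static per-channel weight scales and dynamic per-token activation scales can be illustrated with a minimal sketch. It is not the llm-compressor or vLLM implementation: the matrices, their shapes, and the helper names `row_scales` and `scale_rows` are hypothetical, and the sketch only computes scales (using 448, the largest finite E4M3 magnitude) without performing the actual FP8 cast.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def row_scales(matrix):
    """Return one scale per row so that row / scale fits in [-448, 448]."""
    scales = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        # Guard against all-zero rows so the later division is safe.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(matrix, scales):
    """Divide each row by its scale; a real kernel would then cast to FP8."""
    return [[v / s for v in row] for row, s in zip(matrix, scales)]


# Static per-channel weight scales: computed once, offline, one per output
# channel (rows of a hypothetical [out_features, in_features] matrix).
weights = [[0.5, -2.0, 1.0], [0.1, 0.05, -0.2]]
weight_scales = row_scales(weights)

# Dynamic per-token activation scales: recomputed at runtime for each token
# (rows of a hypothetical [tokens, hidden] matrix), so no calibration set
# is needed to estimate activation ranges ahead of time.
activations = [[10.0, -448.0, 3.0], [0.25, 0.5, -0.125]]
activation_scales = row_scales(activations)
scaled_activations = scale_rows(activations, activation_scales)
```

Because the activation scales depend only on the values seen at inference time, the weights are the only tensors whose scales have to be stored in the checkpoint.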

Why FP8 (and not FP4 / NVFP4)

The target hardware is the NVIDIA L40S (Ada, SM 8.9), which has native FP8 Tensor Cores but no native FP4 support. FP8 runs on the fast native path on Ada, Hopper, and Blackwell GPUs, and the compressed-tensors checkpoint is hardware-portable.

Quantization recipe

Built with llm-compressor using the data-free model_free_ptq entry point:

from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="google/gemma-4-26B-A4B-it",
    save_directory="gemma-4-26B-A4B-it-FP8-Dynamic",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*embed.*", "re:.*router", "re:.*vision_tower.*", "re:.*norm.*"],
)

Note: re:.*norm.* is required for Gemma 4 because some norms use a numeric suffix (e.g. post_feedforward_layernorm_1) that escapes the default "ends-with-norm" auto-ignore and would otherwise be (incorrectly) targeted.
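
The effect described in the note can be reproduced with plain Python. This is a sketch under stated assumptions: the module names are hypothetical (modelled on the example in the note), the "ends-with-norm" rule is approximated with `str.endswith`, and the `re:` pattern is assumed to be applied with `re.match` semantics.

```python
import re

# Hypothetical module names, modelled on the example in the note.
module_names = [
    "model.layers.0.input_layernorm",
    "model.layers.0.post_feedforward_layernorm_1",
    "model.layers.0.mlp.experts.0.gate_proj",
]


def ends_with_norm(name):
    """Approximation of a default rule that only skips names ending in 'norm'."""
    return name.endswith("norm")


def matches_ignore_pattern(name, pattern=r".*norm.*"):
    """Assumption: 're:' ignore patterns are matched from the start of the name."""
    return re.match(pattern, name) is not None


for name in module_names:
    print(f"{name}: ends_with_norm={ends_with_norm(name)}, "
          f"regex_ignored={matches_ignore_pattern(name)}")
```

The suffixed norm fails the ends-with check but is caught by the explicit regex, while the expert projection is matched by neither and stays eligible for quantization.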

Usage (vLLM)

The compressed-tensors format is auto-detected — do not pass --quantization. Requires an upstream vLLM with Gemma 4 + compressed-tensors MoE support.

vllm serve SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic \
  --served-model-name gemma \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice --tool-call-parser gemma4

Gemma 4 supports tool calling and a thinking channel (enable_thinking); enable the matching parsers as above.
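
A client request against the server started above might look like the following sketch. It assumes the server listens on vLLM's default port 8000, that the model is addressed by the served name `gemma`, and that the chat template honours an `enable_thinking` flag passed through vLLM's `chat_template_kwargs` request field; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking, base_url="http://localhost:8000"):
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    payload = {
        "model": "gemma",  # matches --served-model-name in the command above
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the Gemma 4 chat template reads this flag.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request("What is the capital of France?", enable_thinking=True)
print(request.full_url)
# With the server running, the reply could be fetched like this:
# reply = send(request)
# print(reply["choices"][0]["message"])
```

With a reasoning parser enabled, the server would normally return the thinking text in a field separate from the final answer; the exact field name is worth checking against the vLLM version in use.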

Validation

Checkpoint structure (keys / dtypes / shapes) matches the reference RedHatAI/gemma-4-26B-A4B-it-FP8-dynamic build.
Quantization integrity verified: experts are F8_E4M3 with per-channel weight_scale; router/norms/embeddings/lm_head left in BF16.
Not yet benchmarked for quality regression vs BF16. Run your own eval (e.g. a task-relevant benchmark) before production use.
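
The dtype check described above can be approximated without loading any tensor data, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header listing each tensor's dtype and shape. The sketch below is not the procedure used for this checkpoint: the helper names are hypothetical, and the expert tensor naming (`.experts.` in the path, names ending in `_proj.weight`) is an assumption based on the layout described earlier.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Parse the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def dtype_counts(header):
    """Count tensors per dtype string, e.g. 'BF16' or 'F8_E4M3'."""
    return Counter(entry["dtype"] for entry in header.values())


def unquantized_expert_weights(header):
    """List expert projection weights that are not stored as F8_E4M3.

    Assumption: expert weights contain '.experts.' and end with '_proj.weight'.
    """
    return [
        name
        for name, entry in header.items()
        if ".experts." in name
        and name.endswith("_proj.weight")
        and entry["dtype"] != "F8_E4M3"
    ]


# Usage, with a hypothetical shard file name:
# header = read_safetensors_header("model-00001-of-00006.safetensors")
# print(dtype_counts(header))
# print(unquantized_expert_weights(header))  # expected to be empty
```

This only inspects structure; it says nothing about output quality, which still needs the task-level evaluation recommended above.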

License

This checkpoint is a derivative of Google Gemma 4 and is therefore governed by the Gemma Terms of Use and the Gemma Prohibited Use Policy, under which the original model is distributed. This quantized checkpoint inherits those terms.

Safetensors

Model size: 26B params
Tensor types: BF16, F8_E4M3

Model tree for SilentEight/gemma-4-26B-A4B-it-FP8-Dynamic

Base model: google/gemma-4-26B-A4B
Finetuned: google/gemma-4-26B-A4B-it
Quantized (272): this model