Granite 4.1 (8B) – Q4NX for FastFlowLM on AMD Ryzen™ AI NPU (XDNA2 Only)

Model Summary

This is IBM's Granite 4.1 8B base model converted to Q4NX format for hardware-accelerated inference on AMD Ryzen™ AI processors with XDNA2 NPU, using the FastFlowLM engine.

Q4NX (Q4 NPU eXpress) is FastFlowLM's native packed quantization format — a rearranged Q4_1 layout tuned for the NPU's matrix engine tile sizes and memory access patterns. It is not a GGUF file.
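For readers unfamiliar with Q4_1, the sketch below shows the standard block scheme as used in ggml: weights are grouped into blocks of 32, and each block stores a scale d, a minimum m, and one 4-bit code per weight, so that a weight is recovered as d * q + m. This is only an illustration of plain Q4_1. The NPU-specific rearrangement that turns it into Q4NX is not documented here and is not reproduced, and the function names are hypothetical.

```python
# Sketch of plain Q4_1 block quantization (not the Q4NX on-disk layout).
# Real Q4_1 stores d and m as FP16; Python floats are used here for clarity.

BLOCK_SIZE = 32  # Q4_1 groups weights into blocks of 32


def quantize_q4_1(block):
    """Return (d, m, codes) such that block[i] is approximately d * codes[i] + m."""
    assert len(block) == BLOCK_SIZE
    lo, hi = min(block), max(block)
    d = (hi - lo) / 15.0  # 4 bits give 16 levels: 0..15
    m = lo
    if d == 0.0:  # constant block: every code is 0
        return d, m, [0] * BLOCK_SIZE
    codes = [min(15, max(0, round((x - m) / d))) for x in block]
    return d, m, codes


def dequantize_q4_1(d, m, codes):
    """Reconstruct approximate weights from one Q4_1 block."""
    return [d * q + m for q in codes]


# Example: 32 evenly spaced weights in [-1, 1]
weights = [-1.0 + 2.0 * i / (BLOCK_SIZE - 1) for i in range(BLOCK_SIZE)]
d, m, codes = quantize_q4_1(weights)
restored = dequantize_q4_1(d, m, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"d={d:.4f} m={m:.4f} max_err={max_err:.4f}")
```

The rounding error per weight is bounded by half a quantization step (d / 2), which is why the per-block d and m matter so much for accuracy.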

Granite-specific scale factors

Granite 4.1 applies four architecture-level scale factors that standard Llama does not have. They are baked directly into the Q4NX weights at conversion time, so no runtime changes to FastFlowLM are required:

| Factor | Value | Applied to |
|---|---|---|
| embedding_scale | 12.0 | embed_tokens (BF16, element-wise) |
| residual_scale | 0.22 | o_proj and down_proj per layer (scales Q4_1 d/m blocks) |
| logit_scale | 16.0 | lm_head (scales Q4_1 d/m blocks) |
| attn_compensation | 128^(-0.25) ≈ 0.2973 | q_proj and k_proj per layer; corrects 1/d vs 1/√d attention scale |
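Two pieces of arithmetic sit behind these factors, and a small sketch may make them easier to check. First, because a Q4_1 weight is d * q + m, multiplying d and m by a scalar s scales the dequantized weight by s without touching the 4-bit codes. Second, if the engine divides attention scores by √d while Granite expects division by d (with d = 128, as the table implies), scaling both q and k by d^(-0.25) supplies the missing 1/√d. The code below is an illustration under those assumptions, not the actual conversion script. All names and sample values are hypothetical, and FP16 rounding of d and m is ignored.

```python
import math


def dequantize(d, m, codes):
    """Plain Q4_1 reconstruction: w = d * q + m."""
    return [d * q + m for q in codes]


def fold_scale(d, m, s):
    """Fold scalar s into a block: s * (d*q + m) == (s*d)*q + (s*m)."""
    return s * d, s * m


# 1) Folding residual_scale into a (toy, 4-element) block
d, m, codes = 0.05, -0.4, [0, 3, 7, 15]
residual_scale = 0.22
d_scaled, m_scaled = fold_scale(d, m, residual_scale)
folded = dequantize(d_scaled, m_scaled, codes)
reference = [residual_scale * w for w in dequantize(d, m, codes)]

# 2) Attention compensation split across q_proj and k_proj
head_dim = 128                      # assumed to be the "128" in the table
comp = head_dim ** -0.25            # ≈ 0.2973
q = [0.5, -1.0, 2.0]                # toy query/key vectors
k = [1.5, 0.25, -0.75]

# What an engine using the usual 1/sqrt(d) scale computes on compensated q, k
engine_score = sum((comp * a) * (comp * b) for a, b in zip(q, k)) / math.sqrt(head_dim)
# What a 1/d attention scale would compute on the original q, k
target_score = sum(a * b for a, b in zip(q, k)) / head_dim

print(f"comp={comp:.4f} engine={engine_score:.6f} target={target_score:.6f}")
```

Splitting the correction evenly between q_proj and k_proj keeps the two projections at similar magnitudes, rather than pushing the whole factor of 1/√d into one of them.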

Requirements

  • FastFlowLM ≥ 0.1.8 with flm_version: "0.1.8" support
  • AMD Ryzen™ AI processor with XDNA2 (NPU2) — Strix Point (Ryzen AI 300 series) or later
  • XRT runtime installed

Quick start

flm serve --hf Atomic-Germ/Granite-4.1-8B-NPU2

Files

| File | Description |
|---|---|
| model.q4nx | Q4NX quantized weights (safetensors container) |
| config.json | FastFlowLM model configuration |
| tokenizer.json | Tiktoken-based tokenizer (100352 vocab) |
| tokenizer_config.json | Chat template and special tokens |
| *.xclbin | Precompiled NPU kernels (attn, layer, lm_head, mm, dequant) |

Performance (Ryzen AI 9 HX 370)

| Metric | Value |
|---|---|
| Prefill | ~1000–1400 tok/s |
| Decode | ~40–44 tok/s |
| Context window | 131072 tokens |
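To turn these throughput figures into a rough response time, the sketch below adds prefill time and decode time for a given request. The prompt and output lengths are hypothetical, and the estimate assumes the reported rates hold for the whole request, which may not be true at long context lengths.

```python
def estimate_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough end-to-end latency: prefill the prompt, then decode the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical request: 4096-token prompt, 256-token reply
prompt_tokens, output_tokens = 4096, 256

# Bounds taken from the reported prefill and decode ranges
best = estimate_seconds(prompt_tokens, output_tokens, prefill_tps=1400, decode_tps=44)
worst = estimate_seconds(prompt_tokens, output_tokens, prefill_tps=1000, decode_tps=40)

print(f"Estimated latency: {best:.1f} s to {worst:.1f} s")
```

For a request of this shape, decode dominates: generating 256 tokens takes longer than ingesting the 4096-token prompt.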

License

The converted Q4NX weights are a derivative of ibm-granite/granite-4.1-8b-base and are distributed under the same Apache 2.0 license.

Citation

@misc{granite2024,
  title={Granite 4.1: Open Foundation Models},
  author={IBM Research},
  year={2024},
  url={https://huggingface.co/ibm-granite/granite-4.1-8b-base}
}