Granite-4.1-8B-Function-Calling-NPU2

Granite 4.1 (8B) – Q4NX for FastFlowLM on AMD Ryzen™ AI NPU (XDNA2 Only)

Model Summary

This is IBM's Granite 4.1 8B base model converted to Q4NX format for hardware-accelerated inference on AMD Ryzen™ AI processors with XDNA2 NPU, using the FastFlowLM engine.

Q4NX (Q4 NPU eXpress) is FastFlowLM's native packed quantization format — a rearranged Q4_1 layout tuned for the NPU's matrix engine tile sizes and memory access patterns. It is not a GGUF file.
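
For readers unfamiliar with Q4_1, the sketch below shows how one block is quantized and reconstructed. It assumes the ggml-style definition (32 weights per block, a scale `d`, a minimum `m`, and one 4-bit code per weight) and keeps everything in Python floats; real files store `d` and `m` in FP16 and pack two codes per byte. The function names are illustrative, and the Q4NX rearrangement itself is not shown.

```python
# Sketch of a ggml-style Q4_1 block (assumed layout, not the Q4NX packing).
BLOCK = 32  # weights per block


def quantize_q4_1(weights):
    """Return (d, m, codes) for one block of 32 floats."""
    assert len(weights) == BLOCK
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels, so the range is split into 15 steps.
    # Fall back to 1.0 when the block is flat to avoid dividing by zero.
    d = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((w - lo) / d))) for w in weights]
    return d, lo, codes


def dequantize_q4_1(d, m, codes):
    """Reconstruct the block: weight = d * code + m."""
    return [d * q + m for q in codes]


block = [i / 10 - 1.5 for i in range(BLOCK)]
d, m, codes = quantize_q4_1(block)
restored = dequantize_q4_1(d, m, codes)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"d={d:.4f} m={m:.4f} max_error={worst:.4f}")
```

The reconstruction error per weight is bounded by half a step (`d / 2`), which is why the per-block `d` and `m` matter more than the 4-bit codes when scale factors are folded into the weights.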

Granite-specific scale factors

Granite 4.1 uses architecture-level multipliers that standard Llama lacks. The four factors below are baked directly into the Q4NX weights at conversion time, so FastFlowLM needs no runtime changes:

Factor	Value	Applied to
`embedding_scale`	12.0	`embed_tokens` (BF16, element-wise)
`residual_scale`	0.22	`o_proj` and `down_proj` per layer (scales Q4_1 `d`/`m` blocks)
`logit_scale`	16.0	`lm_head` (scales Q4_1 `d`/`m` blocks)
`attn_compensation`	`128^(-0.25)` ≈ 0.2973	`q_proj` and `k_proj` per layer; makes the runtime's `1/√d` attention scale match Granite's `1/d`
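
The two tricks in the table can be checked numerically. The sketch below is illustrative only: it assumes the Q4_1 reconstruction `weight = d * code + m`, a head dimension of 128 (implied by `128^(-0.25)`), and a runtime that applies the standard `1/√d` attention scale. The helper names are hypothetical and do not come from the conversion script.

```python
import math


def dequant(d, m, codes):
    """Q4_1 reconstruction: weight = d * code + m."""
    return [d * q + m for q in codes]


def fold_scale(d, m, s):
    """Fold a scalar multiplier s into a block's scale and minimum."""
    return s * d, s * m


def attn_compensation(head_dim):
    """Per-projection factor c chosen so that c * c == head_dim ** -0.5."""
    return head_dim ** -0.25


def score(q, k, scale):
    """Scaled dot-product attention score for one query/key pair."""
    return scale * sum(a * b for a, b in zip(q, k))


# 1) Scaling dequantized weights equals dequantizing with scaled d and m,
#    because s * (d * code + m) == (s * d) * code + (s * m).
codes, d, m, s = [0, 3, 7, 15], 0.05, -0.4, 0.22
direct = [s * w for w in dequant(d, m, codes)]
folded = dequant(*fold_scale(d, m, s), codes)
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(direct, folded))

# 2) Pre-scaling q and k by c turns the runtime's 1/sqrt(d) into 1/d.
#    Short vectors are used for readability; the identity holds for any length.
head_dim = 128
c = attn_compensation(head_dim)
q = [0.5, -1.0, 2.0]
k = [1.5, 0.25, -0.75]
want = score(q, k, 1 / head_dim)
got = score([c * x for x in q], [c * x for x in k], 1 / math.sqrt(head_dim))
assert math.isclose(want, got)
print(round(c, 4))  # 0.2973
```

Splitting the correction evenly between `q_proj` and `k_proj` keeps both projections at a similar magnitude, which may be gentler on quantization than applying the full `128^(-0.5)` to one of them.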

Requirements

FastFlowLM ≥ 0.1.8 with `flm_version: "0.1.8"` support
AMD Ryzen™ AI processor with XDNA2 (NPU2) — Strix Point (Ryzen AI 300 series) or later
XRT runtime installed

Quick start

flm serve --hf Atomic-Germ/Granite-4.1-8B-NPU2

Files

File	Description
`model.q4nx`	Q4NX quantized weights (safetensors container)
`config.json`	FastFlowLM model configuration
`tokenizer.json`	Tiktoken-based tokenizer (100352 vocab)
`tokenizer_config.json`	Chat template and special tokens
`*.xclbin`	Precompiled NPU kernels (attn, layer, lm_head, mm, dequant)

Performance (Ryzen AI 9 HX 370)

Metric	Value
Prefill	~1000–1400 tok/s
Decode	~40–44 tok/s
Context window	131072 tokens
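
To turn these figures into a rough response time, the sketch below adds prefill and decode time for a given prompt and output length. It assumes throughput stays constant across context lengths, which is unlikely to hold near the full 131072-token window, so treat the result as a ballpark. The function name and defaults are illustrative.

```python
def estimate_seconds(prompt_tokens, new_tokens,
                     prefill_tok_s=(1000, 1400), decode_tok_s=(40, 44)):
    """Return (best_case, worst_case) seconds from the table's throughput ranges."""
    best = prompt_tokens / prefill_tok_s[1] + new_tokens / decode_tok_s[1]
    worst = prompt_tokens / prefill_tok_s[0] + new_tokens / decode_tok_s[0]
    return best, worst


best, worst = estimate_seconds(prompt_tokens=4000, new_tokens=200)
print(f"{best:.1f} s to {worst:.1f} s")
```

For short prompts, decode time dominates: at 40 tok/s, 200 generated tokens already take 5 seconds.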

License

The converted Q4NX weights are a derivative of ibm-granite/granite-4.1-8b-base and are distributed under the same Apache 2.0 license.

Citation

@misc{granite2024,
  title={Granite 4.1: Open Foundation Models},
  author={IBM Research},
  year={2024},
  url={https://huggingface.co/ibm-granite/granite-4.1-8b-base}
}
