granite-4.1-8b-FP8-DYNAMIC

Model Description

This is an FP8 dynamic-quantized version of ibm-granite/granite-4.1-8b, IBM's 8B-parameter long-context instruct model. Granite-4.1-8B was fine-tuned from Granite-4.1-8B-Base with supervised fine-tuning and reinforcement-learning alignment for tool calling, instruction following, and chat, and it has a native 128K context window.

Quantization was performed using LLM Compressor v0.11.0 via a post-training one-shot method (no calibration data required). The checkpoint is saved in the compressed-tensors format, natively supported by vLLM and transformers.

Quantization Details

| Property | Value |
|---|---|
| Base model | ibm-granite/granite-4.1-8b |
| Quantization method | compressed-tensors (via LLM Compressor oneshot) |
| Scheme | FP8_DYNAMIC |
| Weight quantization | FP8 (float-quantized), per-channel, symmetric |
| Activation quantization | FP8 (float-quantized), per-token, dynamic |
| Targets | All Linear layers |
| Ignored layers | lm_head (kept in original precision; tied to embed_tokens) |
| LLM Compressor version | 0.11.0 |
| compressed-tensors version | 0.16.0 |
| Calibration data | None required (dynamic activations) |
| Total size on disk | ~9 GB (down from ~18 GB original BF16, ~50% reduction) |

Quantization Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false

Quantization Code

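from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original BF16 checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.1-8b", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.1-8b")

# Quantize every Linear layer to FP8 with dynamic activations; leave lm_head unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
# No calibration dataset is passed because FP8_DYNAMIC does not need one.
oneshot(model=model, recipe=recipe)

# Save the compressed-tensors checkpoint together with the tokenizer.
model.save_pretrained("./granite-4.1-8b-FP8-DYNAMIC")
tokenizer.save_pretrained("./granite-4.1-8b-FP8-DYNAMIC")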

Why FP8 DYNAMIC?

  • No calibration data needed: activation scales are computed at runtime for each token, so no representative dataset is required during quantization.
  • Near-lossless accuracy: per-channel weight scales and per-token activation scales map each row onto the FP8 range, which typically keeps accuracy close to the original BF16 model.
  • ~50% size reduction: FP8 weights take one byte each instead of two, roughly halving storage and memory footprint relative to the original BF16 model.
  • Hardware acceleration: natively supported on NVIDIA Hopper (H100), Ada Lovelace (L40S / RTX 4090), and Blackwell GPUs.
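
To make the per-channel / per-token distinction concrete, here is a minimal NumPy simulation of the FP8_DYNAMIC arithmetic. It is a sketch and not LLM Compressor's implementation: it assumes the E4M3 format (largest finite value 448), and the helper names (round_to_e4m3, quantize_weight_per_channel, quantize_activation_per_token) are made up for illustration.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in E4M3 (assumed format)


def round_to_e4m3(x: np.ndarray) -> np.ndarray:
    """Simulate rounding to the nearest E4M3 value (3 mantissa bits)."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, exp = np.frexp(np.abs(x))       # |x| = m * 2**exp with m in [0.5, 1)
    exp = np.maximum(exp - 1, -6)      # floor(log2|x|), floored at the smallest normal exponent
    step = np.ldexp(1.0, exp - 3)      # spacing between neighbours: 8 steps per power of two
    return np.round(x / step) * step   # np.round rounds ties to even, like IEEE rounding


def quantize_weight_per_channel(w: np.ndarray):
    """Static scale per output channel (row), computed once at quantization time."""
    absmax = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12)
    scale = absmax / FP8_E4M3_MAX
    return round_to_e4m3(w / scale), scale


def quantize_activation_per_token(x: np.ndarray):
    """Dynamic scale per token (row), recomputed for every input at runtime."""
    absmax = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12)
    scale = absmax / FP8_E4M3_MAX
    return round_to_e4m3(x / scale), scale


rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64))   # toy Linear weight: 16 output channels, 64 inputs
x = rng.standard_normal((4, 64))    # toy batch of 4 tokens

wq, w_scale = quantize_weight_per_channel(w)
xq, x_scale = quantize_activation_per_token(x)

# Multiply in the quantized domain, then undo both scales.
y_quant = (xq @ wq.T) * (x_scale * w_scale.T)
y_ref = x @ w.T
rel_err = np.linalg.norm(y_quant - y_ref) / np.linalg.norm(y_ref)
print(f"relative error of the simulated FP8 matmul: {rel_err:.4f}")
```

Because the activation scale is derived from each token's own maximum, nothing about the input distribution has to be known ahead of time, which is why no calibration set is needed.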

Model Architecture

Granite-4.1-8B is a decoder-only dense transformer with GQA, RoPE, SwiGLU MLP, RMSNorm, and tied input/output embeddings. It also uses Granite's scaled-multiplier scheme (attention_multiplier, embedding_multiplier, logits_scaling, residual_multiplier) in the forward pass. These scalars are stored in the model config rather than in Linear weights, so quantization leaves them unchanged.
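
As a rough illustration of where the four multipliers enter the computation, the sketch below applies them to toy arrays. It assumes they are used the way the Hugging Face Granite implementation uses them (embeddings multiplied, attention scores scaled, sublayer outputs scaled before the residual add, logits divided); the function names are hypothetical, the constants are the values listed in the hyperparameter table of this card, and causal masking is omitted for brevity.

```python
import numpy as np

# Values copied from this card's hyperparameter table.
ATTENTION_MULTIPLIER = 0.0078125
EMBEDDING_MULTIPLIER = 12.0
LOGITS_SCALING = 16.0
RESIDUAL_MULTIPLIER = 0.22
HEAD_DIM = 128


def embed(token_vectors: np.ndarray) -> np.ndarray:
    # Token embeddings are scaled up before entering the first layer.
    return token_vectors * EMBEDDING_MULTIPLIER


def attention_weights(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    # The multiplier replaces the usual 1/sqrt(head_dim) score scaling.
    scores = (q @ k.T) * ATTENTION_MULTIPLIER
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def residual_add(residual: np.ndarray, sublayer_output: np.ndarray) -> np.ndarray:
    # Attention and MLP outputs are damped before being added back.
    return residual + RESIDUAL_MULTIPLIER * sublayer_output


def scale_logits(raw_logits: np.ndarray) -> np.ndarray:
    # Final logits are divided, not multiplied, by logits_scaling.
    return raw_logits / LOGITS_SCALING


rng = np.random.default_rng(0)
q = rng.standard_normal((3, HEAD_DIM))
k = rng.standard_normal((3, HEAD_DIM))
attn = attention_weights(q, k)
print(attn.sum(axis=-1))  # each row sums to 1
```

Note that 0.0078125 equals 1/128, that is 1/head_dim, rather than the more common 1/sqrt(head_dim).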

| Hyperparameter | Value |
|---|---|
| Architecture | GraniteForCausalLM |
| Model type | granite |
| Total parameters | ~8B (counted as ~9B on the HF card, including tied embeddings) |
| Layers | 40 |
| Hidden size | 4096 |
| MLP intermediate size | 12800 |
| Attention heads | 32 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Vocabulary size | 100,352 |
| Max position embeddings | 131,072 (128K native context) |
| Activation | SwiGLU (silu) |
| RoPE theta | 10,000,000 |
| RMS norm epsilon | 1e-5 |
| Attention multiplier | 0.0078125 |
| Embedding multiplier | 12.0 |
| Logits scaling | 16.0 |
| Residual multiplier | 0.22 |
| Tied embeddings | Yes (lm_head.weight = model.embed_tokens.weight) |
| Original dtype | bfloat16 |
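
The parameter count can be sanity-checked from the hyperparameters above. The arithmetic below is an estimate under stated assumptions: bias-free projections, two RMSNorm weight vectors per layer plus one final norm, and no FP8 scale tensors counted. The variable names are illustrative only.

```python
HIDDEN = 4096
LAYERS = 40
INTERMEDIATE = 12_800
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 100_352

# Attention: q_proj and o_proj are HIDDEN x (HEADS * HEAD_DIM); k_proj and v_proj use KV_HEADS.
attn = 2 * HIDDEN * HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM
# SwiGLU MLP: gate, up, and down projections.
mlp = 3 * HIDDEN * INTERMEDIATE
# Two RMSNorm weight vectors per decoder layer (assumption).
norms = 2 * HIDDEN

linear_params = LAYERS * (attn + mlp)       # the part that FP8 quantization touches
embedding = VOCAB * HIDDEN                  # shared by embed_tokens and lm_head when tied
total_tied = LAYERS * (attn + mlp + norms) + embedding + HIDDEN  # + final norm
total_untied = total_tied + embedding       # if lm_head is counted as a separate matrix

# Rough checkpoint sizes: 1 byte per FP8 weight, 2 bytes per BF16 value.
fp8_bytes = linear_params * 1 + (total_tied - linear_params) * 2
bf16_bytes = total_tied * 2

print(f"unique parameters:        {total_tied:,}")    # 8,380,551,168
print(f"with lm_head counted too: {total_untied:,}")  # 8,791,592,960
print(f"FP8 checkpoint estimate:  {fp8_bytes / 1e9:.2f} GB")
print(f"BF16 checkpoint estimate: {bf16_bytes / 1e9:.2f} GB")
```

Under these assumptions the unique count is about 8.4B, and counting the tied lm_head as a second matrix gives about 8.8B, which is one plausible reason for a rounded figure of 9B.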

Long-Context (up to 128K)

Granite-4.1-8B supports 131,072-token contexts natively, so unlike RoPE-scaled models it needs no YaRN factor or max_position_embeddings edits. Make sure your serving stack (vLLM ≥ 0.6, transformers ≥ 4.45) has enough memory for the KV cache, which grows linearly with context length.
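
For a back-of-the-envelope sense of that memory, the sketch below estimates the KV-cache size for a single sequence. It assumes a BF16 cache (2 bytes per value) and the layer count, KV-head count, and head dimension from the architecture table; real usage also depends on the serving stack's allocator and on any KV-cache quantization. The function name is illustrative.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 40,
    num_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_value: int = 2,  # BF16; use 1 for an 8-bit cache
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(131_072)

print(f"per token:    {per_token / 1024:.0f} KiB")        # 160 KiB
print(f"128K context: {full_context / 1024**3:.0f} GiB")  # 20 GiB
```

So a single full-length sequence needs roughly 20 GiB of cache in BF16, more than twice the size of the FP8 weights themselves.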

Capabilities

This quantized model preserves the capabilities of the original granite-4.1-8b:

  • Summarization, classification, extraction, Q&A and RAG across business and general-purpose text.
  • Code generation & FIM — supports fill-in-the-middle completions in addition to standard code generation.
  • Function/tool calling — emits structured <tool_call>...</tool_call> JSON for OpenAI-style tool schemas (see below).
  • Multilingual dialog — trained on 12 languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese.

Representative benchmarks for the unquantized granite-4.1-8b (8B Dense, from IBM's card)

| Benchmark | Setting | Score |
|---|---|---|
| MMLU | 5-shot | 73.84 |
| MMLU-Pro | 5-shot, CoT | 55.99 |
| BBH | 3-shot, CoT | 80.51 |
| GPQA | 0-shot, CoT | 41.96 |
| IFEval | Avg | 87.06 |
| GSM8K | 8-shot | 92.49 |
| HumanEval | pass@1 | 85.37 |
| MBPP | pass@1 | 87.30 |
| BFCL v3 | | 68.27 |

How to Use

vLLM (recommended for production)

pip install vllm
vllm serve barryke/granite-4.1-8b-FP8-DYNAMIC

With tool-calling support (Granite's native tool parser):

vllm serve barryke/granite-4.1-8b-FP8-DYNAMIC \
  --enable-auto-tool-choice \
  --tool-call-parser granite
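
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server listens on the default http://localhost:8000; the helper names build_chat_request and post_chat are hypothetical, and the weather tool is just an example schema.

```python
import json
import urllib.request

MODEL_ID = "barryke/granite-4.1-8b-FP8-DYNAMIC"

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specified city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "Name of the city"}},
            "required": ["city"],
        },
    },
}


def build_chat_request(user_message: str, tools: list, model: str = MODEL_ID) -> dict:
    """Build an OpenAI-style chat completion payload that allows tool calls."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
        "tool_choice": "auto",
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """Send the payload to the server (assumed address) and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("What's the weather like in Boston right now?", [WEATHER_TOOL])
# With a server running, call: reply = post_chat(payload)
```

With the tool parser enabled, any tool calls should appear under the reply's choices[0].message.tool_calls field rather than as raw text.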

transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "barryke/granite-4.1-8b-FP8-DYNAMIC"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

chat = [
    {"role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location."},
]
input_ids = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)

response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
# IBM Almaden Research Laboratory, San Jose, California, United States.

Tool calling (transformers)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a specified city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "Name of the city"}},
                "required": ["city"],
            },
        },
    }
]

chat = [{"role": "user", "content": "What's the weather like in Boston right now?"}]
input_ids = tokenizer.apply_chat_template(
    chat, tools=tools, add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(input_ids, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=False))
# <tool_call>
# {"name": "get_current_weather", "arguments": {"city": "Boston"}}
# </tool_call>
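
When using transformers directly, the tool call arrives as plain text and has to be parsed by the caller. A minimal sketch, assuming the output follows the <tool_call>...</tool_call> format shown above; get_current_weather here is a hypothetical stub, and the registry and helper names are illustrative.

```python
import json
import re

TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(generated_text: str) -> list:
    """Extract every well-formed JSON tool call from the generated text."""
    calls = []
    for body in TOOL_CALL_PATTERN.findall(generated_text):
        try:
            call = json.loads(body.strip())
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of crashing
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


def get_current_weather(city: str) -> str:
    # Hypothetical stub; a real implementation would query a weather service.
    return f"(stub) weather report for {city}"


TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def dispatch(call: dict) -> str:
    """Look up the requested tool and invoke it with the model's arguments."""
    function = TOOL_REGISTRY[call["name"]]
    return function(**call.get("arguments", {}))


sample_output = (
    "<tool_call>\n"
    '{"name": "get_current_weather", "arguments": {"city": "Boston"}}\n'
    "</tool_call>"
)
calls = parse_tool_calls(sample_output)
results = [dispatch(call) for call in calls]
print(results)
```

Skipping malformed calls is a design choice for this sketch; a production loop might instead return the parse error to the model so it can retry.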

SGLang

pip install sglang
python3 -m sglang.launch_server \
  --model-path barryke/granite-4.1-8b-FP8-DYNAMIC \
  --host 0.0.0.0 \
  --port 30000

Recommendations

  • Use the Granite chat template: always call tokenizer.apply_chat_template(...) with add_generation_prompt=True. The template wraps messages with the <|start_of_role|> / <|end_of_role|> markers the model was trained on.
  • Hardware requirement: FP8 inference requires an NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace / Hopper / Blackwell). For other GPUs, use the original BF16 model or an INT4 (W4A16) quantization variant.
  • KV-cache budget: at 128K context, the KV cache dominates memory; set --gpu-memory-utilization and --max-model-len accordingly when serving with vLLM.
  • Pair with Granite Guardian: IBM recommends deploying ibm-granite/granite-guardian-4.1-8b alongside Granite instruct models for risk detection in enterprise settings.
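
The hardware requirement can be checked before loading the model. The sketch below compares a (major, minor) compute capability against the 8.9 threshold stated above; the function names are illustrative, and the PyTorch lookup is optional and returns None when PyTorch or a CUDA device is unavailable.

```python
from typing import Optional, Tuple

MIN_FP8_CAPABILITY = (8, 9)  # threshold taken from the recommendation above


def supports_native_fp8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the FP8 threshold."""
    return tuple(capability) >= MIN_FP8_CAPABILITY


def current_gpu_capability() -> Optional[Tuple[int, int]]:
    """Query the first visible CUDA device, or return None if that is not possible."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


capability = current_gpu_capability()
if capability is None:
    print("No CUDA device detected; cannot check FP8 support.")
elif supports_native_fp8(capability):
    print(f"Compute capability {capability}: native FP8 should be available.")
else:
    print(f"Compute capability {capability}: consider the BF16 or W4A16 variants.")
```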

Known Limitations

  • Multilingual asymmetry — while trained on 12 languages, performance on non-English tasks may lag English; few-shot prompting helps.
  • Hallucinations — like all instruct LLMs, the model can produce inaccurate or fabricated content, especially outside its training distribution.
  • Safety — although aligned for safety, the model may still produce biased or unsafe outputs in some cases; domain-specific safety testing is recommended before deployment.

License

This model inherits the Apache License 2.0 from the base model.

Citation

@misc{granite41,
  title  = {Granite 4.1 Language Models},
  author = {{IBM Granite Team}},
  year   = {2026},
  url    = {https://huggingface.co/ibm-granite/granite-4.1-8b},
  note   = {Apache 2.0 licensed 8B dense instruct model with 128K context}
}
@software{llm-compressor,
  title  = {{LLM Compressor: An easy-to-use library for compressing LLMs}},
  author = {{Neuralmagic, vLLM Project}},
  url    = {https://github.com/vllm-project/llm-compressor},
  note   = {Used v0.11.0 to produce this FP8-DYNAMIC checkpoint}
}