granite-4.1-8b-FP8-DYNAMIC

Model Description

This is an FP8 dynamic-quantized version of ibm-granite/granite-4.1-8b, IBM's 8B-parameter long-context instruct model. Granite-4.1-8B was fine-tuned from Granite-4.1-8B-Base using supervised fine-tuning and reinforcement-learning alignment, targeting tool calling, instruction following, and chat. It has a native 128K context window.

Quantization was performed with LLM Compressor v0.11.0 using its post-training one-shot flow. Because activation scales are computed dynamically, no calibration data is required. The checkpoint is saved in the compressed-tensors format, which vLLM and transformers load natively.

Quantization Details

| Property | Value |
| --- | --- |
| Base model | `ibm-granite/granite-4.1-8b` |
| Quantization method | `compressed-tensors` (via LLM Compressor `oneshot`) |
| Scheme | `FP8_DYNAMIC` |
| Weight quantization | FP8 (float-quantized), per-channel, symmetric |
| Activation quantization | FP8 (float-quantized), per-token, dynamic |
| Targets | All `Linear` layers |
| Ignored layers | `lm_head` (kept in original precision; tied to `embed_tokens`) |
| LLM Compressor version | 0.11.0 |
| compressed-tensors version | 0.16.0 |
| Calibration data | None required (dynamic activations) |
| Total size on disk | ~9 GB (down from ~18 GB original BF16, ~50% reduction) |
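
To make the two granularities in the table concrete, the following is a minimal pure-Python sketch of symmetric quantization with one scale per row. It is an illustration only, not LLM Compressor's implementation: `round_to_e4m3`, `quantize_rows`, and `dequantize_rows` are hypothetical helper names, and the sketch assumes the E4M3 flavor of FP8 (3 mantissa bits, largest finite value 448).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 format)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # Values below 2**-6 are subnormal and share the spacing of that binade.
    exponent = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_rows(rows):
    """Symmetric quantization with one scale per row.

    For weights, a row is an output channel and the scale is fixed when the
    checkpoint is produced. For activations, a row is a token and the scale
    is recomputed on every forward pass (the "dynamic" part).
    """
    quantized, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map FP8 codes back to real values using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Toy 2x3 weight matrix: each row gets its own scale.
weights = [[0.5, -0.25, 0.1], [2.0, 0.03, -1.2]]
q_weights, w_scales = quantize_rows(weights)
restored = dequantize_rows(q_weights, w_scales)
print(w_scales)
print(restored)
```

Because the largest magnitude in each row maps to 448, every row uses the full FP8 range regardless of how large or small its values are, which is why no calibration pass is needed to pick activation scales.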

Quantization Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false

Quantization Code

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.1-8b", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.1-8b")

recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("./granite-4.1-8b-FP8-DYNAMIC")
tokenizer.save_pretrained("./granite-4.1-8b-FP8-DYNAMIC")

Why FP8 DYNAMIC?

- No calibration data needed: activation scales are computed at runtime, per token, so quantization does not require a representative dataset.
- Near-lossless accuracy: per-channel weight scales and per-token activation scales map each tensor into the FP8 range, which typically keeps degradation relative to BF16 minimal.
- ~50% size reduction: FP8 stores one byte per weight instead of two, roughly halving the storage and memory footprint relative to the original BF16 model.
- Hardware acceleration: FP8 is natively supported on NVIDIA Hopper (H100), Ada Lovelace (L40S / RTX 4090), and Blackwell GPUs.
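
The size claim can be sanity-checked with back-of-the-envelope arithmetic, using the hyperparameters listed in the Model Architecture section. The sketch below is an estimate under stated assumptions (no bias terms, three MLP projections for SwiGLU, norm weights and quantization scales ignored), not a measurement of the files on disk.

```python
# Hyperparameters copied from the Model Architecture table.
HIDDEN = 4096
INTERMEDIATE = 12800
LAYERS = 40
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 100_352

# Assumptions: no biases, three MLP projections (gate, up, down),
# norm weights and quantization scales ignored as negligible.
attn_params = (
    HIDDEN * HEADS * HEAD_DIM            # q_proj
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM   # k_proj and v_proj
    + HEADS * HEAD_DIM * HIDDEN          # o_proj
)
mlp_params = 3 * HIDDEN * INTERMEDIATE
linear_params = LAYERS * (attn_params + mlp_params)
embedding_params = VOCAB * HIDDEN  # stored once because embeddings are tied

# BF16 baseline: 2 bytes for every parameter.
bf16_bytes = 2 * (linear_params + embedding_params)
# FP8 checkpoint: Linear layers at 1 byte/param, embeddings stay in BF16.
fp8_bytes = linear_params + 2 * embedding_params

print(f"Linear params: {linear_params / 1e9:.2f} B")
print(f"BF16 estimate: {bf16_bytes / 1e9:.1f} GB")
print(f"FP8 estimate:  {fp8_bytes / 1e9:.1f} GB")
print(f"Reduction:     {1 - fp8_bytes / bf16_bytes:.0%}")
```

The FP8 estimate comes out slightly under 9 GB, in line with the figure quoted on this card. The reduction is a little below 50% because the embedding matrix stays in BF16, and the BF16 estimate does not model everything that may be on disk, so it will not match the card's rounded figure exactly.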

Model Architecture

Granite-4.1-8B is a decoder-only dense transformer with grouped-query attention (GQA), RoPE, a SwiGLU MLP, RMSNorm, and tied input/output embeddings. It also uses Granite's scaled-multiplier scheme (`attention_multiplier`, `embedding_multiplier`, `logits_scaling`, `residual_multiplier`), which is applied in the forward pass. These multipliers are scalar config values rather than weights, so quantization leaves them unchanged.

| Hyperparameter | Value |
| --- | --- |
| Architecture | `GraniteForCausalLM` |
| Model type | `granite` |
| Total parameters | ~8B (counted as ~9B on the HF card, including tied embeddings) |
| Layers | 40 |
| Hidden size | 4096 |
| MLP intermediate size | 12800 |
| Attention heads | 32 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Vocabulary size | 100,352 |
| Max position embeddings | 131,072 (128K native context) |
| Activation | SwiGLU (`silu`) |
| RoPE theta | 10,000,000 |
| RMS norm epsilon | 1e-5 |
| Attention multiplier | 0.0078125 |
| Embedding multiplier | 12.0 |
| Logits scaling | 16.0 |
| Residual multiplier | 0.22 |
| Tied embeddings | Yes (`lm_head.weight` = `model.embed_tokens.weight`) |
| Original dtype | bfloat16 |
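
The four multipliers enter the forward pass at different points. The sketch below shows where, assuming they are applied the way the Hugging Face `GraniteForCausalLM` implementation applies them. The function names are hypothetical and operate on plain floats purely for illustration; the real model applies the same scalars to whole tensors.

```python
import math

# Values from the hyperparameter table above.
HEAD_DIM = 128
ATTENTION_MULTIPLIER = 0.0078125
EMBEDDING_MULTIPLIER = 12.0
LOGITS_SCALING = 16.0
RESIDUAL_MULTIPLIER = 0.22


def scale_embedding(embedding_value: float) -> float:
    # Token embeddings are multiplied before entering the first layer.
    return embedding_value * EMBEDDING_MULTIPLIER


def scale_attention_score(q_dot_k: float) -> float:
    # Used in place of the usual 1/sqrt(head_dim) scaling before softmax.
    return q_dot_k * ATTENTION_MULTIPLIER


def add_residual(residual: float, block_output: float) -> float:
    # Each attention/MLP block output is damped before the residual add.
    return residual + block_output * RESIDUAL_MULTIPLIER


def scale_logit(raw_logit: float) -> float:
    # Final logits are divided (not multiplied) by logits_scaling.
    return raw_logit / LOGITS_SCALING


# The attention multiplier equals 1/head_dim rather than 1/sqrt(head_dim).
print(ATTENTION_MULTIPLIER == 1 / HEAD_DIM)
print(1 / math.sqrt(HEAD_DIM))  # conventional scale, for comparison
```

A practical consequence is that a custom inference path that hard-codes the conventional `1/sqrt(head_dim)` scale, or skips the logit division, would produce different outputs from the reference model even with identical weights.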

Long-Context (up to 128K)

Granite-4.1-8B supports 131,072-token contexts natively, so no RoPE scaling (such as a YaRN factor) or edits to `max_position_embeddings` are required. Make sure your serving stack (vLLM ≥ 0.6, transformers ≥ 4.45) has enough memory allocated for the KV cache.
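
To get a feel for how much KV-cache memory "enough" means, the per-sequence cache size can be estimated from the layer count, KV heads, and head dimension in the Model Architecture table. This is a rough sketch: `kv_cache_bytes` is a hypothetical helper, and it assumes a 2-byte (BF16/FP16) cache and ignores any allocator or paging overhead in the serving engine.

```python
# Architecture values from the Model Architecture table.
LAYERS = 40
KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(num_tokens: int, bytes_per_element: int = 2) -> int:
    """Bytes of KV cache for one sequence of `num_tokens` tokens.

    bytes_per_element=2 assumes a BF16/FP16 cache; 1 would model an FP8 cache.
    """
    # Factor 2 accounts for storing both keys and values in every layer.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element
    return num_tokens * per_token


for context in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context:>7} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumptions a single full-length 131,072-token sequence needs about 20 GiB of cache on top of the model weights, which is more than twice the size of the FP8 checkpoint itself.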

Capabilities

This quantized model preserves the capabilities of the original granite-4.1-8b:

- Summarization, classification, extraction, Q&A and RAG across business and general-purpose text.
- Code generation & FIM: supports fill-in-the-middle completions in addition to standard code generation.
- Function/tool calling: emits structured `<tool_call>...</tool_call>` JSON for OpenAI-style tool schemas (see below).
- Multilingual dialog: trained on 12 languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese.

Representative benchmarks for the unquantized base model (8B Dense, from IBM's card)

| Benchmark | Setting | Score |
| --- | --- | --- |
| MMLU | 5-shot | 73.84 |
| MMLU-Pro | 5-shot, CoT | 55.99 |
| BBH | 3-shot, CoT | 80.51 |
| GPQA | 0-shot, CoT | 41.96 |
| IFEval Avg | n/a | 87.06 |
| GSM8K | 8-shot | 92.49 |
| HumanEval | pass@1 | 85.37 |
| MBPP | pass@1 | 87.30 |
| BFCL v3 | n/a | 68.27 |
How to Use

vLLM (recommended for production)

pip install vllm
vllm serve barryke/granite-4.1-8b-FP8-DYNAMIC

With tool-calling support (Granite's native tool parser):

vllm serve barryke/granite-4.1-8b-FP8-DYNAMIC \
  --enable-auto-tool-choice \
  --tool-call-parser granite
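
Once the server is running with the flags above, a client can pass tool schemas through the OpenAI-compatible chat-completions endpoint. The following standard-library sketch shows one way to build the request and read tool calls out of the response. It assumes the server is on `localhost` at vLLM's default port 8000; `WEATHER_TOOL`, `build_request`, `extract_tool_calls`, and `ask` are hypothetical names introduced for illustration.

```python
import json
import urllib.request

# Assumptions: server on localhost, vLLM's default port 8000.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "barryke/granite-4.1-8b-FP8-DYNAMIC"

# Hypothetical tool schema in the OpenAI function-calling format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specified city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(prompt: str, tools: list) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request with tools attached."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from a chat-completions response body."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]


def ask(prompt: str, tools: list) -> list:
    """Send the request to the running server and return parsed tool calls."""
    with urllib.request.urlopen(build_request(prompt, tools), timeout=60) as resp:
        return extract_tool_calls(json.load(resp))
```

With the server up, `ask("What's the weather like in Boston right now?", [WEATHER_TOOL])` would return a list of `(name, arguments)` pairs, or an empty list if the model answered in plain text instead of calling a tool.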

transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "barryke/granite-4.1-8b-FP8-DYNAMIC"

# Load the tokenizer and the quantized checkpoint onto the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

chat = [
    {"role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location."},
]
# return_dict=True yields input_ids plus attention_mask, so generate() gets both.
inputs = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
print(response)
# Example output:
# IBM Almaden Research Laboratory, San Jose, California, United States.

Tool calling (transformers)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a specified city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "Name of the city"}},
                "required": ["city"],
            },
        },
    }
]

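# Reuses `tokenizer` and `model` loaded in the transformers example.
chat = [{"role": "user", "content": "What's the weather like in Boston right now?"}]
# Passing `tools` lets the chat template render the schemas into the prompt.
inputs = tokenizer.apply_chat_template(
    chat, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
# Keep special tokens so the <tool_call> markers stay visible in the output.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))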
# <tool_call>
# {"name": "get_current_weather", "arguments": {"city": "Boston"}}
# </tool_call>
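
The decoded text still has to be turned into an actual function call. Below is a minimal sketch of that step, assuming the model emits one JSON object per `<tool_call>...</tool_call>` block as in the sample output above. `parse_tool_calls`, `dispatch`, `TOOL_REGISTRY`, and the stub `get_current_weather` are hypothetical names, not part of transformers or the model's tooling.

```python
import json
import re

# Non-greedy match so several <tool_call> blocks in one reply are split correctly.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract every well-formed JSON payload wrapped in <tool_call> tags."""
    calls = []
    for raw in TOOL_CALL_PATTERN.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole reply
    return calls


def get_current_weather(city: str) -> str:
    """Hypothetical stub; a real implementation would query a weather service."""
    return f"(stub) weather for {city}"


# Maps tool names the model may emit to local Python callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def dispatch(call: dict) -> str:
    """Look up the requested tool and invoke it with the model's arguments."""
    return TOOL_REGISTRY[call["name"]](**call["arguments"])


sample = (
    "<tool_call>\n"
    '{"name": "get_current_weather", "arguments": {"city": "Boston"}}\n'
    "</tool_call>"
)
for call in parse_tool_calls(sample):
    print(dispatch(call))
```

In a real loop the tool result would be appended to the chat as a tool-role message and the model called again to produce the final answer. Validating `call["name"]` and the arguments against the declared schema before dispatching is advisable, since model output is untrusted input.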

SGLang

pip install sglang
python3 -m sglang.launch_server \
  --model-path barryke/granite-4.1-8b-FP8-DYNAMIC \
  --host 0.0.0.0 \
  --port 30000

Recommendations

- Use the Granite chat template: always call `tokenizer.apply_chat_template(...)` with `add_generation_prompt=True`. The template wraps messages in the `<|start_of_role|>` / `<|end_of_role|>` markers the model was trained on.
- Hardware requirement: FP8 inference requires an NVIDIA GPU with compute capability ≥ 8.9 (Ada Lovelace / Hopper / Blackwell). For other GPUs, use the original BF16 model or an INT4 (W4A16) quantization variant.
- KV-cache budget: at 128K context the KV cache dominates memory, so set `--gpu-memory-utilization` and `--max-model-len` accordingly when serving.
- Pair with Granite Guardian: IBM recommends deploying ibm-granite/granite-guardian-4.1-8b alongside Granite instruct models for risk detection in enterprise settings.
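
The hardware requirement can be checked programmatically before loading the model. The sketch below compares a device's CUDA compute capability against the 8.9 threshold stated above. `supports_native_fp8` and `current_gpu_supports_fp8` are hypothetical helper names; the second one uses PyTorch's `torch.cuda.get_device_capability` and simply returns `False` when PyTorch or a CUDA device is unavailable.

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """True if a CUDA compute capability meets the >= 8.9 requirement."""
    return (major, minor) >= (8, 9)


def current_gpu_supports_fp8(device_index: int = 0) -> bool:
    """Check the given CUDA device; False if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return supports_native_fp8(*torch.cuda.get_device_capability(device_index))


# Capabilities for a few well-known architectures, for reference.
for name, capability in [("A100", (8, 0)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
    print(f"{name}: native FP8 = {supports_native_fp8(*capability)}")
```

A deployment script could call `current_gpu_supports_fp8()` at startup and fall back to the BF16 model or a W4A16 variant when it returns `False`.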

Known Limitations

- Multilingual asymmetry: while trained on 12 languages, performance on non-English tasks may lag English; few-shot prompting helps.
- Hallucinations: like all instruct LLMs, the model can produce inaccurate or fabricated content, especially outside its training distribution.
- Safety: although aligned for safety, the model may still produce biased or unsafe outputs in some cases; domain-specific safety testing is recommended before deployment.

License

This model inherits the Apache License 2.0 from the base model.

Citation

@misc{granite41,
  title  = {Granite 4.1 Language Models},
  author = {{IBM Granite Team}},
  year   = {2026},
  url    = {https://huggingface.co/ibm-granite/granite-4.1-8b},
  note   = {Apache 2.0 licensed 8B dense instruct model with 128K context}
}

@software{llm-compressor,
  title  = {{LLM Compressor: An easy-to-use library for compressing LLMs}},
  author = {{Neuralmagic, vLLM Project}},
  url    = {https://github.com/vllm-project/llm-compressor},
  note   = {Used v0.11.0 to produce this FP8-DYNAMIC checkpoint}
}
