granite-4.1-8b-FP8-DYNAMIC
Model Description
This is an FP8 dynamic-quantized version of ibm-granite/granite-4.1-8b, IBM Granite's 8B-parameter long-context instruct model. Granite-4.1-8B was fine-tuned from Granite-4.1-8B-Base with supervised fine-tuning and reinforcement-learning alignment for strong tool calling, instruction following, and chat, and it has a native 128K context window.
Quantization was performed using LLM Compressor v0.11.0 via a post-training one-shot method (no calibration data required). The checkpoint is saved in the compressed-tensors format, natively supported by vLLM and transformers.
Quantization Details
| Property | Value |
|---|---|
| Base model | ibm-granite/granite-4.1-8b |
| Quantization method | compressed-tensors (via LLM Compressor oneshot) |
| Scheme | FP8_DYNAMIC |
| Weight quantization | FP8 (float-quantized), per-channel, symmetric |
| Activation quantization | FP8 (float-quantized), per-token, dynamic |
| Targets | All Linear layers |
| Ignored layers | lm_head (kept in original precision; tied to embed_tokens) |
| LLM Compressor version | 0.11.0 |
| compressed-tensors version | 0.16.0 |
| Calibration data | None required (dynamic activations) |
| Total size on disk | ~9 GB (down from ~18 GB original BF16, ~50% reduction) |
Quantization Recipe
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false
Quantization Code
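from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original BF16 checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.1-8b", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.1-8b")

# Quantize every Linear layer to FP8 with dynamic activations; lm_head keeps its original precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# One-shot post-training quantization; no calibration dataset is passed.
oneshot(model=model, recipe=recipe)

# Save the quantized checkpoint and tokenizer side by side.
model.save_pretrained("./granite-4.1-8b-FP8-DYNAMIC")
tokenizer.save_pretrained("./granite-4.1-8b-FP8-DYNAMIC")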
Why FP8 DYNAMIC?
- No calibration data needed: dynamic activation quantization computes a scale for each token at runtime, so no representative dataset is required during quantization.
- Near-lossless accuracy: FP8 keeps a floating-point representation, and per-channel weight scales plus per-token activation scales usually keep degradation small relative to the BF16 model.
- ~50% size reduction: FP8 weights take one byte instead of two, roughly halving storage and weight memory versus the original BF16 model.
- Hardware acceleration: natively supported on NVIDIA Hopper (H100), Ada Lovelace (L40S / RTX 4090), and Blackwell GPUs.
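To make "per-token, dynamic" concrete, here is a minimal pure-Python simulation of the activation side of the scheme. It is a sketch, not the kernel that vLLM runs or the code inside LLM Compressor: the helper names (`round_to_e4m3`, `quantize_per_token`, `dequantize`) are hypothetical, and the rounding assumes the E4M3 format (3 mantissa bits, largest finite value 448).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest E4M3-representable value (simulation only)."""
    if value == 0.0:
        return 0.0
    magnitude = min(abs(value), FP8_E4M3_MAX)
    # Normal numbers have exponents >= -6; smaller values share the subnormal step.
    exponent = max(math.floor(math.log2(magnitude)), -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per power of two
    rounded = round(magnitude / step) * step
    return math.copysign(min(rounded, FP8_E4M3_MAX), value)


def quantize_per_token(activations):
    """Return (quantized rows, scales) with one runtime scale per token (row)."""
    quantized, scales = [], []
    for row in activations:
        amax = max(abs(x) for x in row)
        # Map the row's largest magnitude onto the FP8 maximum.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        quantized.append([round_to_e4m3(x / scale) for x in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Multiply each row back by its own scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two "tokens" with very different magnitudes each get their own scale.
acts = [[0.01, -0.02, 0.04], [10.0, -250.0, 3.0]]
q, s = quantize_per_token(acts)
restored = dequantize(q, s)
for original, approx in zip(acts, restored):
    print(original, "->", [round(v, 4) for v in approx])
```

Because each row is scaled by its own maximum, a small-magnitude token is not crushed by a large-magnitude neighbor, which is why no calibration pass is needed to pick activation scales in advance.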
Model Architecture
Granite-4.1-8B is a decoder-only dense transformer with GQA, RoPE, SwiGLU MLP, RMSNorm, and tied input/output embeddings. It also uses Granite's scaled-multiplier scheme (attention_multiplier, embedding_multiplier, logits_scaling, residual_multiplier) baked into the forward pass — these are preserved verbatim by quantization.
| Hyperparameter | Value |
|---|---|
| Architecture | GraniteForCausalLM |
| Model type | granite |
| Total parameters | ~8B (counted as ~9B on the HF card, including tied embeddings) |
| Layers | 40 |
| Hidden size | 4096 |
| MLP intermediate size | 12800 |
| Attention heads | 32 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Vocabulary size | 100,352 |
| Max position embeddings | 131,072 (128K native context) |
| Activation | SwiGLU (silu) |
| RoPE theta | 10,000,000 |
| RMS norm epsilon | 1e-5 |
| Attention multiplier | 0.0078125 |
| Embedding multiplier | 12.0 |
| Logits scaling | 16.0 |
| Residual multiplier | 0.22 |
| Tied embeddings | Yes (lm_head.weight = model.embed_tokens.weight) |
| Original dtype | bfloat16 |
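The parameter count and the approximate checkpoint sizes can be sanity-checked from the table alone. The sketch below is a back-of-the-envelope estimate under stated assumptions: Linear layers carry no bias terms, each layer has two RMSNorm weight vectors plus one final norm, and the small per-channel FP8 scale tensors are ignored. The variable names are illustrative.

```python
# Hyperparameters copied from the table above.
LAYERS = 40
HIDDEN = 4096
INTERMEDIATE = 12800
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 100_352

# Assumption: no bias terms in any Linear layer.
attn = HIDDEN * HEADS * HEAD_DIM            # q_proj
attn += 2 * HIDDEN * KV_HEADS * HEAD_DIM    # k_proj and v_proj (GQA: fewer KV heads)
attn += HEADS * HEAD_DIM * HIDDEN           # o_proj
mlp = 3 * HIDDEN * INTERMEDIATE             # gate, up, and down projections (SwiGLU)

linear_params = LAYERS * (attn + mlp)
norm_params = LAYERS * 2 * HIDDEN + HIDDEN  # assumed: two RMSNorms per layer + final norm
embed_params = VOCAB * HIDDEN               # shared with lm_head when embeddings are tied

total_tied = linear_params + norm_params + embed_params
total_untied = total_tied + embed_params    # counting lm_head separately

GB = 1e9
bf16_gb = total_tied * 2 / GB
# FP8 stores 1 byte per Linear weight; embeddings and norms stay at 2 bytes.
fp8_gb = (linear_params + 2 * (embed_params + norm_params)) / GB

print(f"parameters (tied):   {total_tied:,}")
print(f"parameters (untied): {total_untied:,}")
print(f"estimated BF16 size: {bf16_gb:.1f} GB")
print(f"estimated FP8 size:  {fp8_gb:.1f} GB")
```

Under these assumptions the tied count comes to about 8.4B and the untied count to about 8.8B, in line with the "~8B" and "~9B" figures in the table, and the FP8 estimate lands close to the stated ~9 GB on disk.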
Long-Context (up to 128K)
Unlike RoPE-scaled models, Granite-4.1-8B natively supports 131,072-token contexts out of the box — no YaRN factor or max_position_embeddings edits required. Just make sure your serving stack (vLLM ≥ 0.6, transformers ≥ 4.45) allocates enough KV-cache memory.
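As a rough guide to how much KV-cache memory a full-length sequence needs, the following sketch applies the standard estimate (keys and values, per layer, per KV head) to the hyperparameters listed in the architecture table. It assumes the cache is stored in a 16-bit dtype (2 bytes per value); the actual cache dtype and any paging overhead depend on the serving stack, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(tokens, layers=40, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size for one sequence of `tokens` tokens."""
    # Factor 2: each layer stores one key and one value vector per KV head per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(131_072)

print(f"{per_token / 1024:.0f} KiB per token")                          # 160 KiB per token
print(f"{full_context / 2**30:.0f} GiB for one 131,072-token sequence")  # 20 GiB
```

At full context a single sequence therefore needs roughly twice the memory of the FP8 weights themselves, which is why the context length is usually the first thing to cap on smaller GPUs.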
Capabilities
This quantized model preserves the capabilities of the original granite-4.1-8b:
- Summarization, classification, extraction, Q&A and RAG across business and general-purpose text.
- Code generation & FIM: supports fill-in-the-middle completions in addition to standard code generation.
- Function/tool calling: emits structured <tool_call>...</tool_call> JSON for OpenAI-style tool schemas (see below).
- Multilingual dialog: trained on 12 languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese.
Representative benchmarks of the unquantized base model, ibm-granite/granite-4.1-8b (8B Dense, from IBM's card)
| Benchmark | Setting | Score |
|---|---|---|
| MMLU | 5-shot | 73.84 |
| MMLU-Pro | 5-shot, CoT | 55.99 |
| BBH | 3-shot, CoT | 80.51 |
| GPQA | 0-shot, CoT | 41.96 |
| IFEval Avg | — | 87.06 |
| GSM8K | 8-shot | 92.49 |
| HumanEval | pass@1 | 85.37 |
| MBPP | pass@1 | 87.30 |
| BFCL v3 | — | 68.27 |
How to Use
vLLM (recommended for production)
pip install vllm
vllm serve barryke/granite-4.1-8b-FP8-DYNAMIC
With tool-calling support (Granite's native tool parser):
vllm serve barryke/granite-4.1-8b-FP8-DYNAMIC \
--enable-auto-tool-choice \
--tool-call-parser granite
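Once the server is running with tool calling enabled, a client can send tool schemas through the OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes vLLM's default address, http://localhost:8000; `build_chat_request` and `post_chat` are hypothetical helper names, and the weather tool is the same illustrative schema used in the transformers tool-calling example in this card.

```python
import json
import urllib.request

MODEL_ID = "barryke/granite-4.1-8b-FP8-DYNAMIC"

# Illustrative tool schema in the OpenAI function-calling format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specified city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "Name of the city"}},
            "required": ["city"],
        },
    },
}


def build_chat_request(user_message, tools, model=MODEL_ID):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
        "tool_choice": "auto",
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """Send the request to a running server (assumed default vLLM address)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("What's the weather like in Boston right now?", [WEATHER_TOOL])

# With the server running, uncomment to send the request:
# reply = post_chat(payload)
# print(reply["choices"][0]["message"].get("tool_calls"))
```

The network call is left commented out so the snippet can be imported or run without a live server.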
transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "barryke/granite-4.1-8b-FP8-DYNAMIC"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
chat = [
    {"role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location."},
]
# return_dict=True returns input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, not the prompt.
response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
# IBM Almaden Research Laboratory, San Jose, California, United States.
Tool calling (transformers)
# Reuses `tokenizer` and `model` from the previous snippet.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a specified city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "Name of the city"}},
                "required": ["city"],
            },
        },
    }
]
chat = [{"role": "user", "content": "What's the weather like in Boston right now?"}]
inputs = tokenizer.apply_chat_template(
    chat, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
# Keep special tokens so the <tool_call> markers remain visible.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
# <tool_call>
# {"name": "get_current_weather", "arguments": {"city": "Boston"}}
# </tool_call>
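The decoded text still has to be turned into a structured call before a tool can be dispatched. A minimal sketch of that step, assuming the output follows exactly the `<tool_call>` ... `</tool_call>` layout shown above (`parse_tool_calls` is a hypothetical helper, not part of transformers):

```python
import json
import re

# Capture whatever sits between the tool-call markers, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": ...} dicts found in `text`."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
    return calls


sample = (
    "<tool_call>\n"
    '{"name": "get_current_weather", "arguments": {"city": "Boston"}}\n'
    "</tool_call>"
)
calls = parse_tool_calls(sample)
print(calls[0]["name"], calls[0]["arguments"])
# get_current_weather {'city': 'Boston'}
```

When serving through vLLM with the Granite tool parser enabled, this parsing happens server-side and the calls arrive in the response's `tool_calls` field instead.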
SGLang
pip install sglang
python3 -m sglang.launch_server \
--model-path barryke/granite-4.1-8b-FP8-DYNAMIC \
--host 0.0.0.0 \
--port 30000
Recommendations
- Use the Granite chat template: always call tokenizer.apply_chat_template(...) with add_generation_prompt=True. The template wraps messages with the <|start_of_role|> / <|end_of_role|> markers the model was trained on.
- Hardware requirement: FP8 inference requires an NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace / Hopper / Blackwell). For other GPUs, use the original BF16 model or an INT4 (W4A16) quantization variant.
- KV-cache budget: at 128K context, the KV cache dominates memory; size --gpu-memory-utilization / --max-model-len accordingly when serving.
- Pair with Granite Guardian: IBM recommends deploying ibm-granite/granite-guardian-4.1-8b alongside Granite instruct models for risk detection in enterprise settings.
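The hardware requirement above can be checked programmatically before loading the model. The sketch below compares a CUDA compute capability against the 8.9 threshold stated in this card; `supports_native_fp8` and `detect_capability` are hypothetical helper names, and the detection path assumes PyTorch is installed (it falls back gracefully when it is not).

```python
def supports_native_fp8(capability):
    """True if a (major, minor) compute capability is at least 8.9."""
    return tuple(capability) >= (8, 9)


def detect_capability():
    """Return the first GPU's (major, minor) capability, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = detect_capability()
if cap is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
elif supports_native_fp8(cap):
    print(f"Compute capability {cap[0]}.{cap[1]}: FP8 inference should be supported.")
else:
    print(f"Compute capability {cap[0]}.{cap[1]}: consider the BF16 or W4A16 variant.")
```

Tuple comparison is used so that, for example, (9, 0) correctly ranks above (8, 9) without any string or float parsing.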
Known Limitations
- Multilingual asymmetry — while trained on 12 languages, performance on non-English tasks may lag English; few-shot prompting helps.
- Hallucinations — like all instruct LLMs, the model can produce inaccurate or fabricated content, especially outside its training distribution.
- Safety — although aligned for safety, the model may still produce biased or unsafe outputs in some cases; domain-specific safety testing is recommended before deployment.
License
This model inherits the Apache License 2.0 from the base model.
Citation
@misc{granite41,
title = {Granite 4.1 Language Models},
author = {{IBM Granite Team}},
year = {2026},
url = {https://huggingface.co/ibm-granite/granite-4.1-8b},
note = {Apache 2.0 licensed 8B dense instruct model with 128K context}
}
@software{llm-compressor,
title = {{LLM Compressor: An easy-to-use library for compressing LLMs}},
author = {{Neuralmagic, vLLM Project}},
url = {https://github.com/vllm-project/llm-compressor},
note = {Used v0.11.0 to produce this FP8-DYNAMIC checkpoint}
}