Llama-3.3-70B-Instruct-w4a16-llmcompressor-v0.11.0
Model Overview
- Model Architecture: LlamaForCausalLM
- Input: Text
- Output: Text
- Source Model: Llama-3.3-70B-Instruct
- Supported Hardware: AMD EPYC (CPU inference)
- Preferred Operating System: Linux
- Inference Engine: vLLM v0.22.0
- Quantization Framework: LLM Compressor v0.11.0
- Quantization Method: 4-bit Weight-Only Quantization (W4A16), Asymmetric
- Compatible Stack:
  - ZenDNN v6.0.0
  - ZenTorch v2.11.0.1
  - PyTorch v2.11.0
  - LLM Compressor v0.11.0
  - vLLM v0.22.0
ZenTorch v2.11.0.1 for PyTorch v2.11.0 has to be built from source.
This model was Built with Llama. This is a quantized version of Llama-3.3-70B-Instruct created by AMD using LLM Compressor (compressed-tensors) for ZenDNN-optimized CPU inference.
Quantization
The model was quantized from Llama-3.3-70B-Instruct using LLM Compressor v0.11.0.
- Method: 4-bit Weight-Only Quantization (W4A16), Asymmetric
- Config: compressed-tensors, num_bits=4, type=int, symmetric=false, group_size=128
- Quantized Layers: All nn.Linear layers excluding lm_head
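To make the configuration above concrete, the following is a minimal sketch of asymmetric, group-wise 4-bit quantization using plain min/max round-to-nearest. It is an illustration only: it does not reproduce AWQ's activation-aware scaling or the packed storage layout used by compressed-tensors, and all function names are hypothetical.

```python
from typing import List, Tuple

NUM_BITS = 4                       # num_bits from the config above
GROUP_SIZE = 128                   # group_size from the config above
QMIN, QMAX = 0, 2**NUM_BITS - 1    # unsigned 4-bit code range: 0..15


def quantize_group(weights: List[float]) -> Tuple[List[int], float, int]:
    """Quantize one group with its own scale and zero point (asymmetric)."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    # Fall back to 1.0 for an all-zero group to avoid dividing by zero.
    scale = (hi - lo) / (QMAX - QMIN) or 1.0
    zero_point = max(QMIN, min(QMAX, round(-lo / scale)))
    codes = [max(QMIN, min(QMAX, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row: List[float], group_size: int = GROUP_SIZE):
    """Split one weight row into consecutive groups and quantize each one."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]
```

Because every group of 128 weights carries its own scale and zero point, an outlier only widens the quantization step inside its own group. With symmetric=false the zero point lets the 16 available codes cover a range that is not centered on zero.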
The model was produced using LLM Compressor as shown below:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
OUTPUT_DIR = "./Llama-3.3-70B-Instruct-w4a16-llmcompressor-v0.11.0"
NUM_CALIB, MAX_SEQ_LEN = 128, 2048
# Load the model and tokenizer (bf16)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="cpu", dtype="bfloat16", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
# AWQ recipe: 4-bit asymmetric weight-only, group size 128, skip lm_head
recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]
# Load and tokenize calibration data (128 samples from ultrachat_200k)
calib = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIB}]")
calib = calib.map(
    lambda ex: {"text": "\n".join(m["content"] for m in ex["messages"] if m.get("content"))},
    remove_columns=calib.column_names,
)
calib = calib.map(
    lambda ex: tokenizer(
        ex["text"], truncation=True, max_length=MAX_SEQ_LEN, add_special_tokens=False
    ),
    remove_columns=["text"],
)
# Quantize with the oneshot API
oneshot(
    model=model,
    dataset=calib,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    processor=tokenizer,
)
# Save in compressed-tensors format
model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
Note: This quantization method (W4A16 asymmetric weight-only quantization) is specific to the ZenDNN execution path.
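As a rough guide to the storage cost of 4-bit weights with a group size of 128, the sketch below estimates the effective number of bits per quantized weight. It assumes one 16-bit scale and one 4-bit zero point per group; these widths are assumptions about the stored format, not figures stated in this card, and layers left unquantized (such as lm_head) are not counted.

```python
def effective_bits_per_weight(
    num_bits: int = 4,          # from the quantization config
    group_size: int = 128,      # from the quantization config
    scale_bits: int = 16,       # assumption: one 16-bit scale per group
    zero_point_bits: int = 4,   # assumption: one 4-bit zero point per group
) -> float:
    """Weight bits plus the per-group metadata amortized over the group."""
    return num_bits + (scale_bits + zero_point_bits) / group_size


def quantized_size_gib(num_params: int, **layout) -> float:
    """Approximate size in GiB of `num_params` quantized weights."""
    return num_params * effective_bits_per_weight(**layout) / 8 / 2**30


def compression_vs_bf16(**layout) -> float:
    """How many times smaller the quantized weights are than 16-bit weights."""
    return 16 / effective_bits_per_weight(**layout)
```

Under these assumptions the per-group metadata adds 20 / 128 bits to each weight, so the quantized linear layers take a little over a quarter of their BF16 size.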
Quick Start
Requirements
pip install \
    torch==2.11.0 \
    zentorch==2.11.0.1 \
    vllm==0.22.0 \
    huggingface_hub \
    "lm-eval[vllm]==0.4.12"
ZenTorch v2.11.0.1 for PyTorch v2.11.0 has to be built from source.
CPU runtime libraries (only needed if not already present):
conda install -c conda-forge gperftools=2.17.2=h65a8314_0 --no-deps -y
conda install -c conda-forge llvm-openmp=18.1.8=hf5423f3_1 --no-deps -y
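Since the stack is pinned to specific versions, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using the standard library's importlib.metadata; the helper names are hypothetical, and the distribution names are taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, Optional, Tuple

# Pinned versions from the requirements above.
EXPECTED = {"torch": "2.11.0", "zentorch": "2.11.0.1", "vllm": "0.22.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(
    expected: Dict[str, str] = EXPECTED,
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each missing or mismatched package to (wanted, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Ignore local build tags such as "+cpu" when comparing versions.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

Calling find_mismatches() with no arguments returns an empty dict when all three packages match the pinned versions, and otherwise lists each offending package with the wanted and found versions.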
Recommended environment variables
# vLLM CPU runtime tuning
export VLLM_USE_AOT_COMPILE=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# TorchInductor
export TORCHINDUCTOR_FREEZING=1
export TORCHINDUCTOR_AUTOGRAD_CACHE=0
# ZenTorch / ZenDNN
export ZENDNNL_MATMUL_ALGO=1
export ZENTORCH_FUSED_MOE=1
For optimal performance, use LD_PRELOAD to load libtcmalloc_minimal.so.4 (tcmalloc) and an OpenMP runtime (libiomp5.so for Intel OpenMP, or libomp.so for LLVM OpenMP):
export LD_PRELOAD=<path to lib>/libtcmalloc_minimal.so.4:<path to lib>/libiomp5.so${LD_PRELOAD:+:$LD_PRELOAD}
Locate the libraries with find / -name 'libtcmalloc_minimal.so.4' and find / -name 'libiomp5.so', then substitute the resulting directory for <path to lib>.
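The search and substitution described above can also be sketched in Python. The helper names below are hypothetical; build_ld_preload mirrors the shell expansion ${LD_PRELOAD:+:$LD_PRELOAD} by appending an existing value only when it is non-empty.

```python
import os
from typing import Iterable, Optional


def find_library(name: str, search_dirs: Iterable[str]) -> Optional[str]:
    """Return the first path under search_dirs whose file name equals `name`."""
    for directory in search_dirs:
        # os.walk silently yields nothing for directories that do not exist.
        for root, _dirs, files in os.walk(directory):
            if name in files:
                return os.path.join(root, name)
    return None


def build_ld_preload(libraries: Iterable[str], existing: Optional[str] = None) -> str:
    """Join library paths with ':' and keep any existing LD_PRELOAD value last."""
    parts = list(libraries)
    if existing:
        parts.append(existing)
    return ":".join(parts)
```

A typical use is to print build_ld_preload([...], os.environ.get("LD_PRELOAD")) and export the result in the shell. Assigning os.environ["LD_PRELOAD"] inside a Python process that is already running does not preload anything into that process, because the dynamic loader reads the variable at process start.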
Note: Set LD_PRELOAD before launching vLLM or any inference script.
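Once the environment is configured, the model can be served with vLLM and queried over its OpenAI-compatible API. The sketch below is a minimal standard-library client; it assumes a server was started separately (for example with vllm serve and the model ID) and is listening on http://localhost:8000, which is vLLM's default port. The helper names and the max_tokens and timeout values are illustrative choices, not recommendations from this card.

```python
import json
import urllib.request

MODEL_ID = "amd/Llama-3.3-70B-Instruct-w4a16-llmcompressor-v0.11.0"
BASE_URL = "http://localhost:8000"  # assumption: local vLLM server on the default port


def build_chat_request(
    prompt: str,
    model: str = MODEL_ID,
    base_url: str = BASE_URL,
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(ask("What is the capital of France?")) prints the model's reply. The generous timeout is an assumption that CPU inference on a 70B model can take a while to respond.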
Evaluation
The model was evaluated against the BF16 (unquantized) baseline on GSM8K (5-shot) using lm-evaluation-harness with the vLLM engine.
| Benchmark | BF16 Baseline | W4A16-Asym (this model) | Recovery |
|---|---|---|---|
| GSM8K (5-shot, flexible-extract) | - | - | - |
Evaluation Command
lm_eval \
    --model vllm \
    --model_args pretrained="amd/Llama-3.3-70B-Instruct-w4a16-llmcompressor-v0.11.0",dtype=bfloat16 \
    --tasks gsm8k \
    --batch_size auto \
    --trust_remote_code \
    --num_fewshot 5 \
    --log_samples \
    --gen_kwargs "max_gen_toks=2048" \
    --apply_chat_template \
    --output_path .
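The card does not define the Recovery column in the table above. A common convention, assumed here, is the quantized score expressed as a percentage of the BF16 baseline score. The sketch below computes it from two results dictionaries of the kind lm-evaluation-harness writes to its output path; the metric key exact_match,flexible-extract is an assumption about that JSON layout, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict

# Assumed metric key for GSM8K with the flexible-extract filter.
GSM8K_METRIC = "exact_match,flexible-extract"


def load_results(path: str) -> Dict[str, Any]:
    """Load a results JSON file written by lm_eval."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def gsm8k_score(results: Dict[str, Any], metric: str = GSM8K_METRIC) -> float:
    """Extract the GSM8K score from a loaded results dictionary."""
    return float(results["results"]["gsm8k"][metric])


def recovery_percent(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * quantized_score / baseline_score
```

Running the evaluation command once for this model and once for the BF16 source model gives the two result files; passing each through load_results and gsm8k_score, then both scores to recovery_percent, yields the value for the Recovery column.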
Limitations
- Version Lock: This model is compatible with ZenDNN v6.0.0 / PyTorch v2.11.0. It may not load correctly on other versions.
- CPU Only: This model is optimized for AMD EPYC CPU inference via ZenDNN. It is not intended for GPU inference.
License
This model is distributed under the same license as the source model. See the LICENSE file for details.
Modifications copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.