FP8 LLMs for vLLM
Accurate FP8 quantized models by Neural Magic, ready for use with vLLM!
Mixtral-8x22B-Instruct-v0.1 quantized to FP8: weight scales are static (computed once, per tensor) and activation scales are dynamic (computed at runtime), ready for inference with vLLM >= 0.5.0. Produced using AutoFP8 with activation_scheme="dynamic", keeping the lm_head and block_sparse_moe.gate layers at their original precision.
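
As a minimal serving sketch (assuming the quantized checkpoint has been saved locally under the directory name used in the script below, or pushed to the Hub under a similar id), loading it with vLLM looks like:

from vllm import LLM, SamplingParams

# Path/id of the quantized checkpoint (assumed; matches quantized_model_dir below)
llm = LLM(
    model="Mixtral-8x22B-Instruct-v0.1-FP8-dynamic",
    tensor_parallel_size=8,  # 8x22B still spans several GPUs, even in FP8
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["[INST] Summarize FP8 quantization. [/INST]"], params)
print(outputs[0].outputs[0].text)

vLLM reads the FP8 scheme from the checkpoint's quantization config, so no extra quantization flag should be needed. The checkpoint itself was produced with the following script: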
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "mistralai/Mixtral-8x22B-Instruct-v0.1"
quantized_model_dir = "Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="dynamic",  # activation scales are computed per batch at runtime
    # Keep the lm_head and the MoE router (block_sparse_moe.gate) at original precision
    ignore_patterns=["re:.*lm_head", "re:.*gate"],
)

# The dynamic scheme needs no calibration data, so the example list stays empty
examples = []

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)  # quantizes weights to FP8 with static per-tensor scales
model.save_quantized(quantized_model_dir)
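
For intuition on what "static weights, dynamic activations" means, here is a rough per-tensor FP8 (E4M3) quantization sketch in plain PyTorch. This is an illustration only, not AutoFP8's internals:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_fp8_per_tensor(x: torch.Tensor):
    # One scale for the whole tensor, chosen so the largest value maps to FP8 max
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize as x_fp8.to(torch.float32) * scale

# "Static": weight scales are computed once at quantization time and stored.
w_fp8, w_scale = quantize_fp8_per_tensor(torch.randn(4096, 4096))
# "Dynamic": the same computation runs on each activation batch at inference,
# which is why no calibration examples are needed above.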