Script used to create this model. llmcompressor was installed from source because the script depends on functionality that had not yet been included in a release at the time.

from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

# Load model.
model_class = wrap_hf_model_class(Qwen2VLForConditionalGeneration)
model = model_class.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

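# Configure simple post-training quantization (PTQ): quantize every Linear
# layer to FP8 with dynamic activation scales, leaving the LM head and the
# vision tower ("visual.*") in their original precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],
)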

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Save the model.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
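
The `ignore` list in the recipe above decides which Linear layers are left unquantized. The following is a minimal sketch of how such patterns could select modules. It assumes that a `re:` prefix marks a regular expression matched from the start of the module name and that any other entry is an exact name. The module names and the `is_ignored` helper are hypothetical illustrations, not part of llmcompressor.

```python
import re

# Hypothetical module names, loosely modeled on a Qwen2-VL style layout.
MODULE_NAMES = [
    "visual.blocks.0.attn.qkv",
    "visual.merger.mlp.0",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

# Same patterns as the recipe in the script above.
IGNORE = ["re:.*lm_head", "re:visual.*"]


def is_ignored(name: str, patterns: list[str]) -> bool:
    """Return True if `name` matches any ignore pattern.

    Assumption: a "re:" prefix marks a regular expression matched from the
    start of the module name; any other entry is compared as an exact name.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[len("re:"):], name):
                return True
        elif pattern == name:
            return True
    return False


# Split the modules into those that would be quantized and those skipped.
quantized = [n for n in MODULE_NAMES if not is_ignored(n, IGNORE)]
skipped = [n for n in MODULE_NAMES if is_ignored(n, IGNORE)]

print("quantized:", quantized)
print("skipped:", skipped)
```

Under these assumptions, only the language-model projection layers end up in `quantized`, while the vision tower and `lm_head` are skipped.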
Model size: 8.29B params. Tensor types: BF16, F8_E4M3 (Safetensors).