Qwen3-VL-8B-Instruct-AWQ-W4A16

This repository provides an AWQ post-training quantized version of Qwen3-VL-8B-Instruct for efficient multimodal inference and evaluation.

Overview

This model is a third-party compressed checkpoint built on top of Qwen3-VL-8B-Instruct, mainly for efficient deployment, benchmarking, and PTQ baseline construction.

The current release uses AWQ W4A16 quantization in the llm-compressor workflow, with group_size=128 and observer="mse".

Compared with the original checkpoint layout, this release also reduces storage footprint in a practical way.

  • Original size: 4,787,379 KB + 4,800,745 KB + 4,882,648 KB + 2,652,608 KB
  • Quantized size: 4,880,425 KB + 2,174,541 KB
  • Compression: -58.799%

Base Model

  • Base model: Qwen/Qwen3-VL-8B-Instruct
  • Model family: Qwen3-VL
  • Quantization method: AWQ
  • Quantization format: W4A16
  • Framework: llm-compressor

Quantization Recipe

The released checkpoint follows the following AWQ recipe:

recipe = AWQModifier(
    ignore=[
        "re:.*lm_head", "re:.*visual.*"
    ],
    duo_scaling=False,
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "group_size": 128,
                "strategy": "group",
                "dynamic": False,
                "actorder": None,
                "observer": "mse",
            },
        },
    },
)

Notes

  • lm_head is excluded from quantization.
  • Modules matching re:.*visual.* are excluded from quantization.
  • This release is intended as a practical AWQ-compressed checkpoint for deployment and evaluation.

Calibration Setup

Calibration data was constructed from the Flickr30k image-caption dataset, a widely used multimodal benchmark containing 31,783 images and 158,915 English captions (five captions per image).

For AWQ calibration, 128 samples were selected from the local Flickr30k parquet files after dataset loading and random shuffling with a fixed seed (seed=42). Each sample was converted into a multimodal chat-style input consisting of:

  • one image
  • one paired caption text
  • processor-generated multimodal fields such as input_ids, attention_mask, pixel_values, and image_grid_thw

This setup was used to provide representative multimodal activations for post-training quantization in the llm-compressor one-shot workflow.

Calibration Details

  • Dataset: Flickr30k
  • Data format: local parquet files
  • Number of calibration samples: 128
  • Sampling strategy: shuffled subset with fixed random seed
  • Max sequence length: 2048
  • Purpose: multimodal activation calibration for AWQ PTQ

Evaluation Configuration

For evaluation in VLMEvalKit, the following model entry can be added to VLMEvalKit/vlmeval/config.py:

"Qwen3-VL-8B-Instruct-AWQ-W4A16": partial(
    vlm.Qwen3VLChat,
    model_path="MLliu6/Qwen3-VL-8B-Instruct-AWQ-W4A16",
    use_custom_prompt=False,
    use_vllm=True,
    temperature=0.7,
    max_new_tokens=8192,
    repetition_penalty=1.0,
    presence_penalty=1.5,
    top_p=0.8,
    top_k=20,
)

Intended Use

This release is intended for:

  • Efficient multimodal inference
  • PTQ baseline construction for Qwen3-VL
  • Evaluation with VLMEvalKit
  • Serving experiments with vLLM
  • Research on VLM post-training quantization

Disclaimer

This is a third-party quantized checkpoint and is not an official release from the Qwen team.

Quantization may affect model quality on some multimodal tasks, especially fine-grained visual understanding and reasoning benchmarks.

Citation

If you use this model, please cite the original Qwen3-VL report, AWQ, and VLMEvalKit.

@article{bai2025qwen3vl,
  title={Qwen3-VL Technical Report},
  author={Bai, Shuai and Cai, Yuxuan and Zhu, Keming and others},
  journal={arXiv preprint arXiv:2511.21631},
  year={2025}
}

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv preprint arXiv:2306.00978},
  year={2023}
}

@misc{duan2024vlmevalkit,
  title={VLMEvalKit: An Open-Source Toolkit for Evaluating Large Vision-Language Models},
  author={OpenCompass Team},
  howpublished={\url{https://github.com/open-compass/VLMEvalKit}},
  year={2024}
}

@article{young2014image,
  title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
  author={Young, Peter and Lai, Alice and Hodosh, Micah and Hockenmaier, Julia},
  journal={Transactions of the Association for Computational Linguistics},
  volume={2},
  pages={67--78},
  year={2014},
  publisher={MIT Press}
}

Acknowledgement

This repository is built upon the following excellent open-source projects:

Downloads last month
4,337
Safetensors
Model size
3B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for MLliu6/Qwen3-VL-8B-Instruct-AWQ-W4A16

Quantized
(84)
this model

Papers for MLliu6/Qwen3-VL-8B-Instruct-AWQ-W4A16