Qwen3-VL-4B-Instruct-AWQ-W4A16
This repository provides an AWQ post-training quantized version of Qwen3-VL-4B-Instruct for efficient multimodal inference and evaluation.
Overview
This model is a third-party compressed checkpoint built on top of Qwen3-VL-4B-Instruct, mainly for efficient deployment, benchmarking, and PTQ baseline construction.
The current release uses AWQ W4A16 quantization in the llm-compressor workflow, with group_size=128 and observer="mse".
This release also reduces the on-disk storage footprint relative to the original checkpoint:
- Original size: 4,850,810 KB + 3,816,885 KB
- Quantized size: 4,160,642 KB
- Compression: -51.998%
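The compression figure can be checked with a few lines of arithmetic. The sketch below assumes the two original sizes are separate weight files whose sizes are summed, and that the percentage is the relative change in total size; the variable names are illustrative.

```python
# Sizes in KB, copied from the list above.
original_files_kb = [4_850_810, 3_816_885]
quantized_kb = 4_160_642

# Assumption: the original footprint is the sum of the two listed files.
original_kb = sum(original_files_kb)

# Relative size change; a negative value means the checkpoint got smaller.
change_pct = (quantized_kb - original_kb) / original_kb * 100

print(f"original:    {original_kb:,} KB")
print(f"quantized:   {quantized_kb:,} KB")
print(f"size change: {change_pct:.3f}%")   # -51.998%
```

Under these assumptions the quantized checkpoint is roughly 48% of the original size.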
Base Model
- Base model: Qwen/Qwen3-VL-4B-Instruct
- Model family: Qwen3-VL
- Quantization method: AWQ
- Quantization format: W4A16
- Framework: llm-compressor
Quantization Recipe
The released checkpoint was produced with the following AWQ recipe:
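from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    # Skip the output head and every module under the vision tower.
    ignore=[
        "re:.*lm_head", "re:.*visual.*"
    ],
    duo_scaling=False,
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            # 4-bit symmetric integer weights, one scale per group of 128.
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "group_size": 128,
                "strategy": "group",
                "dynamic": False,
                "actorder": None,
                "observer": "mse",
            },
        },
    },
)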
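To make the weight settings in the recipe above concrete, here is a minimal pure-Python sketch of symmetric 4-bit group quantization with `group_size=128`. It is a simplification, not the llm-compressor implementation: it derives each scale from the group's largest absolute value (a min-max rule), whereas the recipe uses the `mse` observer, and it omits AWQ's activation-aware scaling entirely. All function names are hypothetical.

```python
def quantize_group(weights, num_bits=4):
    """Symmetric integer quantization of one group with a single shared scale."""
    qmax = 2 ** (num_bits - 1) - 1      # 7 for int4
    qmin = -(2 ** (num_bits - 1))       # -8 for int4
    max_abs = max(abs(w) for w in weights)
    # Assumption: simple min-max scale, not the MSE-based observer.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=128):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Reconstruct approximate float weights from (codes, scale) pairs."""
    return [q * scale for codes, scale in groups for q in codes]


# Toy row of 256 weights in [-0.1, 0.1], giving two groups of 128.
row = [((i * 37) % 101 - 50) / 500.0 for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups, max abs error:", max_err)
```

Activations are left at 16-bit precision in a W4A16 scheme, so only the weights go through a step like this.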
Notes
- lm_head is excluded from quantization.
- Modules matching re:.*visual.* are excluded from quantization.
- This makes the release a practical language-side AWQ compressed variant while preserving excluded modules at higher precision.
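The effect of the two ignore patterns can be illustrated with plain regular expressions. This sketch assumes that an entry prefixed with `re:` is treated as a regex matched from the start of the module name, and that other entries are compared literally; the module names are hypothetical examples, not read from the checkpoint.

```python
import re

IGNORE = ["re:.*lm_head", "re:.*visual.*"]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if a module name matches any ignore entry."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumption: the regex is anchored at the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.self_attn.q_proj",
]
for name in names:
    status = "skipped" if is_ignored(name) else "quantized"
    print(f"{name} -> {status}")
```

With these patterns, only language-side linear layers outside the output head would be quantized.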
Calibration Setup
Calibration data was constructed from the Flickr30k image-caption dataset, a widely used multimodal benchmark containing 31,783 images and 158,915 English captions (five captions per image).
For AWQ calibration, 128 samples were selected from the local Flickr30k parquet files after dataset loading and random shuffling with a fixed seed (seed=42). Each sample was converted into a multimodal chat-style input consisting of:
- one image
- one paired caption text
- processor-generated multimodal fields such as input_ids, attention_mask, pixel_values, and image_grid_thw
This setup was used to provide representative multimodal activations for post-training quantization in the llm-compressor one-shot workflow.
Calibration Details
- Dataset: Flickr30k
- Data format: local parquet files
- Number of calibration samples: 128
- Sampling strategy: shuffled subset with fixed random seed
- Max sequence length: 2048
- Purpose: multimodal activation calibration for AWQ PTQ
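The sampling and formatting steps described above can be sketched as follows. This is an illustration only, not the script used for the release: it uses Python's `random` module, so the selected indices will differ from those produced by the original pipeline, and the message layout (caption placed in the user turn next to the image) is an assumption. The function names, image path, and caption are hypothetical.

```python
import random

NUM_CALIBRATION_SAMPLES = 128
SEED = 42


def pick_indices(dataset_size, k=NUM_CALIBRATION_SAMPLES, seed=SEED):
    """Shuffle all indices with a fixed seed and keep the first k."""
    indices = list(range(dataset_size))
    random.Random(seed).shuffle(indices)
    return indices[:k]


def to_messages(image, caption):
    """Wrap one image and one caption as a single-turn chat-style message list."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": caption},
            ],
        }
    ]


# 31,783 is the Flickr30k image count quoted above.
indices = pick_indices(31_783)
example = to_messages("path/to/image.jpg", "A dog runs across a field.")

# The fixed seed makes the selection reproducible across runs.
print(len(indices), indices == pick_indices(31_783))   # 128 True
```

A processor would then turn each message list into fields such as input_ids, attention_mask, pixel_values, and image_grid_thw.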
Evaluation Configuration
For evaluation in VLMEvalKit, the following model entry can be added to VLMEvalKit/vlmeval/config.py:
"Qwen3-VL-4B-Instruct-AWQ-W4A16": partial(
    vlm.Qwen3VLChat,
    model_path="MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16",
    use_custom_prompt=False,
    use_vllm=True,
    temperature=0.7,
    max_new_tokens=8192,
    repetition_penalty=1.0,
    presence_penalty=1.5,
    top_p=0.8,
    top_k=20,
)
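For ad hoc checks outside VLMEvalKit, the same sampling settings can be carried over to a request against a vLLM OpenAI-compatible server. The sketch below only builds the JSON payload and assumes such a server is already running; `max_new_tokens` is mapped to the API's `max_tokens` field, and the image URL and question are placeholders. It is not how VLMEvalKit itself invokes vLLM.

```python
import json

MODEL_ID = "MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16"

# Sampling settings mirrored from the VLMEvalKit entry above.
SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,                 # vLLM extension to the OpenAI schema
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,   # vLLM extension to the OpenAI schema
    "max_tokens": 8192,          # corresponds to max_new_tokens
}


def build_payload(image_url, question):
    """Build a chat-completions request body for one image and one question."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        **SAMPLING,
    }


# Placeholder inputs, for illustration only.
payload = build_payload("https://example.com/image.jpg", "Describe this image.")
print(json.dumps(payload, indent=2))
```

The resulting body could be sent with any HTTP client to the server's `/v1/chat/completions` endpoint.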
Intended Use
This release is intended for:
- Efficient multimodal inference
- PTQ baseline construction for Qwen3-VL
- Evaluation with VLMEvalKit
- Serving experiments with vLLM
- Research on VLM post-training quantization
Disclaimer
This is a third-party quantized checkpoint and is not an official release from the Qwen team.
Quantization may affect model quality on some multimodal tasks, especially fine-grained visual understanding and reasoning benchmarks.
Citation
If you use this model, please cite the original Qwen3-VL report, AWQ, VLMEvalKit, and Flickr30k.
@article{bai2025qwen3vl,
title={Qwen3-VL Technical Report},
author={Bai, Shuai and Cai, Yuxuan and Zhu, Keming and others},
journal={arXiv preprint arXiv:2511.21631},
year={2025}
}
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv preprint arXiv:2306.00978},
year={2023}
}
@misc{duan2024vlmevalkit,
title={VLMEvalKit: An Open-Source Toolkit for Evaluating Large Vision-Language Models},
author={OpenCompass Team},
howpublished={\url{https://github.com/open-compass/VLMEvalKit}},
year={2024}
}
@article{young2014image,
title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
author={Young, Peter and Lai, Alice and Hodosh, Micah and Hockenmaier, Julia},
journal={Transactions of the Association for Computational Linguistics},
volume={2},
pages={67--78},
year={2014},
publisher={MIT Press}
}
Acknowledgement
This repository is built upon the following excellent open-source projects: