Instructions for using MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16 with libraries and local apps.

Libraries

How to use MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16")
model = AutoModelForMultimodalLM.from_pretrained("MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16

SGLang

How to use MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request as in the pip-based setup above (OpenAI-compatible API on port 30000).

Docker Model Runner
How to use MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16 with Docker Model Runner:
```
docker model run hf.co/MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16
```

Qwen3-VL-4B-Instruct-AWQ-W4A16

This repository provides an AWQ post-training quantized version of Qwen3-VL-4B-Instruct for efficient multimodal inference and evaluation.

Overview

This model is a third-party compressed checkpoint built on top of Qwen3-VL-4B-Instruct, mainly for efficient deployment, benchmarking, and PTQ baseline construction.

The current release uses AWQ W4A16 quantization in the llm-compressor workflow, with group_size=128 and observer="mse".

This release also reduces the on-disk footprint to roughly 48% of the original checkpoint files:

Original size: 4,850,810 KB + 3,816,885 KB = 8,667,695 KB
Quantized size: 4,160,642 KB
Size change: -51.998%
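
The size figures above can be checked with a few lines of arithmetic. This is only a sketch that recomputes the relative change from the KB values listed; the function name is illustrative and not part of any tool used for this release.

```python
# Sizes in KB, copied from the figures listed above.
original_kb = 4_850_810 + 3_816_885   # sum of the two original size figures
quantized_kb = 4_160_642


def size_change_percent(before_kb: int, after_kb: int) -> float:
    """Relative size change in percent (negative means smaller)."""
    return (after_kb - before_kb) / before_kb * 100.0


change = size_change_percent(original_kb, quantized_kb)
print(f"{original_kb:,} KB -> {quantized_kb:,} KB ({change:.3f}%)")
```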

Base Model

Base model: Qwen/Qwen3-VL-4B-Instruct
Model family: Qwen3-VL
Quantization method: AWQ
Quantization format: W4A16
Framework: llm-compressor

Quantization Recipe

The released checkpoint was produced with the following AWQ recipe:

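from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    # Skip the output head and every module whose name contains "visual".
    ignore=[
        "re:.*lm_head", "re:.*visual.*"
    ],
    duo_scaling=False,
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            # W4A16: 4-bit symmetric integer weights, one scale per group of 128.
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "group_size": 128,
                "strategy": "group",
                "dynamic": False,
                "actorder": None,
                "observer": "mse",
            },
        },
    },
)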
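
To illustrate what "num_bits": 4, "symmetric": True, and "group_size": 128 mean for a single weight row, here is a minimal pure-Python sketch of group-wise symmetric integer quantization. It is not the llm-compressor implementation: it picks each group's scale from the absolute maximum rather than the "mse" observer used in the recipe, and it leaves out AWQ's activation-aware channel scaling entirely. All function names are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], num_bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric quantization of one group using a single abs-max scale."""
    qmax = 2 ** (num_bits - 1) - 1    # 7 for 4-bit
    qmin = -(2 ** (num_bits - 1))     # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row: List[float], group_size: int = 128,
                 num_bits: int = 4) -> Tuple[List[int], List[float]]:
    """Split a weight row into groups; return integer codes and per-group scales."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        group_codes, scale = quantize_group(row[start:start + group_size], num_bits)
        codes.extend(group_codes)
        scales.append(scale)
    return codes, scales


def dequantize_row(codes: List[int], scales: List[float],
                   group_size: int = 128) -> List[float]:
    """Reconstruct approximate weights from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Toy row with 256 values, i.e. two groups of 128.
row = [((i * 37) % 101 - 50) / 50.0 for i in range(256)]
codes, scales = quantize_row(row)
restored = dequantize_row(codes, scales)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(scales), "groups, max reconstruction error:", max_error)
```

With an abs-max scale no value is clipped, so the reconstruction error per weight stays within half a quantization step of its group.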

Notes

lm_head is excluded from quantization.
Modules matching re:.*visual.* are excluded from quantization.
As a result, only the language-side Linear layers are quantized to 4-bit weights, while lm_head and the visual modules are kept at higher precision.

Calibration Setup

Calibration data was constructed from the Flickr30k image-caption dataset, a widely used multimodal benchmark containing 31,783 images and 158,915 English captions (five captions per image).

For AWQ calibration, 128 samples were selected from the local Flickr30k parquet files after dataset loading and random shuffling with a fixed seed (seed=42). Each sample was converted into a multimodal chat-style input consisting of:

one image
one paired caption text
processor-generated multimodal fields such as input_ids, attention_mask, pixel_values, and image_grid_thw

This setup was used to provide representative multimodal activations for post-training quantization in the llm-compressor one-shot workflow.

Calibration Details

Dataset: Flickr30k
Data format: local parquet files
Number of calibration samples: 128
Sampling strategy: shuffled subset with fixed random seed
Max sequence length: 2048
Purpose: multimodal activation calibration for AWQ PTQ
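
The sampling step described above (fixed seed, shuffled subset of 128 samples, one image plus one caption per sample) could be sketched as follows. This is an assumption-laden illustration, not the script used for the release: it uses Python's `random` module, so the selected indices will differ from whatever shuffle the original pipeline applied, and the message layout and file name are hypothetical.

```python
import random
from typing import Any, Dict, List

NUM_CALIBRATION_SAMPLES = 128
SEED = 42


def pick_calibration_indices(dataset_size: int,
                             num_samples: int = NUM_CALIBRATION_SAMPLES,
                             seed: int = SEED) -> List[int]:
    """Shuffle all indices with a fixed seed and keep the first num_samples."""
    indices = list(range(dataset_size))
    random.Random(seed).shuffle(indices)
    return indices[:num_samples]


def to_chat_messages(image_ref: str, caption: str) -> List[Dict[str, Any]]:
    """Hypothetical chat-style layout: one image and its caption in a user turn."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": caption},
            ],
        }
    ]


# 31,783 is the Flickr30k image count quoted above.
indices = pick_calibration_indices(31_783)
example = to_chat_messages("flickr30k/example.jpg", "A dog runs across a field.")
print(len(indices), "calibration samples selected")
```

In a real run, each message list would then go through the model's processor to produce input_ids, attention_mask, pixel_values, and image_grid_thw, truncated to the 2048-token maximum.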

Evaluation Configuration

For evaluation in VLMEvalKit, the following model entry can be added to VLMEvalKit/vlmeval/config.py:

"Qwen3-VL-4B-Instruct-AWQ-W4A16": partial(
    vlm.Qwen3VLChat,
    model_path="MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16",
    use_custom_prompt=False,
    use_vllm=True,
    temperature=0.7,
    max_new_tokens=8192,
    repetition_penalty=1.0,
    presence_penalty=1.5,
    top_p=0.8,
    top_k=20,
)
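
When querying a vLLM server directly instead of going through VLMEvalKit, the same sampling settings can be carried in the request body. The sketch below only builds the JSON payload and does not send it. It assumes an OpenAI-compatible chat endpoint, uses `max_tokens` as the counterpart of max_new_tokens, and assumes the server accepts `top_k` and `repetition_penalty` as extra request fields; the helper name and example URL are hypothetical.

```python
import json
from typing import Any, Dict

MODEL_ID = "MLliu6/Qwen3-VL-4B-Instruct-AWQ-W4A16"


def build_chat_request(image_url: str, question: str) -> Dict[str, Any]:
    """Build a chat-completions payload mirroring the evaluation settings above."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        # Standard OpenAI-style sampling fields.
        "temperature": 0.7,
        "top_p": 0.8,
        "presence_penalty": 1.5,
        "max_tokens": 8192,
        # Assumed to be accepted by the server as extra sampling fields.
        "top_k": 20,
        "repetition_penalty": 1.0,
    }


request_body = build_chat_request("https://example.com/image.jpg",
                                  "Describe this image in one sentence.")
print(json.dumps(request_body, indent=2))
```

The printed JSON can be passed to curl via `--data` in the same way as the serving examples earlier in this document.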

Intended Use

This release is intended for:

Efficient multimodal inference
PTQ baseline construction for Qwen3-VL
Evaluation with VLMEvalKit
Serving experiments with vLLM
Research on VLM post-training quantization

Disclaimer

This is a third-party quantized checkpoint and is not an official release from the Qwen team.

Quantization may affect model quality on some multimodal tasks, especially fine-grained visual understanding and reasoning benchmarks.

Citation

If you use this model, please cite the original Qwen3-VL report, AWQ, VLMEvalKit, and Flickr30k (the calibration dataset).

@article{bai2025qwen3vl,
  title={Qwen3-VL Technical Report},
  author={Bai, Shuai and Cai, Yuxuan and Zhu, Keming and others},
  journal={arXiv preprint arXiv:2511.21631},
  year={2025}
}

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv preprint arXiv:2306.00978},
  year={2023}
}

@misc{duan2024vlmevalkit,
  title={VLMEvalKit: An Open-Source Toolkit for Evaluating Large Vision-Language Models},
  author={OpenCompass Team},
  howpublished={\url{https://github.com/open-compass/VLMEvalKit}},
  year={2024}
}

@article{young2014image,
  title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
  author={Young, Peter and Lai, Alice and Hodosh, Micah and Hockenmaier, Julia},
  journal={Transactions of the Association for Computational Linguistics},
  volume={2},
  pages={67--78},
  year={2014},
  publisher={MIT Press}
}

Acknowledgement

This repository is built upon the following excellent open-source projects:
