Instructions to use ForeverBlue/Qwen3-VL-2B-GRACE-W8G128 with libraries and local apps.

Libraries

How to use ForeverBlue/Qwen3-VL-2B-GRACE-W8G128 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ForeverBlue/Qwen3-VL-2B-GRACE-W8G128")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ForeverBlue/Qwen3-VL-2B-GRACE-W8G128")
model = AutoModelForImageTextToText.from_pretrained("ForeverBlue/Qwen3-VL-2B-GRACE-W8G128")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use ForeverBlue/Qwen3-VL-2B-GRACE-W8G128 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ForeverBlue/Qwen3-VL-2B-GRACE-W8G128"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ForeverBlue/Qwen3-VL-2B-GRACE-W8G128",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
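
The same request can also be sent from Python. The sketch below builds the identical JSON payload using only the standard library. `build_chat_payload` and `post_chat` are illustrative helper names, not part of vLLM, and the default base URL assumes the server started above is listening on localhost:8000.

```python
import json
import urllib.request

MODEL_ID = "ForeverBlue/Qwen3-VL-2B-GRACE-W8G128"


def build_chat_payload(prompt, image_url, model=MODEL_ID):
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `post_chat(build_chat_payload("Describe this image in one sentence.", image_url))` returns the parsed response. In the OpenAI-compatible format the generated text is found under `choices[0]["message"]["content"]`.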

Use Docker

docker model run hf.co/ForeverBlue/Qwen3-VL-2B-GRACE-W8G128

SGLang

How to use ForeverBlue/Qwen3-VL-2B-GRACE-W8G128 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ForeverBlue/Qwen3-VL-2B-GRACE-W8G128" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ForeverBlue/Qwen3-VL-2B-GRACE-W8G128",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ForeverBlue/Qwen3-VL-2B-GRACE-W8G128" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use ForeverBlue/Qwen3-VL-2B-GRACE-W8G128 with Docker Model Runner:
```
docker model run hf.co/ForeverBlue/Qwen3-VL-2B-GRACE-W8G128
```

Qwen3-VL-2B-GRACE-W8G128

This repository contains a GRACE-trained Qwen3-VL-2B checkpoint using quantization-aware training (QAT) with W8G128 group-wise INT8 quantization.

This model is associated with our ICML 2026 paper:

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
Accepted to the International Conference on Machine Learning (ICML 2026)

Paper: https://arxiv.org/abs/2601.22709
DOI: https://doi.org/10.48550/arXiv.2601.22709
Code: https://github.com/ForeverBlue816/GRACE

Model Details

Base model: Qwen/Qwen3-VL-2B-Instruct
Method: GRACE
Quantization: W8G128 group-wise INT8 QAT
Training data: ShareGPT4V
Training / evaluation protocol: LLaVA-style multimodal evaluation
Library: Hugging Face Transformers
Repository: ForeverBlue/Qwen3-VL-2B-GRACE-W8G128

📊 Results

Comparison on 7 VLM benchmarks. The 8B model is the distillation teacher (reference upper bound); all GRACE-Qwen3 variants are 2B students. Best result among the 2B Qwen3-VL models is in bold.

We release GRACE on Qwen3-VL here because it is the most current backbone and gives a fairer, up-to-date point of comparison, with the vanilla Qwen3-VL-2B-Instruct as the baseline. The paper itself reports GRACE on LLaVA-1.5 and Qwen2-VL; we additionally release the LLaVA-1.5 W4G128 INT4 checkpoint from the paper in the model zoo below.

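| Model | Params | Precision | HallB | MMBench | ScienceQA | AI2D | MMMU | SEED | MMStar | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B (teacher, ref.) | 8B | BF16 | 61.1 | 84.5 | 85.0 | 85.7 | 69.6 | 77.5 | 70.9 | 76.3 |
| Qwen3-VL-2B (baseline) | 2B | BF16 | 51.4 | 78.4 | 81.4 | 76.9 | 53.4 | 71.2 | 58.3 | 67.3 |
| Qwen3-VL-2B-GRACE | 2B | BF16 | **66.9** | **86.4** | **86.2** | **81.3** | **72.1** | **76.7** | **67.3** | **76.7** |
| Qwen3-VL-2B-GRACE (W8G128) | 2B | INT8 | 66.1 | 85.5 | 85.3 | 80.4 | 71.3 | 75.9 | 66.5 | 75.9 |
| Qwen3-VL-2B-GRACE (W4G128) | 2B | INT4 | 65.4 | 84.6 | 84.3 | 79.5 | 70.5 | 75.1 | 65.8 | 75.0 |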

GRACE lifts the Qwen3-VL-2B baseline by 9.4 points on average (67.3 to 76.7) and slightly exceeds the 8B teacher on average (76.7 vs. 76.3) at roughly 1/4 the parameters. The W8G128 INT8 model retains about 99% of the BF16 GRACE average (75.9 vs. 76.7).
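
The averages and retention figures can be recomputed from the per-benchmark scores. The short check below copies the numbers from the table. The dictionary keys and helper names are illustrative only.

```python
# Per-benchmark scores copied from the results table, in the order
# HallB, MMBench, ScienceQA, AI2D, MMMU, SEED, MMStar.
SCORES = {
    "teacher_8b": [61.1, 84.5, 85.0, 85.7, 69.6, 77.5, 70.9],
    "baseline_2b": [51.4, 78.4, 81.4, 76.9, 53.4, 71.2, 58.3],
    "grace_bf16": [66.9, 86.4, 86.2, 81.3, 72.1, 76.7, 67.3],
    "grace_w8g128": [66.1, 85.5, 85.3, 80.4, 71.3, 75.9, 66.5],
    "grace_w4g128": [65.4, 84.6, 84.3, 79.5, 70.5, 75.1, 65.8],
}


def average(scores):
    """Unweighted mean over the benchmarks, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


AVERAGES = {name: average(scores) for name, scores in SCORES.items()}


def retention(quantized, reference="grace_bf16"):
    """Quantized average as a percentage of the reference average."""
    return round(100 * AVERAGES[quantized] / AVERAGES[reference], 1)


# Gain of the BF16 GRACE student over the vanilla 2B baseline.
gain = round(AVERAGES["grace_bf16"] - AVERAGES["baseline_2b"], 1)

print(AVERAGES)
print("gain over baseline:", gain)
print("W8G128 retention (%):", retention("grace_w8g128"))
print("W4G128 retention (%):", retention("grace_w4g128"))
```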

🤗 Model Zoo

| Model | Backbone | Bits | Group | Checkpoint description | HF Hub |
|---|---|---|---|---|---|
| Qwen3-VL-2B-GRACE-BF16 | Qwen3-VL-2B | bf16 | - | Full-precision GRACE checkpoint; used as the student initialization for the W8/W4 Qwen3-VL runs. | ForeverBlue/Qwen3-VL-2B-GRACE-BF16 |
| Qwen3-VL-2B-GRACE-W8G128 | Qwen3-VL-2B | int8 | 128 | INT8 QAT checkpoint with group size 128; high-retention quantized Qwen3-VL student. | ForeverBlue/Qwen3-VL-2B-GRACE-W8G128 |
| Qwen3-VL-2B-GRACE-W4G128 | Qwen3-VL-2B | int4 | 128 | INT4 QAT checkpoint with group size 128; compact Qwen3-VL release retaining about 98% of the BF16 average. | ForeverBlue/Qwen3-VL-2B-GRACE-W4G128 |
| LLaVA-1.5-7B-GRACE-W4G128 | LLaVA-1.5-7B | int4 | 128 | INT4 QAT checkpoint from the GRACE paper with learned scales; released for reproducing the LLaVA-1.5 experiments. | ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128 |

The BF16 Qwen3-VL checkpoint is the full-precision GRACE student used as the initial student weights for the W8 and W4 Qwen3-VL runs. The LLaVA-1.5 W4G128 checkpoint corresponds to the paper setting and includes GRACE-specific QAT quantized weights for reproducing the INT4 LLaVA experiments.

Intended Use

This model is intended for research purposes, including:

Efficient vision-language models
Quantization-aware training
Low-bit multimodal model deployment
Knowledge distillation for VLM compression
Multimodal model efficiency studies

Out-of-Scope Use

This checkpoint is not intended for:

Safety-critical deployment
Medical / legal / financial decision-making
Production systems requiring reliability guarantees

Like other VLMs, the model may generate hallucinated, biased, or incorrect outputs.

Training Data

The model was trained using ShareGPT4V multimodal instruction data under a LLaVA-style multimodal fine-tuning pipeline.

Dataset:

Lin-Chen/ShareGPT4V

Quantization Details

This checkpoint uses quantization-aware training (QAT) with group-wise W8G128 quantization.

Configuration:

Weight precision: INT8
Group size: 128
Quantization scheme: Group-wise QAT
Method: GRACE
Backbone: Qwen3-VL-2B-Instruct

Depending on the inference backend, specialized quantized kernels or custom loading logic may be required to obtain real INT8 deployment benefits.
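
To illustrate what group-wise INT8 with group size 128 means for a single weight matrix, here is a minimal NumPy sketch. It assumes symmetric absmax scaling with one scale per 128 consecutive input weights. This is not the GRACE implementation: the function names are hypothetical, and the exact scheme used for this checkpoint (for example how scales are obtained during QAT) is defined in the GRACE repository.

```python
import numpy as np


def quantize_groupwise_int8(weight, group_size=128):
    """Symmetric per-group INT8 quantization along the input dimension.

    Assumption: absmax scaling, one scale per group of `group_size` weights.
    Returns int8 codes of shape (out, n_groups, group_size) and float32 scales
    of shape (out, n_groups, 1).
    """
    weight = np.asarray(weight, dtype=np.float32)
    out_features, in_features = weight.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    # Map the largest magnitude in each group to 127; avoid dividing by zero.
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    codes = np.clip(np.rint(groups / scales), -127, 127).astype(np.int8)
    return codes, scales


def dequantize_groupwise(codes, scales):
    """Reconstruct a float32 weight matrix from int8 codes and group scales."""
    out_features, n_groups, group_size = codes.shape
    restored = codes.astype(np.float32) * scales
    return restored.reshape(out_features, n_groups * group_size)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 256)).astype(np.float32)
    codes, scales = quantize_groupwise_int8(w)
    w_hat = dequantize_groupwise(codes, scales)
    print("max abs error:", float(np.abs(w - w_hat).max()))
```

In quantization-aware training, a quantize-then-dequantize step of this kind is typically applied to the weights in the forward pass so the model learns to tolerate the rounding error. Per group, that error is bounded by half the group's scale.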

Repository Files

This repository may contain:

model.safetensors / model-*.safetensors — model weights
qat_quantized_weights.bin — QAT quantized weight artifact
config.json — model configuration
generation_config.json — generation configuration
tokenizer files
processor / preprocessing configuration files
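
As a quick way to see which of these artifacts a given download actually contains, the sketch below matches a list of file names against the patterns listed above. `find_artifacts` is a hypothetical helper. Tokenizer and processor file names are not spelled out in this card, so they are left out of the check.

```python
from fnmatch import fnmatch

# Patterns taken from the file list in this card.
EXPECTED_PATTERNS = {
    "weights": ["model.safetensors", "model-*.safetensors"],
    "qat_artifact": ["qat_quantized_weights.bin"],
    "config": ["config.json"],
    "generation_config": ["generation_config.json"],
}


def find_artifacts(filenames):
    """Map each artifact role to the sorted file names matching its patterns."""
    return {
        role: sorted(
            name
            for name in filenames
            if any(fnmatch(name, pattern) for pattern in patterns)
        )
        for role, patterns in EXPECTED_PATTERNS.items()
    }
```

The file names can come from a local directory listing or, for example, from `huggingface_hub.list_repo_files("ForeverBlue/Qwen3-VL-2B-GRACE-W8G128")`. An empty list for `qat_artifact` would indicate that the QAT-specific weights are not present in that copy.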

Loading

Please use a Qwen3-VL-compatible Transformers environment or the official Qwen3-VL codebase.

from transformers import AutoModelForImageTextToText, AutoProcessor

repo_id = "ForeverBlue/Qwen3-VL-2B-GRACE-W8G128"

# Processor: tokenizer plus image preprocessing for the checkpoint.
processor = AutoProcessor.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

# Model: device_map="auto" places the weights on the available device(s)
# and requires the accelerate package.
model = AutoModelForImageTextToText.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto",
)

Recommended:

recent transformers version
Qwen3-VL compatible environment
CUDA GPU inference backend for large-scale evaluation
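
These recommendations can be checked programmatically before loading the model. The sketch below is a minimal environment check. The minimum version 4.57.0 is an assumption about when Qwen3-VL support became available in transformers, and all helper names are hypothetical.

```python
from importlib import metadata

# Assumption: 4.57.0 is taken as the first transformers release with
# Qwen3-VL support; adjust if your setup differs.
MIN_TRANSFORMERS = "4.57.0"


def version_tuple(version):
    """Turn '4.57.0' or '4.57.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_TRANSFORMERS):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Report the transformers version and whether a CUDA device is visible."""
    report = {"transformers": None, "transformers_ok": False, "cuda_available": False}
    try:
        report["transformers"] = metadata.version("transformers")
        report["transformers_ok"] = meets_minimum(report["transformers"])
    except metadata.PackageNotFoundError:
        pass  # transformers is not installed
    try:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass  # torch is not installed
    return report
```

Calling `check_environment()` returns a small dictionary that can be printed or logged at the start of an evaluation run.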

Evaluation

The checkpoint follows a LLaVA-style multimodal evaluation protocol.

The results table above covers the following seven benchmarks:

HallusionBench
MMBench
ScienceQA
AI2D
MMMU
SEED-Bench
MMStar

Please refer to the associated GRACE paper and the results table above for detailed evaluation settings and results.

Important Notes

This checkpoint includes QAT-specific quantized weights in qat_quantized_weights.bin. Depending on the inference codebase, additional GRACE-specific quantization-aware loading logic may be required.

The standard from_pretrained call may load the model configuration and checkpoint files, but fully reproducing the intended INT8 QAT behavior may require the GRACE repository:

https://github.com/ForeverBlue816/GRACE

Limitations

This model is released for research purposes.
The quantized checkpoint may require custom loading logic for QAT-specific weights.
Performance may vary depending on the evaluation codebase, preprocessing, generation parameters, and multimodal benchmark implementation.
Users should follow the license and usage restrictions of the original Qwen3-VL-2B-Instruct base model.
Specialized kernels or custom loading code may be required to realize practical INT8 speed or memory benefits.
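
To put the memory point in numbers, the following back-of-the-envelope estimate compares BF16 storage with group-wise INT8 storage. It is an idealized sketch under stated assumptions: exactly 2e9 parameters (the rounded "2B" figure), every weight quantized, and one 16-bit scale per group of 128 weights. Real checkpoints and backends may keep some tensors in higher precision.

```python
def weight_memory_gib(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GiB, plus one scale per group if grouped."""
    total_bits = num_params * bits_per_weight
    if group_size is not None:
        # One extra scale value for every `group_size` weights.
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 2**30


NUM_PARAMS = 2e9  # assumption: rounded parameter count

bf16_gib = weight_memory_gib(NUM_PARAMS, bits_per_weight=16)
int8_gib = weight_memory_gib(NUM_PARAMS, bits_per_weight=8, group_size=128)

print(f"BF16 weights:   {bf16_gib:.2f} GiB")
print(f"W8G128 weights: {int8_gib:.2f} GiB")
print(f"ratio:          {int8_gib / bf16_gib:.3f}")
```

Under these assumptions the scales add 16/128 = 0.125 bits per weight, so the quantized weights take slightly more than half the BF16 footprint. This saving only materializes if the backend actually stores and computes with INT8 weights.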

Citation

If you use this model, please cite:

@article{chen2026gated,
  title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
  author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
  journal={arXiv preprint arXiv:2601.22709},
  year={2026}
}

Please also cite the original Qwen3-VL work when using this model.

License

Released under the MIT license.

Users should additionally comply with:

Qwen3-VL base model license
ShareGPT4V dataset terms
applicable downstream usage restrictions

Paper for ForeverBlue/Qwen3-VL-2B-GRACE-W8G128

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs (arXiv 2601.22709, published Jan 30)