Qwen3-VL-2B-GRACE-BF16
This repository contains a full-precision BF16 checkpoint of Qwen3-VL-2B trained using the GRACE framework.
This model is associated with our ICML 2026 paper:
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
Accepted to the International Conference on Machine Learning (ICML 2026)
- Paper: https://arxiv.org/abs/2601.22709
- DOI: https://doi.org/10.48550/arXiv.2601.22709
- Code: https://github.com/ForeverBlue816/GRACE
Model Details
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Training framework: GRACE
- Precision: BF16 full precision
- Training data: ShareGPT4V
- Evaluation protocol: LLaVA-style multimodal evaluation
- Repository: FoeverBLUE/Qwen3-VL-2B-GRACE-BF16
📊 Results
Comparison on 7 VLM benchmarks. The 8B model is the distillation teacher (reference upper bound); all GRACE-Qwen3 variants are 2B students. Best result among the 2B Qwen3-VL models is in bold.
We release GRACE on Qwen3-VL here because it is the most current backbone and gives a fairer, up-to-date point of comparison, with the vanilla Qwen3-VL-2B-Instruct as the baseline. The paper itself reports GRACE on LLaVA-1.5 and Qwen2-VL; we additionally release the LLaVA-1.5 W4G128 INT4 checkpoint from the paper in the model zoo below.
| Model | Params | Precision | HallB | MMBench | ScienceQA | AI2D | MMMU | SEED | MMStar | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B (teacher, ref.) | 8B | BF16 | 61.1 | 84.5 | 85.0 | 85.7 | 69.6 | 77.5 | 70.9 | 76.3 |
| Qwen3-VL-2B (baseline) | 2B | BF16 | 51.4 | 78.4 | 81.4 | 76.9 | 53.4 | 71.2 | 58.3 | 67.3 |
| Qwen3-VL-2B-GRACE | 2B | BF16 | **66.9** | **86.4** | **86.2** | **81.3** | **72.1** | **76.7** | **67.3** | **76.7** |
| Qwen3-VL-2B-GRACE (W8G128) | 2B | INT8 | 66.1 | 85.5 | 85.3 | 80.4 | 71.3 | 75.9 | 66.5 | 75.9 |
| Qwen3-VL-2B-GRACE (W4G128) | 2B | INT4 | 65.4 | 84.6 | 84.3 | 79.5 | 70.5 | 75.1 | 65.8 | 75.0 |
GRACE lifts the Qwen3-VL-2B baseline by 9.4 points on average (67.3 to 76.7) and slightly exceeds the 8B teacher's average (76.7 vs. 76.3) at roughly one quarter of the parameters. The W4G128 INT4 model retains about 98% of the BF16 average (75.0 vs. 76.7).
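The summary figures above can be recomputed from the table. The following is a minimal sketch that copies the per-benchmark scores from the rows above; the dictionary keys are labels chosen for this example, not official model names.

```python
# Per-benchmark scores copied from the results table, in column order:
# HallB, MMBench, ScienceQA, AI2D, MMMU, SEED, MMStar.
SCORES = {
    "teacher_8b_bf16": [61.1, 84.5, 85.0, 85.7, 69.6, 77.5, 70.9],
    "baseline_2b_bf16": [51.4, 78.4, 81.4, 76.9, 53.4, 71.2, 58.3],
    "grace_2b_bf16": [66.9, 86.4, 86.2, 81.3, 72.1, 76.7, 67.3],
    "grace_2b_w8g128": [66.1, 85.5, 85.3, 80.4, 71.3, 75.9, 66.5],
    "grace_2b_w4g128": [65.4, 84.6, 84.3, 79.5, 70.5, 75.1, 65.8],
}


def average(scores):
    """Unweighted mean over the seven benchmarks."""
    return sum(scores) / len(scores)


# Averages rounded to one decimal, as in the "Avg" column.
averages = {name: round(average(values), 1) for name, values in SCORES.items()}

# Average gain of the GRACE BF16 student over the vanilla 2B baseline.
gain_over_baseline = round(averages["grace_2b_bf16"] - averages["baseline_2b_bf16"], 1)

# Fraction of the BF16 average retained by the INT4 (W4G128) student.
int4_retention = averages["grace_2b_w4g128"] / averages["grace_2b_bf16"]

print(averages)
print(f"gain over baseline: +{gain_over_baseline}")
print(f"INT4 retention: {int4_retention:.1%}")
```

The retention ratio is computed on the rounded averages, which is why it is quoted as "about 98%" rather than as an exact figure.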
🤗 Model Zoo
| Model | Backbone | Bits | Group | Checkpoint description | HF Hub |
|---|---|---|---|---|---|
| Qwen3-VL-2B-GRACE-BF16 | Qwen3-VL-2B | bf16 | — | Full-precision GRACE checkpoint; used as the student initialization for the W8/W4 Qwen3-VL runs. | FoeverBLUE/Qwen3-VL-2B-GRACE-BF16 |
| Qwen3-VL-2B-GRACE-W8G128 | Qwen3-VL-2B | int8 | 128 | INT8 QAT checkpoint with group size 128; high-retention quantized Qwen3-VL student. | FoeverBLUE/Qwen3-VL-2B-GRACE-W8G128 |
| Qwen3-VL-2B-GRACE-W4G128 | Qwen3-VL-2B | int4 | 128 | INT4 QAT checkpoint with group size 128; compact Qwen3-VL release retaining about 98% of the BF16 average. | FoeverBLUE/Qwen3-VL-2B-GRACE-W4G128 |
| LLaVA-1.5-7B-GRACE-W4G128 | LLaVA-1.5-7B | int4 | 128 | INT4 QAT checkpoint from the GRACE paper with learned scales; released for reproducing the LLaVA-1.5 experiments. | FoeverBLUE/LLaVA-1.5-7B-GRACE-W4G128 |
The BF16 Qwen3-VL checkpoint is the full-precision GRACE student used as the initial student weights for the W8 and W4 Qwen3-VL runs. The LLaVA-1.5 W4G128 checkpoint corresponds to the paper setting and includes GRACE-specific QAT quantized weights for reproducing the INT4 LLaVA experiments.
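The suffixes in the checkpoint names encode the weight bit-width and the quantization group size (for example, W4G128 is int4 with groups of 128, as listed in the Bits and Group columns). To illustrate what a group size means, here is a generic sketch of symmetric round-to-nearest group-wise quantization in plain Python. It is an assumption-laden illustration only: it is not the GRACE QAT procedure, which according to the table uses learned scales, and the function names are hypothetical.

```python
def quantize_groupwise(weights, bits=4, group_size=128):
    """Quantize a flat list of floats with one scale per group.

    Generic round-to-nearest illustration, NOT the GRACE QAT method.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for int4, 127 for int8
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            quantized.append(max(-qmax, min(qmax, q)))  # clamp to the integer range
    return quantized, scales


def dequantize_groupwise(quantized, scales, group_size=128):
    """Map integers back to floats using the scale of their group."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


# Toy example: 256 weights form two groups of 128, so two scales are stored.
toy_weights = [((i % 17) - 8) / 8.0 for i in range(256)]
q_values, group_scales = quantize_groupwise(toy_weights, bits=4, group_size=128)
restored = dequantize_groupwise(q_values, group_scales, group_size=128)
max_error = max(abs(a - b) for a, b in zip(toy_weights, restored))
```

A smaller group size stores more scales and usually lowers the reconstruction error, at the cost of extra metadata per weight.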
Intended Use
This model is intended for research on:
- Efficient vision-language models
- Knowledge distillation for VLMs
- Multimodal alignment
- Full-precision GRACE training
- BF16 baseline / teacher-student comparison studies
Training Details
This checkpoint is a full-precision BF16 model trained under the GRACE framework.
Configuration:
- Precision: BF16
- Training method: GRACE
- Backbone: Qwen3-VL-2B-Instruct
- Dataset: ShareGPT4V
- Evaluation: LLaVA-style multimodal benchmarks
Unlike the QAT releases, this model does not use weight quantization.
Files
- model.safetensors / model-*.safetensors
- config.json
- generation_config.json
- tokenizer files
- processor files
Loading
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

repo_id = "FoeverBLUE/Qwen3-VL-2B-GRACE-BF16"

processor = AutoProcessor.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForImageTextToText.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
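Once the model and processor are loaded, a single image question can be answered with the usual Transformers chat-template pattern. The sketch below assumes a Qwen3-VL-compatible Transformers version; the helper names `build_messages` and `answer_about_image` are hypothetical and introduced only for this example.

```python
def build_messages(image_url, question):
    """Build a single-turn chat with one image and one text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


def answer_about_image(model, processor, image_url, question, max_new_tokens=40):
    """Generate a short answer; expects the model and processor loaded as above."""
    messages = build_messages(image_url, question)
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated answer is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    return processor.decode(outputs[0][prompt_length:], skip_special_tokens=True)


# Example call (downloads the image and runs generation):
# print(answer_about_image(
#     model,
#     processor,
#     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
#     "What animal is on the candy?",
# ))
```

Generation settings such as `max_new_tokens` are illustrative defaults, not the settings used for the reported benchmark numbers.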
Important Notes
This is the full-precision BF16 GRACE checkpoint. It does not include INT8 or INT4 QAT weight compression. For quantized versions, please refer to the W8G128 and W4G128 checkpoints listed in the Model Zoo.
The standard from_pretrained call should load this BF16 checkpoint directly in
a Qwen3-VL-compatible Transformers environment. For reproducing the GRACE
training or distillation pipeline, please refer to the official code repository:
https://github.com/ForeverBlue816/GRACE
Limitations
- This model is released for research purposes.
- Performance may vary depending on the evaluation codebase, preprocessing, generation parameters, and multimodal benchmark implementation.
- Users should follow the license and usage restrictions of the original Qwen3-VL-2B-Instruct base model.
- This checkpoint is not optimized for low-bit inference; use the W8G128 or W4G128 release for quantized deployment studies.
Citation
If you use this model, please cite:
@article{chen2026gated,
  title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
  author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
  journal={arXiv preprint arXiv:2601.22709},
  year={2026}
}