LLaVA-1.5-7B-GRACE-W4G128-AWQ
This repository provides the AWQ-packed INT4 deployment checkpoint of our
GRACE-trained LLaVA-1.5-7B model. The language-model weights are stored as
real packed 4-bit tensors in the AutoAWQ GEMM layout (qweight, qzeros, and
scales), rather than as fake-quantized BF16 tensors.
This model is associated with our ICML 2026 paper:
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
Accepted to the International Conference on Machine Learning (ICML 2026)
- Paper: https://arxiv.org/abs/2601.22709
- DOI: https://doi.org/10.48550/arXiv.2601.22709
- Code: https://github.com/ForeverBlue816/GRACE
- Model repository: https://huggingface.co/ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ
Model Summary
- Backbone: LLaVA-1.5-7B
- Language model: Vicuna-7B-v1.5
- Vision encoder: CLIP ViT-L/14-336
- Method: GRACE: Gated Relational Alignment via Confidence-based Distillation
- Quantization: W4G128 group-wise INT4 quantization-aware training, packed into the AutoAWQ GEMM format
- Training data: ShareGPT4V-style multimodal instruction data
- Evaluation protocol: LLaVA-style multimodal evaluation
- Recommended loader: GRACE / LLaVA-1.5 codebase with AWQ reconstruction utilities
This repository is intended for research on efficient vision-language models, low-bit quantization, quantization-aware training, and multimodal knowledge distillation.
Important Note
This is not a standard drop-in Transformers AWQ checkpoint. The language-model linear layers are stored as packed AWQ tensors and must be reconstructed with the GRACE quantization-aware loading code.
A plain call such as:
AutoModel.from_pretrained("ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ")
will not correctly reconstruct the INT4 layers. Please use the loading procedure shown below.
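One quick way to see why a plain loader is not enough is to look at the tensor names stored in the checkpoint. The sketch below is only illustrative: the helper `summarize_tensor_names` and the example names are hypothetical, and the only fact taken from this card is that packed layers carry `qweight`, `qzeros`, and `scales` tensors.

```python
from collections import Counter

# Suffixes of the AutoAWQ GEMM tensors named in this model card.
AWQ_SUFFIXES = ("qweight", "qzeros", "scales")


def summarize_tensor_names(names):
    """Count AWQ-packed tensors (by suffix) versus all other tensors."""
    counts = Counter()
    for name in names:
        suffix = name.rsplit(".", 1)[-1]
        counts[suffix if suffix in AWQ_SUFFIXES else "other"] += 1
    return dict(counts)


# Hypothetical tensor names, for illustration only.
example_names = [
    "model.layers.0.self_attn.q_proj.qweight",
    "model.layers.0.self_attn.q_proj.qzeros",
    "model.layers.0.self_attn.q_proj.scales",
    "model.embed_tokens.weight",
]
summary = summarize_tensor_names(example_names)
print(summary)
```

For a real checkpoint, the names could be collected with `safetensors.safe_open(path, framework="pt")` and its `keys()` method. A layer that exposes `qweight` instead of `weight` cannot be filled by a standard linear module, which is why the reconstruction step is needed.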
Model Zoo
| Model | Backbone | Bits | Group size | Description | HF Hub |
|---|---|---|---|---|---|
| Qwen3-VL-2B-GRACE-BF16 | Qwen3-VL-2B | BF16 | — | Full-precision GRACE checkpoint used as the student initialization for the Qwen3-VL W8/W4 runs. | FoeverBLUE/Qwen3-VL-2B-GRACE-BF16 |
| Qwen3-VL-2B-GRACE-W8G128 | Qwen3-VL-2B | INT8 | 128 | INT8 QAT checkpoint with group size 128. | FoeverBLUE/Qwen3-VL-2B-GRACE-W8G128 |
| Qwen3-VL-2B-GRACE-W4G128 | Qwen3-VL-2B | INT4 | 128 | INT4 QAT checkpoint with group size 128. | FoeverBLUE/Qwen3-VL-2B-GRACE-W4G128 |
| LLaVA-1.5-7B-GRACE-W4G128 | LLaVA-1.5-7B | INT4 | 128 | QAT checkpoint with BF16 weights constrained to the INT4 grid and a quantized-weight sidecar. | FoeverBLUE/LLaVA-1.5-7B-GRACE-W4G128 |
| LLaVA-1.5-7B-GRACE-W4G128-AWQ | LLaVA-1.5-7B | INT4 | 128 | This repository. Real AWQ-packed deployment build with qweight, qzeros, and scales. | ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ |
The LLaVA-1.5-7B-GRACE-W4G128 repository contains the QAT checkpoint, while this
repository contains the same model packed into real 4-bit AWQ tensors for
deployment and storage-efficient inference.
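As a rough sanity check on the storage saving, the arithmetic below estimates the size of the quantized language-model linear layers before and after packing. It is a back-of-the-envelope sketch, not a measurement: the layer dimensions are the usual Vicuna-7B values and are assumed here, as is the per-group overhead of one FP16 scale and one 4-bit zero-point.

```python
HIDDEN = 4096          # assumed Vicuna-7B hidden size
INTERMEDIATE = 11008   # assumed Vicuna-7B MLP width
LAYERS = 32            # assumed number of decoder layers
GROUP_SIZE = 128       # group size stated in this model card

# q/k/v/o projections plus gate/up/down projections in each decoder layer.
params_per_layer = 4 * HIDDEN * HIDDEN + 3 * HIDDEN * INTERMEDIATE
quantized_params = LAYERS * params_per_layer

# One 4-bit code per weight, plus an assumed FP16 scale and 4-bit zero-point
# shared by each group of 128 weights.
bits_per_weight = 4 + (16 + 4) / GROUP_SIZE

bf16_gb = quantized_params * 16 / 8 / 1e9
packed_gb = quantized_params * bits_per_weight / 8 / 1e9
saved_gb = bf16_gb - packed_gb

print(f"quantized parameters: {quantized_params:,}")
print(f"BF16: {bf16_gb:.2f} GB, packed INT4: {packed_gb:.2f} GB, saved: {saved_gb:.2f} GB")
```

Under these assumptions the linear layers shrink by roughly 9.6 GB, which is in line with the overall footprint change from approximately 14.2 GB to approximately 4.6 GB reported under Quantization Details. The modules kept in FP16 account for the remainder.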
Quantization Details
This checkpoint uses W4G128 group-wise INT4 quantization-aware training and is packed into the AutoAWQ GEMM tensor format.
- Weight precision: INT4
- Grouping: group size 128
- QAT scheme: symmetric signed per-group Learned Step Size quantization
- Integer code range: [-8, 7]
- AWQ representation: the symmetric QAT codes are represented in the AWQ GEMM layout using a constant zero-point offset
- Packed tensors: qweight, qzeros, and scales
- Quantized modules: language-model linear layers only
  - self_attn.q_proj
  - self_attn.k_proj
  - self_attn.v_proj
  - self_attn.o_proj
  - mlp.gate_proj
  - mlp.up_proj
  - mlp.down_proj
- Kept in FP16: CLIP vision tower, multimodal projector, token embeddings, LM head, and normalization layers
- Footprint: reduced from approximately 14.2 GB in BF16 to approximately 4.6 GB in packed INT4 format
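The relationship between the symmetric signed codes and a zero-point representation can be sketched in a few lines of NumPy. This is a simplified illustration, not the GRACE training code: the scales here are a plain max-based stand-in for the learned step sizes, and the zero-point value of 8 is an assumption that follows from shifting the range [-8, 7] onto [0, 15].

```python
import numpy as np

GROUP_SIZE = 128
QMIN, QMAX = -8, 7   # signed INT4 code range from the model card
ZERO_POINT = 8       # assumed constant offset mapping [-8, 7] onto [0, 15]


def quantize_symmetric(weights, scales):
    """Round weights (rows, cols) to signed INT4 codes with one scale per group."""
    rows, cols = weights.shape
    grouped = weights.reshape(rows, cols // GROUP_SIZE, GROUP_SIZE)
    codes = np.clip(np.rint(grouped / scales[..., None]), QMIN, QMAX)
    return codes.astype(np.int8)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)

# Stand-in for the learned step size: max |w| / 8 within each group.
scales = np.abs(w.reshape(4, 2, GROUP_SIZE)).max(axis=-1) / 8.0

signed = quantize_symmetric(w, scales)               # codes in [-8, 7]
unsigned = (signed + ZERO_POINT).astype(np.uint8)    # codes in [0, 15]

# Dequantizing either representation yields the same weights.
deq_signed = signed.astype(np.float32) * scales[..., None]
deq_unsigned = (unsigned.astype(np.float32) - ZERO_POINT) * scales[..., None]
print(np.array_equal(deq_signed, deq_unsigned))
```

Because the offset is the same for every group, the unsigned form carries no extra information. It only restates the symmetric codes in the shape a zero-point-based layout expects.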
Specialized INT4 kernels such as autoawq-kernels can be used for practical
inference acceleration. Without fused kernels, the GRACE loader can still
reconstruct and run the model through a correct dequantization path, although it
may be slower.
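A dequantization fallback of the kind mentioned above can be pictured as three steps: unpack the 4-bit codes, apply `(code - zero_point) * scale` per group, then run an ordinary matrix multiply. The NumPy sketch below is an illustration only. It packs eight codes per 32-bit word in plain sequential order, which is an assumption made for readability; the real AutoAWQ GEMM layout may order and orient the packed values differently, and the GRACE loader should be treated as the reference.

```python
import numpy as np

GROUP_SIZE = 128
ZERO_POINT = 8  # assumed constant zero-point for the symmetric codes
SHIFTS = np.arange(8, dtype=np.uint32) * 4  # bit offset of each nibble


def pack_nibbles(codes):
    """Pack unsigned 4-bit codes (rows, cols) into uint32 words, low nibble first."""
    rows, cols = codes.shape
    words = codes.reshape(rows, cols // 8, 8).astype(np.uint32)
    return np.bitwise_or.reduce(words << SHIFTS, axis=-1)


def unpack_nibbles(packed):
    """Invert pack_nibbles, returning uint8 codes of shape (rows, cols)."""
    codes = (packed[..., None] >> SHIFTS) & np.uint32(0xF)
    return codes.reshape(packed.shape[0], -1).astype(np.uint8)


def dequantize(packed, scales):
    """Rebuild float weights as (code - zero_point) * scale, one scale per group."""
    codes = unpack_nibbles(packed).astype(np.float32) - ZERO_POINT
    return codes * np.repeat(scales, GROUP_SIZE, axis=1)


rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(4, 256), dtype=np.uint8)
scales = rng.uniform(0.01, 0.1, size=(4, 2)).astype(np.float32)

packed = pack_nibbles(codes)           # shape (4, 32)
weights = dequantize(packed, scales)   # shape (4, 256)

x = rng.standard_normal((3, 256)).astype(np.float32)
y = x @ weights.T                      # plain matmul on dequantized weights
print(packed.shape, weights.shape, y.shape)
```

This path materializes full-precision weights before the matmul, so it gives up the memory-bandwidth benefit that fused INT4 kernels provide, which is one reason it may run slower.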
Repository Files
This repository contains the following main files:
- config.json: model configuration. The mm_vision_tower field should point to openai/clip-vit-large-patch14-336.
- model.safetensors: checkpoint file containing AWQ-packed tensors for the quantized language-model linear layers and FP16 tensors for the remaining modules.
- awq_quantized_modules.json: metadata listing the AWQ-packed module names, bit width, and group size required by the GRACE loader.
- tokenizer.model: SentencePiece tokenizer model.
- tokenizer_config.json: tokenizer configuration.
- special_tokens_map.json: special-token mapping.
- generation_config.json: generation configuration.
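Before loading, it can help to confirm that the metadata file has the fields the loader reads. The helper below is a hypothetical sketch: the key names `modules`, `bits`, and `group_size` come from the loading code in this card, while the function name and the specific checks are assumptions made for illustration.

```python
import json

# Keys read from awq_quantized_modules.json by the loading code in this card.
REQUIRED_KEYS = ("modules", "bits", "group_size")


def load_awq_metadata(path):
    """Read the AWQ metadata file and check the fields the loader relies on."""
    with open(path, encoding="utf-8") as handle:
        meta = json.load(handle)

    missing = [key for key in REQUIRED_KEYS if key not in meta]
    if missing:
        raise KeyError(f"metadata is missing keys: {missing}")
    # This repository is described as W4G128, so other values are suspicious.
    if meta["bits"] != 4 or meta["group_size"] != 128:
        raise ValueError("expected a W4G128 checkpoint (bits=4, group_size=128)")
    if not meta["modules"]:
        raise ValueError("no AWQ-packed modules are listed")
    return meta
```

A call such as `load_awq_metadata(os.path.join(ckpt_dir, "awq_quantized_modules.json"))` would then fail early with a readable message instead of surfacing later as a shape mismatch.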
Loading
This checkpoint should be loaded through the GRACE / LLaVA-1.5 codebase.
git clone https://github.com/ForeverBlue816/GRACE
cd GRACE/deployment
Download the checkpoint from Hugging Face:
from huggingface_hub import snapshot_download
ckpt_dir = snapshot_download("ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ")
print(ckpt_dir)
Run one-shot inference with the provided deployment script:
python scripts/deploy_awq_llava.py \
--load-packed /path/to/LLaVA-1.5-7B-GRACE-W4G128-AWQ \
--image-file your_image.jpg \
--query "Describe this image in detail." \
--conv-mode vicuna_v1
Programmatic loading:
import glob
import json
import os

from safetensors.torch import load_file

from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model
from llava.quantize import build_awq_skeleton

ckpt_dir = "/path/to/LLaVA-1.5-7B-GRACE-W4G128-AWQ"

# Metadata listing the AWQ-packed module names, bit width, and group size.
with open(os.path.join(ckpt_dir, "awq_quantized_modules.json")) as f:
    meta = json.load(f)

# Load the tokenizer, model, and image processor through the LLaVA builder.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    ckpt_dir,
    None,
    get_model_name_from_path(ckpt_dir),
    device_map="cuda",
    device="cuda",
)

# Prepare the listed modules to receive the packed AWQ tensors.
build_awq_skeleton(
    model,
    meta["modules"],
    bits=meta["bits"],
    group_size=meta["group_size"],
    device="cuda",
)

# Collect every tensor from the safetensors shard(s) in the checkpoint.
state_dict = {}
for path in glob.glob(os.path.join(ckpt_dir, "*.safetensors")):
    state_dict.update(load_file(path))

# Keep only the tensors that belong to the AWQ-packed modules.
prefixes = tuple(name + "." for name in meta["modules"])
awq_state_dict = {
    key: value
    for key, value in state_dict.items()
    if key.startswith(prefixes)
}

# strict=False because only the AWQ tensors are passed in here.
missing, unexpected = model.load_state_dict(awq_state_dict, strict=False)
model.eval()
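Since `load_state_dict` is called with `strict=False` and only the AWQ tensors are passed in, the returned `missing` list will normally contain every non-AWQ parameter, which is expected. What matters is that nothing is unexpected and that no missing key belongs to an AWQ module. The helper below is a hypothetical sketch of that check, and the example key names are made up for illustration.

```python
def check_awq_load(missing, unexpected, awq_modules):
    """Return a list of problems found after loading the AWQ state dict."""
    prefixes = tuple(name + "." for name in awq_modules)
    problems = [f"unexpected key: {key}" for key in unexpected]
    problems += [
        f"AWQ tensor not loaded: {key}"
        for key in missing
        if key.startswith(prefixes)
    ]
    return problems


# Hypothetical module and key names, for illustration only.
modules = ["model.layers.0.self_attn.q_proj"]
ok = check_awq_load(
    missing=["model.embed_tokens.weight"],  # non-AWQ parameter, expected
    unexpected=[],
    awq_modules=modules,
)
bad = check_awq_load(
    missing=["model.layers.0.self_attn.q_proj.qzeros"],
    unexpected=[],
    awq_modules=modules,
)
print(ok, bad)
```

In the loading code above, this would be called as `check_awq_load(missing, unexpected, meta["modules"])`, and an empty result suggests that all packed tensors found a home.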
The CLIP vision tower is resolved from the mm_vision_tower field in
config.json. By default, this field should be:
"mm_vision_tower": "openai/clip-vit-large-patch14-336"
Evaluation
This checkpoint follows the LLaVA-style multimodal evaluation protocol and is evaluated with greedy decoding. Representative benchmarks include:
- VQAv2
- GQA
- TextVQA
- POPE
- MME
- ScienceQA
- SEED-Bench
- MMBench
This AWQ-packed checkpoint is a deployment-oriented repacking of the GRACE-trained LLaVA-1.5-7B W4G128 model. It is intended to reproduce the INT4 LLaVA-1.5 results reported in the GRACE paper, subject to the same evaluation codebase, preprocessing, generation settings, and benchmark versions.
Please refer to the paper and the GRACE repository for the full experimental setup and benchmark results.
Intended Use
This model is intended for research and development in:
- Efficient vision-language models
- Low-bit VLM quantization
- Quantization-aware training
- Multimodal knowledge distillation
- Storage-efficient and deployment-oriented VLM inference
- Comparisons with FP16, INT8, PTQ, AWQ, GPTQ, and other compression methods
Out-of-Scope Use
This model is not intended for safety-critical or high-stakes applications, including but not limited to medical, legal, financial, or security-sensitive decision-making. The model may produce hallucinated, biased, or incorrect outputs and should be evaluated carefully before deployment.
Training Data
The model was trained using ShareGPT4V-style multimodal instruction data.
Dataset: Lin-Chen/ShareGPT4V
The training setup follows a LLaVA-style multimodal instruction-tuning and evaluation pipeline.
Limitations
- This checkpoint requires the GRACE quantization-aware loading code.
- It is not a standard drop-in Transformers AWQ checkpoint.
- Runtime speed depends on the availability of optimized INT4 kernels.
- Performance may vary depending on preprocessing, prompt templates, decoding settings, and benchmark implementations.
- Users should comply with the license and usage terms of the original LLaVA-1.5, Vicuna, CLIP, and training-data sources.
Citation
If you use this model, please cite the corresponding GRACE paper:
@article{chen2026gated,
title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
journal={arXiv preprint arXiv:2601.22709},
year={2026}
}
Please also cite the original LLaVA and Vicuna works when using this model.
License
This model is released under the Apache-2.0 license unless otherwise specified. Users should also comply with the license and usage terms of the base model, vision encoder, tokenizer, and training data.
Base model: liuhaotian/llava-v1.5-7b