LLaVA-1.5-7B-GRACE-W4G128-AWQ

This repository provides the AWQ-packed INT4 deployment checkpoint of our GRACE-trained LLaVA-1.5-7B model. The language-model weights are stored as real packed 4-bit tensors in the AutoAWQ GEMM layout (qweight, qzeros, and scales), rather than as fake-quantized BF16 tensors.

This model is associated with our ICML 2026 paper:

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
Accepted to the International Conference on Machine Learning (ICML 2026)

Paper: https://arxiv.org/abs/2601.22709
DOI: https://doi.org/10.48550/arXiv.2601.22709
Code: https://github.com/ForeverBlue816/GRACE
Model repository: https://huggingface.co/ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ

Model Summary

Backbone: LLaVA-1.5-7B
Language model: Vicuna-7B-v1.5
Vision encoder: CLIP ViT-L/14-336
Method: GRACE: Gated Relational Alignment via Confidence-based Distillation
Quantization: W4G128 group-wise INT4 quantization-aware training, packed into the AutoAWQ GEMM format
Training data: ShareGPT4V-style multimodal instruction data
Evaluation protocol: LLaVA-style multimodal evaluation
Recommended loader: GRACE / LLaVA-1.5 codebase with AWQ reconstruction utilities

This repository is intended for research on efficient vision-language models, low-bit quantization, quantization-aware training, and multimodal knowledge distillation.

Important Note

This is not a standard drop-in Transformers AWQ checkpoint. The language-model linear layers are stored as packed AWQ tensors and must be reconstructed with the GRACE quantization-aware loading code.

A plain call such as:

AutoModel.from_pretrained("ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ")

will not correctly reconstruct the INT4 layers. Please use the loading procedure shown below.

Model Zoo

Model	Backbone	Bits	Group size	Description	HF Hub
Qwen3-VL-2B-GRACE-BF16	Qwen3-VL-2B	BF16	—	Full-precision GRACE checkpoint used as the student initialization for the Qwen3-VL W8/W4 runs.	ForeverBlue/Qwen3-VL-2B-GRACE-BF16
Qwen3-VL-2B-GRACE-W8G128	Qwen3-VL-2B	INT8	128	INT8 QAT checkpoint with group size 128.	ForeverBlue/Qwen3-VL-2B-GRACE-W8G128
Qwen3-VL-2B-GRACE-W4G128	Qwen3-VL-2B	INT4	128	INT4 QAT checkpoint with group size 128.	ForeverBlue/Qwen3-VL-2B-GRACE-W4G128
LLaVA-1.5-7B-GRACE-W4G128	LLaVA-1.5-7B	INT4	128	QAT checkpoint with BF16 weights constrained to the INT4 grid and a quantized-weight sidecar.	ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128
LLaVA-1.5-7B-GRACE-W4G128-AWQ	LLaVA-1.5-7B	INT4	128	This repository. Real AWQ-packed deployment build with `qweight`, `qzeros`, and `scales`.	ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ

The LLaVA-1.5-7B-GRACE-W4G128 repository contains the QAT checkpoint, while this repository contains the same model packed into real 4-bit AWQ tensors for deployment and storage-efficient inference.

Quantization Details

This checkpoint uses W4G128 group-wise INT4 quantization-aware training and is packed into the AutoAWQ GEMM tensor format.

Weight precision: INT4
Grouping: group size 128
QAT scheme: symmetric signed per-group Learned Step Size quantization
Integer code range: [-8, 7]
AWQ representation: the symmetric QAT codes are represented in the AWQ GEMM layout using a constant zero-point offset
Packed tensors: qweight, qzeros, and scales
Quantized modules: language-model linear layers only
- self_attn.q_proj
- self_attn.k_proj
- self_attn.v_proj
- self_attn.o_proj
- mlp.gate_proj
- mlp.up_proj
- mlp.down_proj
Kept in FP16: CLIP vision tower, multimodal projector, token embeddings, LM head, and normalization layers
Footprint: reduced from approximately 14.2 GB in BF16 to approximately 4.6 GB in packed INT4 format
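
As a rough sanity check on the footprint figures, the sketch below estimates the storage of the quantized linear layers alone. It uses the LLaMA-7B shapes of Vicuna-7B-v1.5 (hidden size 4096, MLP size 11008, 32 decoder layers) and assumes one FP16 scale plus one 4-bit zero point per group of 128 weights. The per-group overhead and the helper names are assumptions for illustration, not values read from this repository.

```python
# Rough storage estimate for the quantized language-model linear layers.
# Assumptions (not taken from this repository): one FP16 scale and one
# 4-bit zero point are stored per group of 128 weights.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32
BITS, GROUP_SIZE = 4, 128


def effective_bits_per_weight(bits=BITS, group_size=GROUP_SIZE,
                              scale_bits=16, zero_bits=4):
    """Weight bits plus per-group metadata amortized over the group."""
    return bits + (scale_bits + zero_bits) / group_size


def quantized_linear_params(hidden=HIDDEN, intermediate=INTERMEDIATE,
                            layers=LAYERS):
    """Parameters in q/k/v/o_proj and gate/up/down_proj across all layers."""
    attention = 4 * hidden * hidden
    mlp = 3 * hidden * intermediate
    return layers * (attention + mlp)


params = quantized_linear_params()
bf16_gb = params * 16 / 8 / 1e9
int4_gb = params * effective_bits_per_weight() / 8 / 1e9
print(f"linear params: {params / 1e9:.2f}B")
print(f"BF16: {bf16_gb:.2f} GB, packed INT4: {int4_gb:.2f} GB")
```

The modules kept in 16-bit precision are not included in this estimate, so the whole-checkpoint totals are larger than the numbers it prints.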

Specialized INT4 kernels such as autoawq-kernels can be used for practical inference acceleration. Without fused kernels, the GRACE loader can still reconstruct and run the model through a correct dequantization path, although it may be slower.
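
To illustrate how symmetric signed codes can be carried in a zero-point-based layout, the sketch below uses NumPy and plain round-to-nearest quantization instead of the learned step sizes used in GRACE. It assumes the constant offset is 8, so that codes in [-8, 7] map to unsigned values in [0, 15]. The offset value, the function names, and the unpacked array layout are illustrative assumptions; the sketch does not reproduce the AutoAWQ bit-packing order or the GRACE loader.

```python
import numpy as np

GROUP_SIZE = 128
ZERO_POINT = 8  # assumed constant offset: signed [-8, 7] -> unsigned [0, 15]


def quantize_symmetric(weights, group_size=GROUP_SIZE):
    """Round-to-nearest symmetric INT4 quantization per group (illustrative)."""
    groups = weights.reshape(-1, group_size)
    # One scale per group; the small floor avoids division by zero.
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / 7.0, 1e-8)
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales


def to_unsigned(codes, zero_point=ZERO_POINT):
    """Shift signed codes into the unsigned range used with a zero point."""
    return (codes.astype(np.int16) + zero_point).astype(np.uint8)


def dequantize(unsigned_codes, scales, zero_point=ZERO_POINT):
    """Recover approximate weights: (code - zero_point) * scale."""
    return (unsigned_codes.astype(np.float32) - zero_point) * scales


rng = np.random.default_rng(0)
w = rng.standard_normal(4 * GROUP_SIZE).astype(np.float32)
codes, scales = quantize_symmetric(w)
u = to_unsigned(codes)
w_hat = dequantize(u, scales).reshape(w.shape)
print("unsigned code range:", int(u.min()), int(u.max()))
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Because the zero point is the same constant for every group, subtracting it during dequantization recovers the original signed code exactly, and the only error left is the rounding error of the quantizer itself.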

Repository Files

This repository contains the following main files:

config.json: model configuration. The mm_vision_tower field should point to openai/clip-vit-large-patch14-336.
model.safetensors: checkpoint file containing AWQ-packed tensors for the quantized language-model linear layers and FP16 tensors for the remaining modules.
awq_quantized_modules.json: metadata listing the AWQ-packed module names, bit width, and group size required by the GRACE loader.
tokenizer.model: SentencePiece tokenizer model.
tokenizer_config.json: tokenizer configuration.
special_tokens_map.json: special-token mapping.
generation_config.json: generation configuration.
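
One way to confirm that a checkpoint is stored in packed form is to look for modules that carry all three AWQ tensors. The helper below is a hypothetical sketch that works on any iterable of tensor names. The example key names are made up for illustration, and the actual names in model.safetensors may differ.

```python
from collections import defaultdict

AWQ_SUFFIXES = ("qweight", "qzeros", "scales")


def find_packed_modules(tensor_names):
    """Return sorted module prefixes that have qweight, qzeros, and scales."""
    seen = defaultdict(set)
    for name in tensor_names:
        prefix, _, suffix = name.rpartition(".")
        if suffix in AWQ_SUFFIXES:
            seen[prefix].add(suffix)
    return sorted(p for p, s in seen.items() if s == set(AWQ_SUFFIXES))


# Illustrative names only; not read from the real checkpoint.
example_names = [
    "model.layers.0.self_attn.q_proj.qweight",
    "model.layers.0.self_attn.q_proj.qzeros",
    "model.layers.0.self_attn.q_proj.scales",
    "model.layers.0.input_layernorm.weight",
    "model.embed_tokens.weight",
]
print(find_packed_modules(example_names))
```

With the safetensors library, the tensor names can be listed without loading any tensor data by opening the file with `safe_open(path, framework="pt")` and calling its `keys()` method.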

Loading

This checkpoint should be loaded through the GRACE / LLaVA-1.5 codebase.

git clone https://github.com/ForeverBlue816/GRACE
cd GRACE/deployment

Download the checkpoint from Hugging Face:

from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download("ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ")
print(ckpt_dir)

Run one-shot inference with the provided deployment script:

python scripts/deploy_awq_llava.py \
    --load-packed /path/to/LLaVA-1.5-7B-GRACE-W4G128-AWQ \
    --image-file your_image.jpg \
    --query "Describe this image in detail." \
    --conv-mode vicuna_v1

Programmatic loading:

import os
import glob
import json
from safetensors.torch import load_file

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.quantize import build_awq_skeleton

ckpt_dir = "/path/to/LLaVA-1.5-7B-GRACE-W4G128-AWQ"

# Read the AWQ-packed module names, bit width, and group size.
with open(os.path.join(ckpt_dir, "awq_quantized_modules.json")) as f:
    meta = json.load(f)

tokenizer, model, image_processor, context_len = load_pretrained_model(
    ckpt_dir,
    None,
    get_model_name_from_path(ckpt_dir),
    device_map="cuda",
    device="cuda",
)

build_awq_skeleton(
    model,
    meta["modules"],
    bits=meta["bits"],
    group_size=meta["group_size"],
    device="cuda",
)

state_dict = {}
for path in glob.glob(os.path.join(ckpt_dir, "*.safetensors")):
    state_dict.update(load_file(path))

prefixes = tuple(name + "." for name in meta["modules"])
awq_state_dict = {
    key: value
    for key, value in state_dict.items()
    if key.startswith(prefixes)
}

missing, unexpected = model.load_state_dict(awq_state_dict, strict=False)
model.eval()
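
Because strict=False suppresses key mismatches, it can help to inspect the key lists returned by load_state_dict. The following helper is a hypothetical sketch and not part of the GRACE codebase. It assumes that keys reported as missing for non-quantized modules are expected, since only the AWQ subset of the state dict is passed in, and it flags packed tensors that were either not provided or not consumed. The key names in the example call are made up for illustration.

```python
def check_awq_load(missing, unexpected, module_names):
    """Return keys that suggest the packed layers were not loaded correctly."""
    prefixes = tuple(name + "." for name in module_names)
    # Missing keys outside the AWQ modules are expected: only the packed
    # subset of the checkpoint is passed to load_state_dict.
    missing_awq = [key for key in missing if key.startswith(prefixes)]
    return {"missing_awq": missing_awq, "unexpected": list(unexpected)}


# Illustrative key names only.
problems = check_awq_load(
    missing=["model.embed_tokens.weight", "model.layers.0.mlp.up_proj.scales"],
    unexpected=[],
    module_names=["model.layers.0.mlp.up_proj"],
)
print(problems)
```

Applied to the loading code above, the call would be check_awq_load(missing, unexpected, meta["modules"]), and both returned lists would be expected to be empty after a successful load.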

The CLIP vision tower is resolved from the mm_vision_tower field in config.json. By default, this field should be:

"mm_vision_tower": "openai/clip-vit-large-patch14-336"

Evaluation

This checkpoint follows the LLaVA-style multimodal evaluation protocol and is evaluated with greedy decoding. Representative benchmarks include:

VQAv2
GQA
TextVQA
POPE
MME
ScienceQA
SEED-Bench
MMBench

This AWQ-packed checkpoint is a deployment-oriented repacking of the GRACE-trained LLaVA-1.5-7B W4G128 model. It is intended to reproduce the INT4 LLaVA-1.5 results reported in the GRACE paper, subject to the same evaluation codebase, preprocessing, generation settings, and benchmark versions.

Please refer to the paper and the GRACE repository for the full experimental setup and benchmark results.

Intended Use

This model is intended for research and development in:

Efficient vision-language models
Low-bit VLM quantization
Quantization-aware training
Multimodal knowledge distillation
Storage-efficient and deployment-oriented VLM inference
Comparisons with FP16, INT8, PTQ, AWQ, GPTQ, and other compression methods

Out-of-Scope Use

This model is not intended for safety-critical or high-stakes applications, including but not limited to medical, legal, financial, or security-sensitive decision-making. The model may produce hallucinated, biased, or incorrect outputs and should be evaluated carefully before deployment.

Training Data

The model was trained using ShareGPT4V-style multimodal instruction data.

Dataset:

Lin-Chen/ShareGPT4V

The training setup follows a LLaVA-style multimodal instruction-tuning and evaluation pipeline.

Limitations

This checkpoint requires the GRACE quantization-aware loading code.
It is not a standard drop-in Transformers AWQ checkpoint.
Runtime speed depends on the availability of optimized INT4 kernels.
Performance may vary depending on preprocessing, prompt templates, decoding settings, and benchmark implementations.
Users should comply with the license and usage terms of the original LLaVA-1.5, Vicuna, CLIP, and training-data sources.

Citation

If you use this model, please cite the corresponding GRACE paper:

@article{chen2026gated,
  title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
  author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
  journal={arXiv preprint arXiv:2601.22709},
  year={2026}
}

Please also cite the original LLaVA and Vicuna works when using this model.

License

This model is released under the Apache-2.0 license unless otherwise specified. Users should also comply with the license and usage terms of the base model, vision encoder, tokenizer, and training data.