LLaVA-1.5-7B-GRACE-W4G128-AWQ

This repository provides the AWQ-packed INT4 deployment checkpoint of our GRACE-trained LLaVA-1.5-7B model. The language-model weights are stored as real packed 4-bit tensors in the AutoAWQ GEMM layout (qweight, qzeros, and scales), rather than as fake-quantized BF16 tensors.

This model is associated with our ICML 2026 paper:

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
Accepted to the International Conference on Machine Learning (ICML 2026)


Model Summary

  • Backbone: LLaVA-1.5-7B
  • Language model: Vicuna-7B-v1.5
  • Vision encoder: CLIP ViT-L/14-336
  • Method: GRACE: Gated Relational Alignment via Confidence-based Distillation
  • Quantization: W4G128 group-wise INT4 quantization-aware training, packed into the AutoAWQ GEMM format
  • Training data: ShareGPT4V-style multimodal instruction data
  • Evaluation protocol: LLaVA-style multimodal evaluation
  • Recommended loader: GRACE / LLaVA-1.5 codebase with AWQ reconstruction utilities

This repository is intended for research on efficient vision-language models, low-bit quantization, quantization-aware training, and multimodal knowledge distillation.


Important Note

This is not a standard drop-in Transformers AWQ checkpoint. The language-model linear layers are stored as packed AWQ tensors and must be reconstructed with the GRACE quantization-aware loading code.

A plain call such as:

AutoModel.from_pretrained("ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ")

will not correctly reconstruct the INT4 layers. Please use the loading procedure shown below.


Model Zoo

| Model | Backbone | Bits | Group size | Description | HF Hub |
| --- | --- | --- | --- | --- | --- |
| Qwen3-VL-2B-GRACE-BF16 | Qwen3-VL-2B | BF16 | - | Full-precision GRACE checkpoint used as the student initialization for the Qwen3-VL W8/W4 runs. | ForeverBlue/Qwen3-VL-2B-GRACE-BF16 |
| Qwen3-VL-2B-GRACE-W8G128 | Qwen3-VL-2B | INT8 | 128 | INT8 QAT checkpoint with group size 128. | ForeverBlue/Qwen3-VL-2B-GRACE-W8G128 |
| Qwen3-VL-2B-GRACE-W4G128 | Qwen3-VL-2B | INT4 | 128 | INT4 QAT checkpoint with group size 128. | ForeverBlue/Qwen3-VL-2B-GRACE-W4G128 |
| LLaVA-1.5-7B-GRACE-W4G128 | LLaVA-1.5-7B | INT4 | 128 | QAT checkpoint with BF16 weights constrained to the INT4 grid and a quantized-weight sidecar. | ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128 |
| LLaVA-1.5-7B-GRACE-W4G128-AWQ | LLaVA-1.5-7B | INT4 | 128 | This repository. Real AWQ-packed deployment build with qweight, qzeros, and scales. | ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ |

The LLaVA-1.5-7B-GRACE-W4G128 repository contains the QAT checkpoint, while this repository contains the same model packed into real 4-bit AWQ tensors for deployment and storage-efficient inference.


Quantization Details

This checkpoint uses W4G128 group-wise INT4 quantization-aware training and is packed into the AutoAWQ GEMM tensor format.

  • Weight precision: INT4
  • Grouping: group size 128
  • QAT scheme: symmetric signed per-group Learned Step Size quantization
  • Integer code range: [-8, 7]
  • AWQ representation: the symmetric QAT codes are represented in the AWQ GEMM layout using a constant zero-point offset
  • Packed tensors: qweight, qzeros, and scales
  • Quantized modules: language-model linear layers only
    • self_attn.q_proj
    • self_attn.k_proj
    • self_attn.v_proj
    • self_attn.o_proj
    • mlp.gate_proj
    • mlp.up_proj
    • mlp.down_proj
  • Kept in FP16: CLIP vision tower, multimodal projector, token embeddings, LM head, and normalization layers
  • Footprint: reduced from approximately 14.2 GB in BF16 to approximately 4.6 GB in the packed INT4 format
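The footprint figures in the list above can be sanity-checked with a back-of-the-envelope calculation. The sketch below is not part of the GRACE codebase. It assumes LLaMA-7B-style dimensions for Vicuna-7B-v1.5, the usual AutoAWQ GEMM tensor shapes (int32 qweight and qzeros holding eight 4-bit codes per word, 16-bit scales per group), and rough parameter counts for the vision tower and projector. None of these numbers are stated in this card, and normalization layers are ignored as negligible.

```python
# Assumed LLaMA-7B-style dimensions for Vicuna-7B-v1.5 (not stated in this card).
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 11008, 32, 32000
GROUP_SIZE, BITS = 128, 4
# Rough, assumed parameter counts for modules kept in 16-bit precision.
VISION_TOWER_PARAMS = 304_000_000  # CLIP ViT-L/14-336 vision encoder, approximate
PROJECTOR_PARAMS = 21_000_000      # multimodal projector, approximate


def packed_linear_bytes(in_features, out_features):
    """Bytes for one linear layer under the assumed AWQ GEMM tensor shapes."""
    pack_factor = 32 // BITS                                   # 8 INT4 codes per int32
    groups = in_features // GROUP_SIZE
    qweight = in_features * (out_features // pack_factor) * 4  # int32 words
    qzeros = groups * (out_features // pack_factor) * 4        # int32 words
    scales = groups * out_features * 2                         # 16-bit floats
    return qweight + qzeros + scales


# (in_features, out_features) for q/k/v/o, gate/up, and down projections.
LINEAR_SHAPES = (
    [(HIDDEN, HIDDEN)] * 4
    + [(HIDDEN, INTERMEDIATE)] * 2
    + [(INTERMEDIATE, HIDDEN)]
)

quantized_params = LAYERS * sum(i * o for i, o in LINEAR_SHAPES)
packed_bytes = LAYERS * sum(packed_linear_bytes(i, o) for i, o in LINEAR_SHAPES)
bits_per_weight = 8 * packed_bytes / quantized_params

# Token embeddings + LM head + vision tower + projector, all stored in 16 bits.
other_params = 2 * VOCAB * HIDDEN + VISION_TOWER_PARAMS + PROJECTOR_PARAMS
bf16_gb = 2 * (quantized_params + other_params) / 1e9
packed_gb = (packed_bytes + 2 * other_params) / 1e9

print(f"{bits_per_weight:.5f} bits per quantized weight")
print(f"BF16: {bf16_gb:.1f} GB, packed INT4: {packed_gb:.1f} GB")
```

Under these assumptions the per-group scales and zero-points add a small overhead on top of the 4 bits per weight, and both totals land within roughly 0.1 GB of the figures quoted above (in decimal gigabytes).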

Specialized INT4 kernels such as autoawq-kernels can be used for practical inference acceleration. Without fused kernels, the GRACE loader can still reconstruct and run the model through a correct dequantization path, although it may be slower.
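To illustrate why a symmetric signed code in [-8, 7] can be stored in an unsigned AWQ-style layout without changing the dequantized weights, here is a minimal pure-Python sketch for a single group. It is not the GRACE loader. The constant zero-point of 8 and the max-based step size are assumptions chosen for illustration, since the actual step sizes are learned during QAT.

```python
import random

GROUP_SIZE = 128
QMIN, QMAX = -8, 7
ZERO_POINT = 8  # assumed constant offset mapping [-8, 7] onto [0, 15]


def quantize_group(weights, step):
    """Symmetric signed quantization of one group to integer codes in [-8, 7]."""
    return [max(QMIN, min(QMAX, round(w / step))) for w in weights]


def to_awq_codes(codes, zero_point=ZERO_POINT):
    """Shift signed codes to the unsigned range used by a packed 4-bit layout."""
    return [q + zero_point for q in codes]


def dequantize_awq(codes, step, zero_point=ZERO_POINT):
    """Asymmetric-style dequantization: (code - zero_point) * scale."""
    return [(u - zero_point) * step for u in codes]


random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(GROUP_SIZE)]
step = max(abs(w) for w in group) / QMAX  # stand-in for a learned step size

signed = quantize_group(group, step)
unsigned = to_awq_codes(signed)
symmetric = [q * step for q in signed]     # what symmetric QAT represents
recovered = dequantize_awq(unsigned, step)  # what the unsigned layout yields

# Both paths give identical weights because (q + 8) - 8 == q.
assert recovered == symmetric
```

Because the offset cancels exactly, the packed representation carries the same per-group values as the symmetric QAT codes. Only the storage convention differs.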


Repository Files

This repository contains the following main files:

  • config.json: model configuration. The mm_vision_tower field should point to openai/clip-vit-large-patch14-336.
  • model.safetensors: checkpoint file containing AWQ-packed tensors for the quantized language-model linear layers and FP16 tensors for the remaining modules.
  • awq_quantized_modules.json: metadata listing the AWQ-packed module names, bit width, and group size required by the GRACE loader.
  • tokenizer.model: SentencePiece tokenizer model.
  • tokenizer_config.json: tokenizer configuration.
  • special_tokens_map.json: special-token mapping.
  • generation_config.json: generation configuration.
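Before loading, it can help to confirm that a downloaded directory contains the files listed above. The helper below is a hypothetical convenience function, not part of the GRACE codebase. It only checks for file presence and does not validate file contents.

```python
import os

# File names taken from the list above.
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "awq_quantized_modules.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "generation_config.json",
]


def missing_checkpoint_files(ckpt_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from ckpt_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(ckpt_dir, name))
    ]
```

An empty return value from `missing_checkpoint_files("/path/to/LLaVA-1.5-7B-GRACE-W4G128-AWQ")` means every listed file is present.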

Loading

This checkpoint should be loaded through the GRACE / LLaVA-1.5 codebase.

git clone https://github.com/ForeverBlue816/GRACE
cd GRACE/deployment

Download the checkpoint from Hugging Face:

from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download("ForeverBlue/LLaVA-1.5-7B-GRACE-W4G128-AWQ")
print(ckpt_dir)

Run one-shot inference with the provided deployment script:

python scripts/deploy_awq_llava.py \
    --load-packed /path/to/LLaVA-1.5-7B-GRACE-W4G128-AWQ \
    --image-file your_image.jpg \
    --query "Describe this image in detail." \
    --conv-mode vicuna_v1
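If the checkpoint was fetched with snapshot_download, the returned directory can be passed to the script in place of the placeholder path. The sketch below assembles the same command as an argument list. The function name `build_deploy_command` is hypothetical, and only the flags shown above are used.

```python
import shlex


def build_deploy_command(ckpt_dir, image_file, query, conv_mode="vicuna_v1"):
    """Build the argument list for the deployment script shown above."""
    return [
        "python", "scripts/deploy_awq_llava.py",
        "--load-packed", ckpt_dir,
        "--image-file", image_file,
        "--query", query,
        "--conv-mode", conv_mode,
    ]


cmd = build_deploy_command(
    "/path/to/LLaVA-1.5-7B-GRACE-W4G128-AWQ",  # e.g. the path returned by snapshot_download
    "your_image.jpg",
    "Describe this image in detail.",
)
print(shlex.join(cmd))  # shell-quoted form, for inspection or logging
```

The list can be handed to `subprocess.run(cmd, check=True)` from inside the GRACE/deployment directory. Passing a list rather than a single string avoids shell-quoting problems with the query text.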

Programmatic loading:

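import glob
import json
import os

from safetensors.torch import load_file

from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model
from llava.quantize import build_awq_skeleton

ckpt_dir = "/path/to/LLaVA-1.5-7B-GRACE-W4G128-AWQ"

# Metadata listing the AWQ-packed module names, bit width, and group size.
with open(os.path.join(ckpt_dir, "awq_quantized_modules.json")) as f:
    meta = json.load(f)

# Build the LLaVA model, tokenizer, and image processor from the checkpoint directory.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    ckpt_dir,
    None,
    get_model_name_from_path(ckpt_dir),
    device_map="cuda",
    device="cuda",
)

# Prepare the listed modules to receive the packed qweight, qzeros, and scales tensors.
build_awq_skeleton(
    model,
    meta["modules"],
    bits=meta["bits"],
    group_size=meta["group_size"],
    device="cuda",
)

# Collect all tensors from the safetensors file(s) in a deterministic order.
state_dict = {}
for path in sorted(glob.glob(os.path.join(ckpt_dir, "*.safetensors"))):
    state_dict.update(load_file(path))

# Keep only the tensors that belong to the AWQ-packed modules.
prefixes = tuple(name + "." for name in meta["modules"])
awq_state_dict = {
    key: value
    for key, value in state_dict.items()
    if key.startswith(prefixes)
}

# Only AWQ tensors are passed in, so `missing` lists every other parameter by design.
# `unexpected` should be empty; a non-empty list means the skeleton does not match.
missing, unexpected = model.load_state_dict(awq_state_dict, strict=False)
if unexpected:
    raise RuntimeError(f"Unexpected AWQ keys, e.g. {unexpected[:5]}")
model.eval()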
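Because the AWQ tensors are loaded with strict=False, a module that lacks one of its packed tensors would not raise an error by itself. The hypothetical helper below, which is not part of the GRACE codebase, checks that every listed module has qweight, qzeros, and scales entries. It assumes the names in meta["modules"] are the same fully qualified prefixes used by the state-dict keys, as the prefix filter above implies. The module names in the demo are made up for illustration.

```python
AWQ_SUFFIXES = ("qweight", "qzeros", "scales")


def missing_awq_tensors(state_dict_keys, module_names, suffixes=AWQ_SUFFIXES):
    """Return '<module>.<suffix>' names that are absent from the state dict."""
    keys = set(state_dict_keys)
    return [
        f"{module}.{suffix}"
        for module in module_names
        for suffix in suffixes
        if f"{module}.{suffix}" not in keys
    ]


# Demo with made-up module names: the second module lacks qzeros and scales.
demo_modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
]
demo_keys = [f"{demo_modules[0]}.{s}" for s in AWQ_SUFFIXES]
demo_keys.append(f"{demo_modules[1]}.qweight")
print(missing_awq_tensors(demo_keys, demo_modules))
```

In the loading script, this would be called as `missing_awq_tensors(awq_state_dict.keys(), meta["modules"])` before `load_state_dict`, and an empty list indicates that every packed module is complete.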

The CLIP vision tower is resolved from the mm_vision_tower field in config.json. By default, this field should be:

"mm_vision_tower": "openai/clip-vit-large-patch14-336"
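A quick way to confirm this field in a local copy of the checkpoint is sketched below. The function name is hypothetical, and the check only compares the stored string against the default value quoted above.

```python
import json
import os

EXPECTED_VISION_TOWER = "openai/clip-vit-large-patch14-336"


def vision_tower_matches(ckpt_dir, expected=EXPECTED_VISION_TOWER):
    """Return True if config.json points mm_vision_tower at the expected model."""
    with open(os.path.join(ckpt_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    return config.get("mm_vision_tower") == expected
```

If the field has been changed, for example to a local path, the function returns False and the value can be inspected or restored before loading.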

Evaluation

This checkpoint follows the LLaVA-style multimodal evaluation protocol and is evaluated with greedy decoding. Representative benchmarks include:

  • VQAv2
  • GQA
  • TextVQA
  • POPE
  • MME
  • ScienceQA
  • SEED-Bench
  • MMBench

This AWQ-packed checkpoint is a deployment-oriented repacking of the GRACE-trained LLaVA-1.5-7B W4G128 model. It is intended to reproduce the INT4 LLaVA-1.5 results reported in the GRACE paper, subject to the same evaluation codebase, preprocessing, generation settings, and benchmark versions.

Please refer to the paper and the GRACE repository for the full experimental setup and benchmark results.


Intended Use

This model is intended for research and development in:

  • Efficient vision-language models
  • Low-bit VLM quantization
  • Quantization-aware training
  • Multimodal knowledge distillation
  • Storage-efficient and deployment-oriented VLM inference
  • Comparisons with FP16, INT8, PTQ, AWQ, GPTQ, and other compression methods

Out-of-Scope Use

This model is not intended for safety-critical or high-stakes applications, including but not limited to medical, legal, financial, or security-sensitive decision-making. The model may produce hallucinated, biased, or incorrect outputs and should be evaluated carefully before deployment.


Training Data

The model was trained using ShareGPT4V-style multimodal instruction data.

Dataset:

  • Lin-Chen/ShareGPT4V

The training setup follows a LLaVA-style multimodal instruction-tuning and evaluation pipeline.


Limitations

  • This checkpoint requires the GRACE quantization-aware loading code.
  • It is not a standard drop-in Transformers AWQ checkpoint.
  • Runtime speed depends on the availability of optimized INT4 kernels.
  • Performance may vary depending on preprocessing, prompt templates, decoding settings, and benchmark implementations.
  • Users should comply with the license and usage terms of the original LLaVA-1.5, Vicuna, CLIP, and training-data sources.

Citation

If you use this model, please cite the corresponding GRACE paper:

@article{chen2026gated,
  title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
  author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
  journal={arXiv preprint arXiv:2601.22709},
  year={2026}
}

Please also cite the original LLaVA and Vicuna works when using this model.


License

This model is released under the Apache-2.0 license unless otherwise specified. Users should also comply with the license and usage terms of the base model, vision encoder, tokenizer, and training data.
