Qwen3.5-4B-AWQ

AWQ INT4 quantization of Qwen/Qwen3.5-4B using llm-compressor and 512 OpenPlatypus calibration samples.

2.56× smaller on disk and ~61.7% lower VRAM usage at load than the BF16 baseline. Standalone benchmark scores for the quantized model are reported below.


Model compression

Memory & storage reduction

                 BF16 baseline    AWQ INT4 (this model)
Model size       ~8.0 GB          ~3.13 GB (2.56× smaller)
VRAM at load     ~8.0 GB          ~3.06 GB (2.61× smaller)
Bits / weight    16               4 (4× fewer)
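
The ratios in the table follow directly from the reported sizes. A minimal sketch of that arithmetic, using only the figures above (the function and variable names are illustrative):

```python
# Figures copied from the table above, in GB.
BF16_DISK_GB = 8.0
INT4_DISK_GB = 3.13
BF16_VRAM_GB = 8.0
INT4_VRAM_GB = 3.06


def ratio(baseline, quantized):
    """How many times smaller the quantized figure is than the baseline."""
    return baseline / quantized


def percent_reduction(baseline, quantized):
    """Relative saving versus the baseline, in percent."""
    return (1.0 - quantized / baseline) * 100.0


disk_ratio = ratio(BF16_DISK_GB, INT4_DISK_GB)
vram_ratio = ratio(BF16_VRAM_GB, INT4_VRAM_GB)
vram_saving = percent_reduction(BF16_VRAM_GB, INT4_VRAM_GB)

print(f"disk: {disk_ratio:.2f}x smaller")
print(f"VRAM: {vram_ratio:.2f}x smaller ({vram_saving:.2f}% lower)")
```

The on-disk ratio sits below the nominal 4× implied by bits per weight. Plausible contributors are the per-group scales and zero points and any tensors left unquantized, although this card does not break the difference down.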

Benchmarks

Note: these are the quantized model's standalone scores from EleutherAI lm-evaluation-harness, default settings, 0-shot. HellaSwag and ARC-Challenge use acc_norm; PIQA, Winogrande, and ARC-Easy use acc, matching each task's harness default. A matched BF16-vs-INT4 delta on identical hardware and settings has not yet been run for this model; treat the scores below as standalone results rather than a verified quantization delta.

Benchmark        Metric      Score
PIQA             acc         77.69
Winogrande       acc         68.75
HellaSwag        acc_norm    71.65
ARC-Easy         acc         73.53
ARC-Challenge    acc_norm    51.71

Average score: 68.67%
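
The average is the unweighted mean of the five scores. The sketch below recomputes it and outlines one way the run might be reproduced. It assumes lm-evaluation-harness v0.4 or later, where `lm_eval.simple_evaluate` is available and result keys take the form `acc,none`. The helper names are hypothetical, and the exact command used for this card is not stated.

```python
# Task -> metric pairs as listed in the table above
# (task names follow lm-evaluation-harness conventions).
TASK_METRICS = {
    "piqa": "acc",
    "winogrande": "acc",
    "hellaswag": "acc_norm",
    "arc_easy": "acc",
    "arc_challenge": "acc_norm",
}

# Scores reported in this card, in percent.
REPORTED = {
    "piqa": 77.69,
    "winogrande": 68.75,
    "hellaswag": 71.65,
    "arc_easy": 73.53,
    "arc_challenge": 51.71,
}


def average(scores):
    """Unweighted mean over tasks."""
    return sum(scores.values()) / len(scores)


def run_eval(model_id):
    """Sketch of a 0-shot run; not necessarily the command used for this card."""
    # Imported lazily so the helpers above work without the harness installed.
    import lm_eval

    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=list(TASK_METRICS),
        num_fewshot=0,
    )
    # Assumption: v0.4-style result keys such as "acc,none" / "acc_norm,none".
    return {
        task: 100.0 * out["results"][task][f"{metric},none"]
        for task, metric in TASK_METRICS.items()
    }


print(f"average: {average(REPORTED):.2f}")
```

Running `run_eval` once on the BF16 base model and once on this checkpoint, on the same hardware, would give the matched delta that the note above says has not yet been measured.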



Quantization recipe

Setting                Value
Method                 AWQ
Scheme                 W4A16_ASYM
Group size             128
Zero point             True
Calibration dataset    OpenPlatypus, 512 samples
Max sequence length    1024
Tool                   llm-compressor
Format                 compressed-tensors
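
To illustrate what "W4A16_ASYM, group size 128, zero point" means at the storage level, here is a minimal pure-Python sketch of asymmetric 4-bit quantization of one weight group. It shows only the round-to-grid step; AWQ's activation-aware scaling is not modeled. The function names and the storage assumptions in `effective_bits` (one 16-bit scale and one 4-bit zero point per group) are illustrative, not taken from llm-compressor.

```python
import math

GROUP_SIZE = 128  # from the recipe table
N_BITS = 4        # the "W4" in W4A16


def quantize_group(weights, n_bits=N_BITS):
    """Asymmetric uniform quantization of one weight group.

    Returns (codes, scale, zero_point) with codes in [0, 2**n_bits - 1].
    """
    qmax = (1 << n_bits) - 1
    # Include 0.0 in the range so the zero point is itself a valid code.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:  # all-zero group: any positive scale works
        scale = 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def effective_bits(group_size=GROUP_SIZE, n_bits=N_BITS, scale_bits=16, zp_bits=4):
    """Bits per weight including per-group metadata (assumed layout)."""
    return n_bits + (scale_bits + zp_bits) / group_size


# Synthetic group of weights, purely for demonstration.
weights = [0.05 * math.sin(i) for i in range(GROUP_SIZE)]
codes, scale, zero_point = quantize_group(weights)
restored = dequantize_group(codes, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(f"scale={scale:.5f} zero_point={zero_point} max_err={max_err:.5f}")
print(f"effective bits/weight: {effective_bits():.3f}")
```

Under these assumptions the per-group metadata adds a fraction of a bit per weight, which is one reason a 4-bit checkpoint is not exactly a quarter of the 16-bit size.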

Calibration used real instruction-following data from OpenPlatypus rather than data-free quantization techniques.
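
For readers who want to re-quantize on their own data, the recipe above could be expressed roughly as follows. This is a hypothetical reconstruction, not the script used for this model. It assumes a recent llm-compressor release that exposes `oneshot` and `AWQModifier`, that `"open_platypus"` is the library's registered name for the dataset, and that `lm_head` is left unquantized. None of these are confirmed by this card.

```python
# Settings taken from the recipe table above.
RECIPE = {
    "scheme": "W4A16_ASYM",
    "num_calibration_samples": 512,
    "max_seq_length": 1024,
}


def quantize(model_id="Qwen/Qwen3.5-4B", output_dir="Qwen3.5-4B-AWQ"):
    """Calibrate and quantize in one shot (sketch under the stated assumptions)."""
    # Imported lazily so the settings can be inspected without
    # llm-compressor installed.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQModifier

    recipe = [
        AWQModifier(
            scheme=RECIPE["scheme"],
            targets=["Linear"],
            ignore=["lm_head"],  # assumption: output head kept in BF16
        )
    ]
    oneshot(
        model=model_id,
        dataset="open_platypus",  # assumption: registered dataset name
        recipe=recipe,
        num_calibration_samples=RECIPE["num_calibration_samples"],
        max_seq_length=RECIPE["max_seq_length"],
        output_dir=output_dir,  # hypothetical output path
    )
```

Swapping the dataset for a domain-specific or non-English corpus is the main change needed to calibrate on data closer to a target workload.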


Usage

With transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "kumar2235/Qwen3.5-4B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# device_map="auto" places the model on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = "Explain machine learning in one paragraph."

# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(
    prompt,
    return_tensors="pt"
).to(model.device)

# Generate up to 256 new tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=256
)

# The decoded text includes the prompt followed by the completion.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With vLLM

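from vllm import LLM, SamplingParams

# Load the quantized checkpoint from the Hugging Face Hub.
llm = LLM(
    model="kumar2235/Qwen3.5-4B-AWQ"
)

# generate() takes a list of prompts and returns one result per prompt.
outputs = llm.generate(
    ["Explain machine learning in one paragraph."],
    SamplingParams(
        temperature=0.7,
        max_tokens=256
    )
)

# Print the first completion for the first prompt.
print(outputs[0].outputs[0].text)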

Sample output

Prompt:

Explain machine learning in one paragraph.

Response:

Machine learning is a branch of artificial intelligence that enables computers to learn patterns from data and improve their performance on tasks without being explicitly programmed for every situation. By analyzing large amounts of information, machine learning models can make predictions, classify data, recognize patterns, and support decision-making. It powers applications such as recommendation systems, image recognition, language translation, fraud detection, and autonomous systems.

Hardware

Component              Specification
GPU (calibration)      NVIDIA RTX 6000 Ada
GPU Memory             49 GB
CUDA                   13.2
Quantization tool      llm-compressor
Quantization method    AWQ W4A16_ASYM

  • Weights: ~3.13 GB on disk, ~3.06 GB VRAM at load
  • Single-GPU friendly: comfortably fits on 8 GB+ consumer cards for local inference and edge deployment

Limitations

  • Benchmarks above are standalone scores for the quantized model; they have not yet been diffed against a BF16 run under identical harness settings, so the true accuracy delta from quantization is not yet confirmed
  • Calibration set was OpenPlatypus (English-leaning instruction data), so heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
  • Max sequence length used during calibration was 1024 tokens; behavior at much longer contexts has not been separately validated

License

Inherits the license of the base model. See the Qwen/Qwen3.5-4B model page for terms.


Citation

Base model

@misc{qwen3.5-4b,
    title  = {{Qwen3.5-4B}},
    author = {{Qwen Team}},
    year   = {2025},
    url    = {https://huggingface.co/Qwen/Qwen3.5-4B}
}

Quantization method

@article{lin2023awq,
    title   = {{AWQ}: Activation-aware Weight Quantization for LLM Compression and Acceleration},
    author  = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
    journal = {arXiv preprint arXiv:2306.00978},
    year    = {2023}
}

Storage Format

This model uses the compressed-tensors format.

Hugging Face may display BF16/I32/I64 tensor types because compressed AWQ models store quantization metadata, scales, and packed weights separately. The model loads and runs as a compressed AWQ INT4 model through Transformers and llm-compressor.
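
One plausible reading of the I32 tensor type is that 4-bit codes are packed eight to a 32-bit word. The sketch below shows that idea in pure Python. The low-nibble-first order and the helper names are assumptions for illustration and may not match the actual compressed-tensors layout.

```python
NIBBLES_PER_WORD = 8  # 32 bits / 4 bits per code
MASK = 0xF            # largest 4-bit code


def pack_int4(codes):
    """Pack 4-bit codes into 32-bit words, lowest nibble first (assumed order)."""
    if len(codes) % NIBBLES_PER_WORD != 0:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for start in range(0, len(codes), NIBBLES_PER_WORD):
        word = 0
        for i, code in enumerate(codes[start:start + NIBBLES_PER_WORD]):
            if not 0 <= code <= MASK:
                raise ValueError("codes must fit in 4 bits")
            word |= code << (4 * i)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the 4-bit codes from each word."""
    return [
        (word >> (4 * i)) & MASK
        for word in words
        for i in range(NIBBLES_PER_WORD)
    ]


packed = pack_int4([1, 2, 3, 4, 5, 6, 7, 8])
print([hex(w) for w in packed])
print(unpack_int4(packed))
```

Python integers are unbounded, so the words here are unsigned values; an on-disk int32 tensor would hold the same bit pattern reinterpreted as signed.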
