Qwen3.5-4B-AWQ
AWQ INT4 quantization of Qwen/Qwen3.5-4B using llm-compressor and 512 OpenPlatypus calibration samples. The checkpoint is 2.56× smaller on disk and uses ~61.7% less VRAM at load than the BF16 baseline, while maintaining strong benchmark performance.
Model compression
| Metric | BF16 baseline | AWQ INT4 (this model) |
|---|---|---|
| Model size | ~8.0 GB | ~3.13 GB (2.56x smaller) |
| VRAM at load | ~8.0 GB | ~3.06 GB (2.61x smaller) |
| Bits / weight | 16 | 4 (4x fewer) |
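The ratios in the table follow directly from the reported sizes. The short sketch below recomputes them; the helper names `compression_ratio` and `percent_reduction` are illustrative and not part of any library, and the inputs are the approximate figures from the table.

```python
# Approximate figures copied from the compression table (in GB).
BF16_DISK_GB = 8.0
INT4_DISK_GB = 3.13
BF16_VRAM_GB = 8.0
INT4_VRAM_GB = 3.06


def compression_ratio(baseline: float, compressed: float) -> float:
    """How many times smaller the compressed size is than the baseline."""
    return baseline / compressed


def percent_reduction(baseline: float, compressed: float) -> float:
    """Percentage saved relative to the baseline."""
    return 100.0 * (baseline - compressed) / baseline


disk_ratio = compression_ratio(BF16_DISK_GB, INT4_DISK_GB)
vram_ratio = compression_ratio(BF16_VRAM_GB, INT4_VRAM_GB)
vram_saving = percent_reduction(BF16_VRAM_GB, INT4_VRAM_GB)

print(f"disk: {disk_ratio:.2f}x smaller")
print(f"VRAM: {vram_ratio:.2f}x smaller ({vram_saving:.2f}% lower)")
```

The on-disk ratio is lower than the 4x reduction in bits per weight because group scales, zero points, and any layers left unquantized still take space.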
Benchmarks
Note: these are the quantized model's standalone scores from EleutherAI lm-evaluation-harness, default settings, 0-shot. HellaSwag and ARC-Challenge use acc_norm; PIQA, Winogrande, and ARC-Easy use acc, matching each task's harness default. A matched BF16-vs-INT4 delta on identical hardware and settings has not yet been run for this model; treat the scores below as standalone results rather than a verified quantization delta.
| Benchmark | Metric | Score |
|---|---|---|
| PIQA | acc | 77.69 |
| Winogrande | acc | 68.75 |
| HellaSwag | acc_norm | 71.65 |
| ARC-Easy | acc | 73.53 |
| ARC-Challenge | acc_norm | 51.71 |
Average score: 68.67%
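The average is an unweighted mean over the five tasks, mixing acc and acc_norm as reported. A minimal check using the scores from the table above:

```python
# Scores copied from the benchmark table (percent).
scores = {
    "PIQA": 77.69,
    "Winogrande": 68.75,
    "HellaSwag": 71.65,
    "ARC-Easy": 73.53,
    "ARC-Challenge": 51.71,
}

# Unweighted (macro) average: every task counts equally.
average = sum(scores.values()) / len(scores)
print(f"Average score: {average:.2f}%")  # Average score: 68.67%
```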
Quantization recipe
| Setting | Value |
|---|---|
| Method | AWQ |
| Scheme | W4A16_ASYM |
| Group size | 128 |
| Zero point | True |
| Calibration dataset | OpenPlatypus, 512 samples |
| Max sequence length | 1024 |
| Tool | llm-compressor |
| Format | compressed-tensors |
Calibration used real instruction-following data from OpenPlatypus rather than data-free quantization techniques.
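To make the recipe settings concrete, the sketch below shows what group-wise asymmetric 4-bit rounding with a zero point looks like on a single weight row. It is a pure-Python illustration under stated assumptions, not llm-compressor's implementation: the function names are hypothetical, and AWQ's activation-aware channel scaling, which is applied before rounding, is not shown.

```python
import random

GROUP_SIZE = 128            # "Group size" from the recipe table
NUM_BITS = 4                # W4: 4-bit weights
QMAX = 2**NUM_BITS - 1      # asymmetric codes span 0..15


def quantize_group(weights):
    """Asymmetric quantization of one group: (codes, scale, zero_point)."""
    # Force the range to include 0.0 so the zero point is a valid code.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / QMAX
    if scale == 0.0:        # all-zero group: any non-zero scale works
        scale = 1.0
    zero_point = round(-w_min / scale)
    codes = [max(0, min(QMAX, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row):
    """Split a row into groups of GROUP_SIZE and quantize each one."""
    groups = [row[i:i + GROUP_SIZE] for i in range(0, len(row), GROUP_SIZE)]
    return [quantize_group(g) for g in groups]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]   # synthetic weights
quantized = quantize_row(row)
restored = [v for q in quantized for v in dequantize_group(*q)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(quantized)}, max abs error: {max_error:.6f}")
```

Each group carries its own scale and zero point, so the rounding error stays within about half a quantization step of that group's range. The asymmetric variant uses the zero point to cover weight ranges that are not centered on zero.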
Usage
With transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "kumar2235/Qwen3.5-4B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# device_map="auto" places the model on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Explain machine learning in one paragraph."

# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# The decoded text contains the prompt followed by the generated continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
With vLLM
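```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint for offline (in-process) inference.
llm = LLM(
    model="kumar2235/Qwen3.5-4B-AWQ",
)

outputs = llm.generate(
    ["Explain machine learning in one paragraph."],
    SamplingParams(
        temperature=0.7,
        max_tokens=256,
    ),
)

# One result per prompt; take the first completion of the first prompt.
print(outputs[0].outputs[0].text)
```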
Sample output
Prompt:
Explain machine learning in one paragraph.
Response:
Machine learning is a branch of artificial intelligence that enables computers to learn patterns from data and improve their performance on tasks without being explicitly programmed for every situation. By analyzing large amounts of information, machine learning models can make predictions, classify data, recognize patterns, and support decision-making. It powers applications such as recommendation systems, image recognition, language translation, fraud detection, and autonomous systems.
Hardware
| Component | Specification |
|---|---|
| GPU (calibration) | NVIDIA RTX 6000 Ada |
| GPU Memory | 49 GB |
| CUDA | 13.2 |
| Quantization tool | llm-compressor |
| Quantization method | AWQ W4A16_ASYM |
- Weights: ~3.13 GB on disk, ~3.06 GB VRAM at load
- Single-GPU friendly: comfortably fits on 8 GB+ consumer cards for local inference and edge deployment
Limitations
- Benchmarks above are standalone scores for the quantized model; they have not yet been diffed against a BF16 run under identical harness settings, so the true accuracy delta from quantization is not yet confirmed
- Calibration set was OpenPlatypus (English-leaning instruction data); heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
- Max sequence length used during calibration was 1024 tokens; behavior at much longer contexts has not been separately validated
License
Inherits the license of the base model. See the Qwen/Qwen3.5-4B model page for terms.
Citation
Base model
```bibtex
@misc{qwen3.5-4b,
  title  = {{Qwen3.5-4B}},
  author = {{Qwen Team}},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.5-4B}
}
```
Quantization method
```bibtex
@article{lin2023awq,
  title   = {{AWQ}: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author  = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  journal = {arXiv preprint arXiv:2306.00978},
  year    = {2023}
}
```
Storage Format
This model uses the compressed-tensors format.
Hugging Face may display BF16/I32/I64 tensor types because compressed AWQ models store quantization metadata, scales, and packed weights separately. The model loads and runs as a compressed AWQ INT4 model through Transformers and llm-compressor.
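The I32 tensors appear because several 4-bit codes are stored together in each 32-bit word. The sketch below illustrates the idea with a lowest-nibble-first layout; that ordering and the function names are assumptions for illustration, and the exact bit layout used by compressed-tensors is not claimed here.

```python
VALUES_PER_WORD = 32 // 4   # eight 4-bit codes fit in one 32-bit word


def pack_int4(codes):
    """Pack 4-bit codes (0..15) into 32-bit words, lowest nibble first."""
    if len(codes) % VALUES_PER_WORD:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for start in range(0, len(codes), VALUES_PER_WORD):
        word = 0
        for offset, code in enumerate(codes[start:start + VALUES_PER_WORD]):
            if not 0 <= code <= 15:
                raise ValueError(f"code out of 4-bit range: {code}")
            word |= code << (4 * offset)
        words.append(word)
    return words


def unpack_int4(words):
    """Recover the 4-bit codes from packed 32-bit words."""
    return [
        (word >> (4 * offset)) & 0xF
        for word in words
        for offset in range(VALUES_PER_WORD)
    ]


codes = [3, 15, 0, 7, 8, 1, 12, 5] * 2
packed = pack_int4(codes)
assert unpack_int4(packed) == codes   # packing is lossless
print(len(codes), "codes ->", len(packed), "words")  # 16 codes -> 2 words
```

Here the words are plain unsigned Python integers; a real I32 tensor holds the same bit patterns reinterpreted as signed values. The BF16 tensors listed alongside typically hold per-group scales and any unquantized layers.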

