SmolLM2-1.7B-W4A16-Instruct (INT4 Weight-Only Quantized)

A W4A16 (4-bit weights, 16-bit activations) quantized version of HuggingFaceTB/SmolLM2-1.7B-Instruct, produced using llm-compressor with the compressed-tensors format.

W4A16 is a weight-only quantization scheme: weights are stored in INT4 and dequantized to BF16 at runtime during the matrix multiply. The main benefit is a smaller memory footprint and less memory traffic (weights are roughly 4x smaller than in BF16), while compute stays in BF16. This makes W4A16 a good fit for memory-constrained or latency-sensitive single-user scenarios where fitting the model in VRAM is the bottleneck.
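The store-in-INT4, dequantize-at-runtime idea can be illustrated with a minimal sketch. It assumes symmetric quantization with one shared scale per group of weights; the function names and the sample values are hypothetical, and the exact grouping and rounding used by llm-compressor may differ.

```python
def quantize_int4_symmetric(weights):
    """Map a group of floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # The largest magnitude maps to 7; fall back to 1.0 for an all-zero group.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights, as done before the BF16 matrix multiply."""
    return [v * scale for v in q]


# Illustrative weight group (made-up values, not taken from the model).
group = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.28]

q, scale = quantize_int4_symmetric(group)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print("int4 codes:", q)
print("scale:", scale)
print("max abs error:", max_err)
```

The rounding error per weight is bounded by half the scale, which is why smaller groups (and therefore tighter scales) generally reduce quantization error at the cost of storing more scales.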


Model Details

| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Architecture | LlamaForCausalLM |
| Parameters | ~1.7B |
| Quantization scheme | W4A16: INT4 weights, BF16 activations |
| Excluded layers | lm_head (kept in BF16) |
| Format | compressed-tensors (Safetensors) |
| Calibration dataset | ultrachat (512 samples, max_seq_length 2048) |
| Quantization tool | llm-compressor |

W4A16 vs Other Schemes

| Scheme | Weight bits | Activation bits | Memory saving | Compute speedup | Best for |
|---|---|---|---|---|---|
| BF16 (base) | 16 | 16 | - | - | Accuracy baseline |
| W4A16 (this model) | 4 | 16 | ~4x | Memory-bound only | Small GPUs, low-latency single user |
| W8A16 | 8 | 16 | ~2x | Memory-bound only | Mild memory pressure |
| W8A8 | 8 | 8 | ~2x | ✅ Compute (INT8 cores) | High-throughput batched serving |

Key insight: W4A16 typically gives up somewhat more accuracy than W8A8 in exchange for a larger memory reduction (~4x vs ~2x). It does not use INT4 tensor cores at runtime; the dequantize-then-multiply pattern keeps compute in BF16. Choose W4A16 when fitting the model matters more than maximizing throughput.


How to Use

Option 1: vLLM (recommended for serving)

pip install vllm

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"

llm = LLM(
    model=model_id,
    quantization="compressed-tensors",
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the difference between W4A16 and W8A8 quantization?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Option 2: Transformers + compressed-tensors (no vLLM)

Install the compressed-tensors runtime. Transformers auto-detects the quantization config from config.json:

pip install compressed-tensors transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in simple terms."},
]

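inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # return input_ids and attention_mask together
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))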

Note: W4A16 dequantizes weights to BF16 at runtime, so compute stays in BF16. You get a ~4x reduction in weight memory; latency gains depend on whether your workload is memory-bandwidth-bound.
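To make "memory-bandwidth-bound" concrete, here is a rough back-of-the-envelope sketch. It assumes single-stream decoding where each generated token reads every weight once, and it uses a hypothetical bandwidth figure rather than a measurement. It ignores KV-cache reads, dequantization overhead, and kernel efficiency, so treat the numbers as upper bounds only.

```python
def decode_tokens_per_second_bound(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode speed if each token requires one full pass over the weights."""
    return bandwidth_bytes_per_s / weight_bytes


PARAMS = 1.7e9      # approximate parameter count from the model card
BANDWIDTH = 300e9   # hypothetical GPU memory bandwidth (300 GB/s), not a measured value

bf16_bound = decode_tokens_per_second_bound(PARAMS * 2.0, BANDWIDTH)  # 2 bytes per weight
w4_bound = decode_tokens_per_second_bound(PARAMS * 0.5, BANDWIDTH)    # 0.5 bytes per weight

print(f"BF16  bound: {bf16_bound:.0f} tokens/s")
print(f"W4A16 bound: {w4_bound:.0f} tokens/s")
print(f"ratio: {w4_bound / bf16_bound:.1f}x")
```

At large batch sizes the matrix multiplies become compute-bound instead, and because W4A16 still computes in BF16, this bound no longer predicts a speedup.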


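Option 3: llmcompressor (same library used to quantize)

pip install llmcompressor

# Note: recent llm-compressor releases deprecate SparseAutoModelForCausalLM in favor of
# transformers' AutoModelForCausalLM; this import assumes an older release.
from llmcompressor.transformers import SparseAutoModelForCausalLM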
from transformers import AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("Explain INT4 quantization:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

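Option 4: Dequantize to BF16 (no quantization runtime)

If you need to run with plain Transformers and no quantization runtime at inference time, convert the checkpoint once (the conversion step itself needs llmcompressor):

pip install llmcompressor

from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)  # loaded so it can be saved alongside the model

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)
model.save_pretrained("smollm2-bf16-dequantized")
tokenizer.save_pretrained("smollm2-bf16-dequantized")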

Then load smollm2-bf16-dequantized with plain AutoModelForCausalLM. The memory savings are lost, but inference needs no quantization runtime.


Quantization Recipe

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],  # keep output projection in BF16
)

oneshot(
    model=model,
    dataset="ultrachat",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("SmolLM2-1.7B-W4A16-instruct")
tokenizer.save_pretrained("SmolLM2-1.7B-W4A16-instruct")

Evaluation

⚠️ Evaluation pending. Accuracy vs. the BF16 base has not yet been formally benchmarked. Results will be added once lm-evaluation-harness evals complete.

Planned evaluation setup:

# BF16 baseline
lm_eval --model hf \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/baseline

# W4A16
lm_eval --model hf \
  --model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/w4a16

Results table (to be filled):

| Task | BF16 Base | W4A16 (this model) | Delta |
|---|---|---|---|
| HellaSwag (acc_norm) | TBD | TBD | TBD |
| WinoGrande (acc) | TBD | TBD | TBD |
| ARC-Easy (acc_norm) | TBD | TBD | TBD |
| ARC-Challenge (acc_norm) | TBD | TBD | TBD |
| PIQA (acc_norm) | TBD | TBD | TBD |
| WikiText-2 PPL ↓ | TBD | TBD | TBD |
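Once both runs finish, the Delta column could be filled with a small helper along these lines. This is a sketch under assumptions: it expects the results JSON to hold a top-level "results" mapping of task name to metrics, and the metric key names (such as "acc_norm,none") are assumed from recent lm-evaluation-harness versions and should be checked against your actual output files. The function names are hypothetical.

```python
import json


def load_metrics(path):
    """Load the per-task metrics mapping from an lm-eval results JSON file."""
    with open(path) as f:
        return json.load(f)["results"]


def metric_deltas(baseline, quantized, metric_keys):
    """Return {task: (base, quant, quant - base)} for one chosen metric per task."""
    rows = {}
    for task, key in metric_keys.items():
        base = baseline[task][key]
        quant = quantized[task][key]
        rows[task] = (base, quant, quant - base)
    return rows


def format_rows(rows):
    """Render the rows as pipe-separated table lines."""
    return [
        f"| {task} | {base:.4f} | {quant:.4f} | {delta:+.4f} |"
        for task, (base, quant, delta) in rows.items()
    ]
```

Usage would be to call `load_metrics` on the baseline and W4A16 result files, pass both to `metric_deltas` with a mapping such as `{"hellaswag": "acc_norm,none"}`, and print the lines from `format_rows`. For perplexity, a positive delta means the quantized model is worse.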

Limitations

  • Weight-only quantization: activations remain in BF16, and INT4 tensor cores are not used. Runtime compute is BF16.
  • Static calibration on ultrachat: accuracy may degrade on domains far from the calibration distribution.
  • lm_head excluded: the output projection is kept in BF16 to preserve logit precision.
  • More accuracy loss than W8A8: 4-bit weights introduce more quantization error than 8-bit. W8A8 is the better choice when accuracy is the priority and a ~2x memory saving is sufficient.
  • Evaluation pending: formal benchmarks against the BF16 base have not yet been run.

When to use this vs the W8A8 version

| Situation | Recommended model |
|---|---|
| GPU with < 4GB VRAM | W4A16 (this model) |
| Maximum memory savings matter | W4A16 (this model) |
| Single-user low-latency inference | W4A16 (this model) |
| High-throughput batched serving | W8A8 version |
| Accuracy is the priority | W8A8 version |
| INT8 tensor core compute speedup needed | W8A8 version |

Related Models

| Model | Scheme | Size | Link |
|---|---|---|---|
| SmolLM2-1.7B-Instruct (base) | BF16 | ~3.4GB | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| SmolLM2-1.7B-W8A8-Instruct | INT8 W+A | ~1.7GB | nakue/SmolLM2-1.7B-W8A8-instruct |
| SmolLM2-1.7B-W4A16-Instruct | INT4 W | ~0.85GB | This model |
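The sizes above follow from parameter count times bits per weight. The sketch below reproduces that arithmetic and also shows how per-group scales add a small overhead. The group size of 128 and the 16-bit scales are assumptions for illustration, and the estimate ignores layers kept in BF16 (such as lm_head), so files on disk will be somewhat larger than these nominal figures.

```python
def effective_bits_per_weight(weight_bits, group_size=None, scale_bits=16):
    """Bits stored per weight, including one shared scale per group when grouped."""
    if group_size is None:
        return float(weight_bits)
    return weight_bits + scale_bits / group_size


def nominal_size_gb(num_params, bits_per_weight):
    """Storage in GB (1e9 bytes) if every parameter used bits_per_weight bits."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 1.7e9  # approximate parameter count from the model card

for name, bits in [("BF16", 16), ("W8A8", 8), ("W4A16", 4)]:
    print(f"{name}: ~{nominal_size_gb(NUM_PARAMS, bits):.2f} GB")

# Assumption: group size 128 with 16-bit scales, used only for illustration.
grouped_bits = effective_bits_per_weight(4, group_size=128)
print(f"W4A16 with group scales: ~{nominal_size_gb(NUM_PARAMS, grouped_bits):.2f} GB")
```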

License

Apache 2.0, inherited from the base model. See LICENSE.


Citation

@misc{smollm2,
  title  = {SmolLM2: When Smol Goes Big},
  author = {HuggingFaceTB},
  year   = {2024},
  url    = {https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}

Quantized by nakue · Portfolio · Part of an LLM inference optimization portfolio targeting production serving patterns.
