SmolLM2-1.7B-W4A16-Instruct (INT4 Weight-Only Quantized)

A W4A16 (4-bit weights, 16-bit activations) quantized version of HuggingFaceTB/SmolLM2-1.7B-Instruct, produced using llm-compressor with the compressed-tensors format.

W4A16 is a weight-only quantization scheme: weights are stored in INT4 and dequantized to BF16 at runtime as part of the matrix multiply. The quantized weights are roughly 4x smaller than their BF16 originals, so the main gains are a smaller memory footprint and less memory traffic per generated token, while compute stays in BF16. This makes W4A16 a good fit for memory-constrained or latency-sensitive single-user scenarios, where fitting the model in VRAM or streaming weights from memory is the bottleneck.
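To make the dequantize-then-multiply idea concrete, here is a minimal pure-Python sketch of symmetric, group-wise 4-bit quantization. It is an illustration only and not the llm-compressor or compressed-tensors implementation; the helper names (`quantize_group`, `dequantize_group`, `matvec_w4a16`), the toy weights, and the group size of 4 are all assumptions chosen for readability.

```python
def quantize_group(weights, num_bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns (integer codes, scale). Hypothetical helper for illustration.
    """
    qmax = 2 ** (num_bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (num_bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate floating-point weights from codes and a scale."""
    return [c * scale for c in codes]


def matvec_w4a16(groups, activations):
    """Dot product of one weight row (stored as quantized groups) with activations.

    Weights are dequantized on the fly; the multiply-accumulate itself runs in
    floating point, mirroring the weight-only pattern described above.
    """
    total = 0.0
    offset = 0
    for codes, scale in groups:
        weights = dequantize_group(codes, scale)
        for w, a in zip(weights, activations[offset:offset + len(codes)]):
            total += w * a
        offset += len(codes)
    return total


# Toy data: one weight row of 8 values and a matching activation vector.
row = [0.12, -0.70, 0.33, 0.06, 1.40, -0.20, 0.00, 0.85]
group_size = 4  # assumption for the demo; real schemes use much larger groups
groups = [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]
activations = [1.0, 0.5, -1.0, 2.0, 0.25, 1.0, 3.0, -0.5]

exact = sum(w * a for w, a in zip(row, activations))
approx = matvec_w4a16(groups, activations)
print(f"exact={exact:.4f} approx={approx:.4f} abs_error={abs(exact - approx):.4f}")
```

Each group keeps its own scale, so a single large weight only coarsens the resolution of its own group rather than the whole row.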

Model Details

| Property | Value |
| --- | --- |
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Architecture | LlamaForCausalLM |
| Parameters | ~1.7B |
| Quantization scheme | W4A16 (INT4 weights, BF16 activations) |
| Excluded layers | `lm_head` (kept in BF16) |
| Format | `compressed-tensors` (Safetensors) |
| Calibration dataset | `ultrachat` (512 samples, max_seq_length 2048) |
| Quantization tool | llm-compressor |

W4A16 vs Other Schemes

| Scheme | Weight bits | Activation bits | Memory saving | Compute speedup | Best for |
| --- | --- | --- | --- | --- | --- |
| BF16 (base) | 16 | 16 | n/a | n/a | Accuracy baseline |
| W4A16 (this model) | 4 | 16 | ~4x | Memory-bound only | Small GPUs, low-latency single user |
| W8A16 | 8 | 16 | ~2x | Memory-bound only | Mild memory pressure |
| W8A8 | 8 | 8 | ~2x | ✅ Compute (INT8 cores) | High-throughput batched serving |

Key insight: W4A16 typically gives up more accuracy than W8A8 in exchange for a larger memory reduction (~4x vs ~2x). It does not use INT4 tensor cores at runtime; the dequantize-then-multiply pattern keeps compute in BF16. Choose W4A16 when fitting the model matters more than maximizing throughput.
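The memory figures above can be sanity-checked with a back-of-envelope calculation. The sketch below is only an estimate under stated assumptions: the split into 1.6B quantized and 0.1B unquantized parameters is a round-number guess rather than a measurement of this checkpoint, and the group size of 128 with one 16-bit scale per group is an assumption about the 4-bit scheme.

```python
def weight_bytes(num_params, weight_bits, group_size=None, scale_bits=16):
    """Approximate storage for a set of weights.

    If group_size is given, one scale of `scale_bits` bits is stored per group.
    """
    total_bits = num_params * weight_bits
    if group_size:
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8


# Assumed round numbers, NOT measurements of this checkpoint.
QUANTIZED_PARAMS = 1.6e9    # Linear layers targeted by quantization (assumption)
UNQUANTIZED_PARAMS = 0.1e9  # tensors left in BF16, e.g. embeddings (assumption)
GROUP_SIZE = 128            # assumed group size for the 4-bit scheme


def model_gb(weight_bits, group_size=None):
    """Estimated weight storage in GB for a given bit width."""
    quantized = weight_bytes(QUANTIZED_PARAMS, weight_bits, group_size)
    kept = weight_bytes(UNQUANTIZED_PARAMS, 16)
    return (quantized + kept) / 1e9


for name, bits, group in [("BF16", 16, None), ("8-bit weights", 8, None), ("W4A16", 4, GROUP_SIZE)]:
    print(f"{name:>13}: ~{model_gb(bits, group):.2f} GB of weights")
```

Under these assumptions the quantized layers shrink by close to 4x, while the whole-checkpoint ratio comes out somewhat lower because unquantized tensors and per-group scales add overhead.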

How to Use

Option 1 — vLLM (recommended for serving)

pip install vllm

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"

llm = LLM(
    model=model_id,
    quantization="compressed-tensors",
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the difference between W4A16 and W8A8 quantization?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Option 2 — Transformers + compressed-tensors (no vLLM)

Install the compressed-tensors runtime — Transformers auto-detects the quantization config from config.json:

pip install compressed-tensors transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in simple terms."},
]

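# return_dict=True yields input_ids and attention_mask regardless of the
# default return type of the installed Transformers version
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))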

Note: W4A16 dequantizes weights to BF16 at runtime — compute stays in BF16. You get ~4x memory reduction; latency gains depend on whether your workload is memory-bandwidth-bound.
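Whether a workload is memory-bandwidth-bound can be reasoned about with a simple roofline-style bound: at batch size 1, each generated token has to stream the full set of weights from memory once, so decode speed cannot exceed bandwidth divided by weight size. The sketch below uses the approximate sizes quoted in this card and a hypothetical bandwidth of 300 GB/s; it ignores the KV cache, activations, and kernel overhead, so real throughput will be lower.

```python
def max_decode_tokens_per_s(weight_gb, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when every generated token
    must read all weights from memory once (batch size 1)."""
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 300.0  # hypothetical GPU memory bandwidth, not a measured value
weights_gb = {"BF16": 3.4, "W4A16": 0.85}  # approximate sizes quoted in this card

for scheme, size in weights_gb.items():
    bound = max_decode_tokens_per_s(size, BANDWIDTH_GB_S)
    print(f"{scheme}: at most ~{bound:.0f} tokens/s (bandwidth bound only)")
```

The bound scales inversely with weight size, which is why the benefit shows up mainly in single-user decoding and fades once large batches make the workload compute-bound.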

Option 3 — llmcompressor (same library used to quantize)

pip install llmcompressor

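# Note: SparseAutoModelForCausalLM is deprecated in newer llm-compressor
# releases; there, load with transformers.AutoModelForCausalLM as in Option 2.
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer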

model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("Explain INT4 quantization:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 4 — Dequantize to BF16 (no quantization runtime)

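If you need to run with plain Transformers and no quantization runtime at inference time, convert the checkpoint once with llmcompressor:

pip install llmcompressor

from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading through llmcompressor decompresses the INT4 weights into BF16 tensors
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)

# Assumption: if the saved config.json still contains a quantization_config,
# re-save with save_compressed=False so that dense weights are written.
model.save_pretrained("smollm2-bf16-dequantized")
tokenizer.save_pretrained("smollm2-bf16-dequantized")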

Then load smollm2-bf16-dequantized with plain AutoModelForCausalLM. Memory savings are lost but there are zero runtime dependencies.

Quantization Recipe

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],  # keep output projection in BF16
)

oneshot(
    model=model,
    dataset="ultrachat",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("SmolLM2-1.7B-W4A16-instruct")
tokenizer.save_pretrained("SmolLM2-1.7B-W4A16-instruct")
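After saving, one quick sanity check is to read the `quantization_config` entry that gets written into `config.json`. The sketch below summarizes such an entry. The key names (`quant_method`, `config_groups`, `weights`, `num_bits`, `input_activations`, `ignore`) reflect the compressed-tensors layout as I understand it and may differ between versions; the example fragment is hand-written for illustration and is not copied from this checkpoint.

```python
import json


def summarize_quantization(config):
    """Return a short summary of the weight quantization settings found in a
    Hugging Face config dict. Key names are assumptions about the
    compressed-tensors layout and may differ between versions."""
    qcfg = config.get("quantization_config")
    if qcfg is None:
        return {"quantized": False}
    summary = {
        "quantized": True,
        "method": qcfg.get("quant_method"),
        "ignored": qcfg.get("ignore", []),
        "groups": [],
    }
    for name, group in qcfg.get("config_groups", {}).items():
        weights = group.get("weights") or {}
        summary["groups"].append({
            "name": name,
            "targets": group.get("targets", []),
            "weight_bits": weights.get("num_bits"),
            # Weight-only schemes leave activations unquantized
            "activations_quantized": group.get("input_activations") is not None,
        })
    return summary


# Illustrative fragment (hand-written, not read from the real checkpoint).
example = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "ignore": ["lm_head"],
        "config_groups": {
            "group_0": {
                "targets": ["Linear"],
                "weights": {"num_bits": 4, "type": "int", "symmetric": True},
                "input_activations": None,
            }
        },
    }
}

print(json.dumps(summarize_quantization(example), indent=2))
```

To inspect a real output directory, load its `config.json` with `json.load` and pass the resulting dict to `summarize_quantization`.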

Evaluation

⚠️ Evaluation pending. Accuracy vs. the BF16 base has not yet been formally benchmarked. Results will be added once lm-evaluation-harness evals complete.

Planned evaluation setup:

# BF16 baseline
lm_eval --model hf \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/baseline

# W4A16
lm_eval --model hf \
  --model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-instruct,dtype=bfloat16" \
  --tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
  --num_fewshot 0 --batch_size 32 --output_path results/w4a16

Results table (to be filled):

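| Task | BF16 Base | W4A16 (this model) | Delta |
| --- | --- | --- | --- |
| HellaSwag (acc_norm) | TBD | TBD | TBD |
| WinoGrande (acc) | TBD | TBD | TBD |
| ARC-Easy (acc_norm) | TBD | TBD | TBD |
| ARC-Challenge (acc_norm) | TBD | TBD | TBD |
| PIQA (acc_norm) | TBD | TBD | TBD |
| WikiText-2 PPL ↓ | TBD | TBD | TBD |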
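Once both runs have finished, the Delta column is simply the quantized score minus the baseline score for each task. The helper below is a minimal sketch of that bookkeeping. It assumes each run's scores are available as a nested mapping of task name to metric name to value; the task name, the metric key `acc_norm,none`, and the numbers are dummy placeholders for demonstration, not measured results, and the exact key format in lm-evaluation-harness output files should be checked against your installed version.

```python
def metric_delta(baseline, quantized, task, metric):
    """Return (baseline, quantized, delta) for one task/metric pair."""
    base = baseline[task][metric]
    quant = quantized[task][metric]
    return base, quant, quant - base


def format_rows(baseline, quantized, wanted):
    """Build tab-separated rows: task, baseline, quantized, signed delta."""
    rows = []
    for task, metric in wanted:
        base, quant, delta = metric_delta(baseline, quantized, task, metric)
        rows.append(f"{task}\t{base:.4f}\t{quant:.4f}\t{delta:+.4f}")
    return rows


# Dummy placeholder values for demonstration only; NOT evaluation results.
baseline = {"example_task": {"acc_norm,none": 0.5}}
quantized = {"example_task": {"acc_norm,none": 0.25}}
wanted = [("example_task", "acc_norm,none")]

for row in format_rows(baseline, quantized, wanted):
    print(row)
```

For perplexity-style metrics, where lower is better, a positive delta indicates a regression, so the sign has to be read per metric.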

Limitations

Weight-only quantization — activations remain in BF16; this does not use INT4 tensor cores. Runtime compute is BF16.
Static calibration on ultrachat — accuracy may degrade on domains far from the calibration distribution.
lm_head excluded — the output projection is kept in BF16 to preserve logit precision.
More accuracy loss than W8A8 — 4-bit weights introduce more quantization error than 8-bit. W8A8 is the better choice when accuracy is the priority and a ~2x memory saving is sufficient.
Evaluation pending — formal benchmarks against the BF16 base have not yet been run.

When to use this vs the W8A8 version

| Situation | Recommended model |
| --- | --- |
| GPU with < 4GB VRAM | W4A16 (this model) |
| Maximum memory savings matter | W4A16 (this model) |
| Single-user low-latency inference | W4A16 (this model) |
| High-throughput batched serving | W8A8 version |
| Accuracy is the priority | W8A8 version |
| INT8 tensor core compute speedup needed | W8A8 version |

Related Models

| Model | Scheme | Size | Link |
| --- | --- | --- | --- |
| SmolLM2-1.7B-Instruct (base) | BF16 | ~3.4GB | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| SmolLM2-1.7B-W8A8-Instruct | INT8 W+A | ~1.7GB | nakue/SmolLM2-1.7B-W8A8-instruct |
| SmolLM2-1.7B-W4A16-Instruct | INT4 W | ~0.85GB | This model |

License

Apache 2.0 — inherited from the base model. See LICENSE.

Citation

@misc{smollm2,
  title  = {SmolLM2: When Smol Goes Big},
  author = {HuggingFaceTB},
  year   = {2024},
  url    = {https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}

Quantized by nakue · Portfolio · Part of an LLM inference optimization portfolio targeting production serving patterns.
