SmolLM2-1.7B-W4A16-Instruct (INT4 Weight-Only Quantized)
A W4A16 (4-bit weights, 16-bit activations) quantized version of HuggingFaceTB/SmolLM2-1.7B-Instruct, produced using llm-compressor with the compressed-tensors format.
W4A16 is a weight-only quantization scheme: weights are stored in INT4 and dequantized to BF16 at runtime during the matrix multiply. The main benefit is a smaller memory footprint and less memory traffic (roughly 4x smaller weights), while compute stays in BF16. This makes W4A16 well suited to memory-constrained or latency-sensitive single-user scenarios where fitting the model in VRAM is the bottleneck.
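To make the dequantize-then-multiply idea concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization for a single group of weights. The function names, the one-scale-per-group layout, and the rounding rule are assumptions chosen for illustration; they are not taken from llm-compressor, whose actual scheme parameters and kernels may differ.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # range of a signed 4-bit integer


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one weight group: INT4 codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto code 7; fall back to 1.0 for an all-zero group.
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from the stored codes."""
    return [c * scale for c in codes]


def weight_only_dot(codes: List[int], scale: float, activations: List[float]) -> float:
    """Weight-only scheme: dequantize first, then multiply in floating point."""
    weights = dequantize_group(codes, scale)
    return sum(w * a for w, a in zip(weights, activations))


# Hypothetical values, only to exercise the functions.
group = [0.12, -0.70, 0.33, 0.04]
activations = [1.0, 0.5, -2.0, 4.0]

codes, scale = quantize_group(group)
print("codes:", codes, "scale:", scale)
print("quantized dot:", weight_only_dot(codes, scale, activations))
print("reference dot:", sum(w * a for w, a in zip(group, activations)))
```

The activations are never quantized in this sketch, which is the sense in which the scheme is "A16": only the stored weights shrink, and the arithmetic still runs at the higher precision.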
Model Details
| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Architecture | LlamaForCausalLM |
| Parameters | ~1.7B |
| Quantization scheme | W4A16: INT4 weights, BF16 activations |
| Excluded layers | lm_head (kept in BF16) |
| Format | compressed-tensors (Safetensors) |
| Calibration dataset | ultrachat (512 samples, max_seq_length 2048) |
| Quantization tool | llm-compressor |
W4A16 vs Other Schemes
| Scheme | Weight bits | Activation bits | Memory saving | Compute speedup | Best for |
|---|---|---|---|---|---|
| BF16 (base) | 16 | 16 | n/a | n/a | Accuracy baseline |
| W4A16 (this model) | 4 | 16 | ~4x | Memory-bound only | Small GPUs, low-latency single user |
| W8A16 | 8 | 16 | ~2x | Memory-bound only | Mild memory pressure |
| W8A8 | 8 | 8 | ~2x | Yes, compute (INT8 cores) | High-throughput batched serving |
Key insight: W4A16 gives up a bit more accuracy than W8A8 in exchange for a larger memory reduction (~4x vs ~2x). It does not use INT4 tensor cores at runtime; the dequantize-then-multiply pattern keeps compute in BF16. Choose W4A16 when fitting the model matters more than maximizing throughput.
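The ~4x and ~2x figures follow from simple arithmetic on the bit widths, and the same arithmetic gives a rough lower bound on decode latency when generation is memory-bound. The sketch below assumes every weight is read once per generated token. The bandwidth value is a hypothetical placeholder rather than a measurement, and the estimate ignores quantization scales, the BF16 lm_head, KV-cache traffic, and dequantization overhead.

```python
def weight_gigabytes(num_params: float, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def min_ms_per_token(num_params: float, bits_per_weight: int,
                     bandwidth_gb_per_s: float) -> float:
    """Lower bound on per-token decode time if all weights are read once."""
    return weight_gigabytes(num_params, bits_per_weight) / bandwidth_gb_per_s * 1e3


NUM_PARAMS = 1.7e9          # approximate parameter count from the model card
BANDWIDTH_GB_PER_S = 300.0  # hypothetical memory bandwidth, not a measured value

for label, bits in [("16-bit weights", 16), ("8-bit weights", 8), ("4-bit weights", 4)]:
    size_gb = weight_gigabytes(NUM_PARAMS, bits)
    bound_ms = min_ms_per_token(NUM_PARAMS, bits, BANDWIDTH_GB_PER_S)
    print(f"{label}: ~{size_gb:.2f} GB, at least {bound_ms:.2f} ms per token")
```

Real checkpoints come out somewhat larger than this estimate because some tensors stay in BF16, so treat the numbers as an upper limit on the savings.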
How to Use
Option 1: vLLM (recommended for serving)
pip install vllm
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
llm = LLM(
model=model_id,
quantization="compressed-tensors",
dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the difference between W4A16 and W8A8 quantization?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
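For serving rather than offline generation, vLLM can also expose an OpenAI-compatible HTTP endpoint. The sketch below only builds the request with the standard library. It assumes a server was started separately with `vllm serve "nakue/SmolLM2-1.7B-W4A16-instruct"` and is listening on localhost port 8000. The helper names `build_chat_request` and `send_chat_request` are hypothetical, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000",  # assumed server address
    "nakue/SmolLM2-1.7B-W4A16-instruct",
    "What is the capital of France?",
)
print(request.full_url)
# With a server running, uncomment to query it:
# print(send_chat_request(request))
```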
Option 2: Transformers + compressed-tensors (no vLLM)
Install the compressed-tensors runtime; Transformers auto-detects the quantization config from config.json:
pip install compressed-tensors transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantization in simple terms."},
]
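inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,  # return input_ids and attention_mask together
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))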
Note: W4A16 dequantizes weights to BF16 at runtime, so compute stays in BF16. You get ~4x memory reduction; latency gains depend on whether your workload is memory-bandwidth-bound.
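Because the quantization settings are read from config.json, it can be useful to inspect that section before loading the model. The sketch below parses a hand-written sample. Its field names (`quantization_config`, `config_groups`, `num_bits`, `ignore`) reflect what compressed-tensors checkpoints commonly contain, but they are assumptions here, so check them against the actual file in the repository. The helper `summarize_quantization` is hypothetical.

```python
import json
from typing import Any, Dict, List

# Illustrative sample only; not copied from this repository's config.json.
SAMPLE_CONFIG_JSON = """
{
  "architectures": ["LlamaForCausalLM"],
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "ignore": ["lm_head"],
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {"num_bits": 4, "type": "int"},
        "input_activations": null
      }
    }
  }
}
"""


def summarize_quantization(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one summary per quantization group, or [] if the model is unquantized."""
    qcfg = config.get("quantization_config")
    if qcfg is None:
        return []
    summaries = []
    for name, group in qcfg.get("config_groups", {}).items():
        weights = group.get("weights") or {}
        summaries.append({
            "group": name,
            "method": qcfg.get("quant_method"),
            "weight_bits": weights.get("num_bits"),
            # Weight-only schemes leave the activation entry empty.
            "activations_quantized": group.get("input_activations") is not None,
            "ignored_modules": qcfg.get("ignore", []),
        })
    return summaries


for summary in summarize_quantization(json.loads(SAMPLE_CONFIG_JSON)):
    print(summary)
```

To inspect a real checkpoint, replace the sample string with the contents of the downloaded config.json.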
Option 3: llmcompressor (same library used to quantize)
pip install llmcompressor
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch
model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype="auto",
device_map="auto",
)
inputs = tokenizer("Explain INT4 quantization:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Option 4: Dequantize to BF16 (no quantization runtime)
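If you need to run with plain Transformers and no quantization runtime, convert the checkpoint once (the conversion step itself requires llmcompressor):
pip install llmcompressor
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch
model_id = "nakue/SmolLM2-1.7B-W4A16-instruct"
# The tokenizer is loaded so it can be saved next to the converted weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)
model.save_pretrained("smollm2-bf16-dequantized")
tokenizer.save_pretrained("smollm2-bf16-dequantized")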
Then load smollm2-bf16-dequantized with plain AutoModelForCausalLM. Memory savings are lost but there are zero runtime dependencies.
Quantization Recipe
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer
model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = SparseAutoModelForCausalLM.from_pretrained(
model_id, torch_dtype="auto", device_map="auto"
)
recipe = QuantizationModifier(
targets="Linear",
scheme="W4A16",
ignore=["lm_head"], # keep output projection in BF16
)
oneshot(
model=model,
dataset="ultrachat",
recipe=recipe,
max_seq_length=2048,
num_calibration_samples=512,
)
model.save_pretrained("SmolLM2-1.7B-W4A16-instruct")
tokenizer.save_pretrained("SmolLM2-1.7B-W4A16-instruct")
Evaluation
⚠️ Evaluation pending. Accuracy vs. the BF16 base has not yet been formally benchmarked. Results will be added once lm-evaluation-harness evals complete.
Planned evaluation setup:
# BF16 baseline
lm_eval --model hf \
--model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,dtype=bfloat16" \
--tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
--num_fewshot 0 --batch_size 32 --output_path results/baseline
# W4A16
lm_eval --model hf \
--model_args "pretrained=nakue/SmolLM2-1.7B-W4A16-instruct,dtype=bfloat16" \
--tasks hellaswag,winogrande,arc_easy,arc_challenge,piqa,wikitext \
--num_fewshot 0 --batch_size 32 --output_path results/w4a16
Results table (to be filled):
| Task | BF16 Base | W4A16 (this model) | Delta |
|---|---|---|---|
| HellaSwag (acc_norm) | TBD | TBD | TBD |
| WinoGrande (acc) | TBD | TBD | TBD |
| ARC-Easy (acc_norm) | TBD | TBD | TBD |
| ARC-Challenge (acc_norm) | TBD | TBD | TBD |
| PIQA (acc_norm) | TBD | TBD | TBD |
| WikiText-2 PPL ↓ | TBD | TBD | TBD |
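Once both runs finish, the table can be filled in mechanically. The sketch below assumes the scores have already been extracted into plain `task -> metric -> value` dictionaries. How to pull them out of the lm-evaluation-harness output files is left open, since the key names there vary by version, and `word_perplexity` as the WikiText metric name is likewise an assumption. No scores are included; with empty inputs every cell stays a placeholder.

```python
from typing import Dict, List, Optional, Tuple

Scores = Dict[str, Dict[str, float]]  # task -> metric -> value

# (row label, task name, metric name) for each row of the results table.
TABLE_ROWS: List[Tuple[str, str, str]] = [
    ("HellaSwag (acc_norm)", "hellaswag", "acc_norm"),
    ("WinoGrande (acc)", "winogrande", "acc"),
    ("ARC-Easy (acc_norm)", "arc_easy", "acc_norm"),
    ("ARC-Challenge (acc_norm)", "arc_challenge", "acc_norm"),
    ("PIQA (acc_norm)", "piqa", "acc_norm"),
    ("WikiText-2 PPL", "wikitext", "word_perplexity"),
]


def lookup(scores: Scores, task: str, metric: str) -> Optional[float]:
    """Return the score if present, otherwise None."""
    return scores.get(task, {}).get(metric)


def fmt(value: Optional[float], signed: bool = False) -> str:
    """Format a score, using a placeholder when it is missing."""
    if value is None:
        return "TBD"
    return f"{value:+.4f}" if signed else f"{value:.4f}"


def build_table(baseline: Scores, quantized: Scores) -> str:
    """Render the results table; Delta is quantized minus baseline."""
    lines = ["| Task | BF16 Base | W4A16 (this model) | Delta |", "|---|---|---|---|"]
    for label, task, metric in TABLE_ROWS:
        base = lookup(baseline, task, metric)
        quant = lookup(quantized, task, metric)
        delta = quant - base if base is not None and quant is not None else None
        lines.append(f"| {label} | {fmt(base)} | {fmt(quant)} | {fmt(delta, signed=True)} |")
    return "\n".join(lines)


# With no scores available yet, every cell is a placeholder.
print(build_table({}, {}))
```

Note that a negative delta is a regression for the accuracy rows but an improvement for perplexity, where lower is better.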
Limitations
- Weight-only quantization: activations remain in BF16, and INT4 tensor cores are not used. Runtime compute is BF16.
- Static calibration on ultrachat: accuracy may degrade on domains far from the calibration distribution.
- lm_head excluded: the output projection is kept in BF16 to preserve logit precision.
- More accuracy loss than W8A8: 4-bit weights introduce more quantization error than 8-bit. W8A8 is the better choice when accuracy is the priority and a ~2x memory saving is sufficient.
- Evaluation pending: formal benchmarks against the BF16 base have not yet been run.
When to use this vs the W8A8 version
| Situation | Recommended model |
|---|---|
| GPU with < 4GB VRAM | W4A16 (this model) |
| Maximum memory savings matter | W4A16 (this model) |
| Single-user low-latency inference | W4A16 (this model) |
| High-throughput batched serving | W8A8 version |
| Accuracy is the priority | W8A8 version |
| INT8 tensor core compute speedup needed | W8A8 version |
Related Models
| Model | Scheme | Size | Link |
|---|---|---|---|
| SmolLM2-1.7B-Instruct (base) | BF16 | ~3.4GB | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| SmolLM2-1.7B-W8A8-Instruct | INT8 W+A | ~1.7GB | nakue/SmolLM2-1.7B-W8A8-instruct |
| SmolLM2-1.7B-W4A16-Instruct | INT4 W | ~0.85GB | This model |
License
Apache 2.0, inherited from the base model. See LICENSE.
Citation
@misc{smollm2,
title = {SmolLM2: When Smol Goes Big},
author = {HuggingFaceTB},
year = {2024},
url = {https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct}
}
Quantized by nakue · Portfolio · Part of an LLM inference optimization portfolio targeting production serving patterns.