# Qwen 3 4B Instruct (BigSmall compressed)

**8.06 GB -> 4.95 GB (BF16). Lossless. Zero inference overhead. Any hardware.**
Compressed with BigSmall -- decompresses once at load time, runs at full native speed. Every weight is bit-identical to the original.
## Why BigSmall
### vs DFloat11

| | BigSmall | DFloat11 |
|---|---|---|
| Inference overhead | None | ~2x at batch=1 |
| Hardware | CPU, Apple Silicon, AMD, any GPU | CUDA only |
| Fine-tuning safe | Yes | No |
### vs quantization

Quantization trades accuracy for size. BigSmall is lossless -- bit-identical weights, no accuracy loss, fine-tuning safe, reproducible outputs.
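Because the weights are bit-identical, you can check the claim yourself. A minimal sketch, using the loading API from the Install & Load section below; it assumes both checkpoints fit in memory, and `Qwen/Qwen3-4B-Instruct-2507` is our guess at the original repo id (substitute whichever checkpoint was compressed):

```python
import bigsmall
import torch
from transformers import AutoModelForCausalLM

bigsmall.install_hook()

# Load the BigSmall archive and (assumption) the original upstream checkpoint.
compressed = AutoModelForCausalLM.from_pretrained(
    "wpferrell/qwen3-4b-instruct-bigsmall", torch_dtype=torch.bfloat16
)
original = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", torch_dtype=torch.bfloat16
)

# Every tensor should match bit-for-bit.
ref = original.state_dict()
for name, tensor in compressed.state_dict().items():
    assert torch.equal(tensor, ref[name]), f"mismatch in {name}"
print("all weights bit-identical")
```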
## Install & Load
```bash
pip install bigsmall
```
```python
import bigsmall

# Patch the transformers loading path so BigSmall archives
# decompress transparently (once, at load time).
bigsmall.install_hook()

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("wpferrell/qwen3-4b-instruct-bigsmall")
```
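After loading, `model` is an ordinary `transformers` model; nothing about inference changes. A quick generation sketch, assuming the repo ships the tokenizer files alongside the compressed weights:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wpferrell/qwen3-4b-instruct-bigsmall")

messages = [{"role": "user", "content": "Explain lossless compression in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```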
## Stats
| Original | Compressed | Compressed / original | Format |
|---|---|---|---|
| 8.06 GB | 4.95 GB | 61.4% | BF16 |
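That is, 4.95 / 8.06 ≈ 0.614: the compressed files take 61.4% of the original space, a ~1.63x compression factor.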
## Links

- GitHub: wpferrell/Bigsmall
- All models: huggingface.co/wpferrell