# Qwen 3 4B Instruct (BigSmall compressed)

**8.06 GB -> 4.95 GB (BF16). Lossless. Zero inference overhead. Any hardware.**
Compressed with BigSmall -- decompresses once at load time, runs at full native speed. Every weight is bit-identical to the original.
## Why BigSmall
### vs DFloat11

| | BigSmall | DFloat11 |
|---|---|---|
| Inference overhead | None | ~2x at batch=1 |
| Hardware | CPU, Apple Silicon, AMD, any GPU | CUDA only |
| Fine-tuning safe | Yes | No |
### vs quantization

Quantization trades accuracy for size. BigSmall is lossless -- bit-identical weights, no accuracy loss, fine-tuning safe, reproducible outputs.
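Because the weights are bit-identical, you can check the claim yourself. A minimal sketch, using the loading API from the Install & Load section below; it assumes both checkpoints fit in memory, and `Qwen/Qwen3-4B-Instruct-2507` is our guess at the original repo id (substitute whichever checkpoint was compressed):

```python
import bigsmall
import torch
from transformers import AutoModelForCausalLM

bigsmall.install_hook()

# Load the BigSmall archive and (assumption) the original upstream checkpoint.
compressed = AutoModelForCausalLM.from_pretrained(
    "wpferrell/qwen3-4b-instruct-bigsmall", torch_dtype=torch.bfloat16
)
original = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", torch_dtype=torch.bfloat16
)

# Every tensor should match bit-for-bit.
ref = original.state_dict()
for name, tensor in compressed.state_dict().items():
    assert torch.equal(tensor, ref[name]), f"mismatch in {name}"
print("all weights bit-identical")
```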
## Install & Load
```bash
pip install bigsmall
```
```python
import bigsmall

# Patch the transformers loading path so BigSmall archives
# decompress transparently (once, at load time).
bigsmall.install_hook()

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("wpferrell/qwen3-4b-instruct-bigsmall")
```
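After loading, `model` is an ordinary `transformers` model; nothing about inference changes. A quick generation sketch, assuming the repo ships the tokenizer files alongside the compressed weights:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wpferrell/qwen3-4b-instruct-bigsmall")

messages = [{"role": "user", "content": "Explain lossless compression in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```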
## Stats
| Original | Compressed | Compressed / original | Format |
|---|---|---|---|
| 8.06 GB | 4.95 GB | 61.4% | BF16 |
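That is, 4.95 / 8.06 ≈ 0.614: the compressed files take 61.4% of the original space, a ~1.63x compression factor.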
## Links

- GitHub: wpferrell/Bigsmall
- All models: huggingface.co/wpferrell