# Mistral 7B Instruct v0.2 (BigSmall compressed)
14.5 GB -> 8.86 GB (BF16). Lossless. Zero inference overhead. Any hardware.
Compressed with BigSmall -- decompresses once at load time, runs at full native speed. Every weight is bit-identical to the original.
## Quick start
```bash
pip install bigsmall
```
```python
import bigsmall

# Register BigSmall's hook so compressed checkpoints decompress at load time
bigsmall.install_hook()

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/mistral-7b-instruct-v0.2-bigsmall"
)
```
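Once loaded, this is an ordinary `transformers` model and generation works unchanged. A minimal sketch, assuming the tokenizer ships in the same repo as the compressed weights:

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer is published alongside the compressed checkpoint
tokenizer = AutoTokenizer.from_pretrained("wpferrell/mistral-7b-instruct-v0.2-bigsmall")

prompt = "[INST] Explain lossless model compression in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```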
## Streaming loader -- run on any hardware
BigSmall's streaming loader decompresses one layer at a time directly into VRAM. Peak memory is one layer -- not the whole model. A 4 GB GPU can run Mistral 7B losslessly.
```python
from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM

# Decompresses one layer at a time directly into VRAM; peak memory is one layer
with StreamingLoader("wpferrell/mistral-7b-instruct-v0.2-bigsmall", device="cuda") as loader:
    model = loader.load_model(AutoModelForCausalLM)
```
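The same pattern should extend to CPU-only or Apple Silicon machines by changing the `device` argument. This is an assumption extrapolated from the CUDA example above, not a documented call:

```python
from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM

# Assumed variant: device="cpu" is extrapolated from the device="cuda" example
with StreamingLoader("wpferrell/mistral-7b-instruct-v0.2-bigsmall", device="cpu") as loader:
    model = loader.load_model(AutoModelForCausalLM)
```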
| Your GPU | Models you can run |
|---|---|
| 2 GB | Small models, GPT-2, Gemma 270M |
| 4 GB | Mistral 7B, Llama 3.1 8B, Gemma 2B, Llama 3.2 3B |
| 8 GB | Qwen 2.5 14B, Gemma 2 9B |
| 24 GB | Llama 70B, Qwen 72B, DeepSeek V4-Flash |
| CPU only | Everything -- slower but full quality |
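The 4 GB row checks out with back-of-envelope arithmetic: in BF16, a single Mistral 7B decoder layer is well under half a gigabyte, so one layer plus activations fits comfortably. A rough estimate using the public Mistral 7B layer shapes (illustrative only, not BigSmall-specific):

```python
# Per-layer BF16 footprint for Mistral 7B: hidden=4096, GQA kv dim=1024,
# MLP intermediate=14336, 32 decoder layers
hidden, kv, inter = 4096, 1024, 14336
attn = hidden * hidden * 2 + hidden * kv * 2   # q/o projections + k/v projections
mlp = hidden * inter * 3                       # gate, up, down projections
per_layer_params = attn + mlp
per_layer_gb = per_layer_params * 2 / 1e9      # 2 bytes per BF16 weight
print(f"~{per_layer_params / 1e6:.0f}M params/layer, ~{per_layer_gb:.2f} GB in BF16")
# -> ~218M params/layer, ~0.44 GB: streamed one layer at a time, a 4 GB card suffices
```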
BigSmall is the only lossless compression tool with a streaming loader. DFloat11 and ZipNN load the full model into memory.
## Why BigSmall vs DFloat11
| | BigSmall | DFloat11 |
|---|---|---|
| Inference overhead | None | ~2x at batch=1 |
| Hardware | CPU, Apple Silicon, AMD, any GPU | CUDA only |
| FP32 support | Yes | No |
| Fine-tuning safe | Yes | No |
| Streaming loader | Yes -- peak RAM < 2 GB | No |
## Why BigSmall vs quantization
Quantization (INT8/INT4) shrinks models further but permanently alters the weights. BigSmall is lossless: bit-identical weights, no accuracy loss, fine-tuning safe, reproducible outputs.
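Lossless is checkable. If you have the RAM and disk for both checkpoints, you can compare every tensor of the decompressed model against the original bit for bit. A verification sketch using standard `torch` equality (not a BigSmall API):

```python
import bigsmall
import torch
from transformers import AutoModelForCausalLM

bigsmall.install_hook()
compressed = AutoModelForCausalLM.from_pretrained("wpferrell/mistral-7b-instruct-v0.2-bigsmall")
original = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Every tensor must match exactly -- lossless means bit-identical, not "close"
ref = original.state_dict()
for name, tensor in compressed.state_dict().items():
    assert torch.equal(tensor, ref[name]), name
print("all tensors bit-identical")
```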
## Compression stats
| Original | Compressed | Size vs original | Format | Verified |
|---|---|---|---|---|
| 14.5 GB | 8.86 GB | 61.1% | BF16 | MD5 on every tensor |
## Links
- GitHub: wpferrell/Bigsmall
- All models: huggingface.co/wpferrell