# Mistral 7B Instruct v0.2 (BigSmall compressed)
14.5 GB -> 8.86 GB (BF16). Lossless. Zero inference overhead. Any hardware.
Compressed with BigSmall -- decompresses once at load time, runs at full native speed. Every weight is bit-identical to the original.
## Quick start
```bash
pip install bigsmall
```
```python
import bigsmall

# Register BigSmall's hook so compressed checkpoints decompress at load time
bigsmall.install_hook()

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/mistral-7b-instruct-v0.2-bigsmall"
)
```
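Once loaded, this is an ordinary `transformers` model and generation works unchanged. A minimal sketch, assuming the tokenizer ships in the same repo as the compressed weights:

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer is published alongside the compressed checkpoint
tokenizer = AutoTokenizer.from_pretrained("wpferrell/mistral-7b-instruct-v0.2-bigsmall")

prompt = "[INST] Explain lossless model compression in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```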
## Streaming loader -- run on any hardware
BigSmall's streaming loader decompresses one layer at a time directly into VRAM. Peak memory is one layer -- not the whole model. A 4 GB GPU can run Mistral 7B losslessly.
```python
from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM

# Decompresses one layer at a time directly into VRAM; peak memory is one layer
with StreamingLoader("wpferrell/mistral-7b-instruct-v0.2-bigsmall", device="cuda") as loader:
    model = loader.load_model(AutoModelForCausalLM)
```
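The same pattern should extend to CPU-only or Apple Silicon machines by changing the `device` argument. This is an assumption extrapolated from the CUDA example above, not a documented call:

```python
from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM

# Assumed variant: device="cpu" is extrapolated from the device="cuda" example
with StreamingLoader("wpferrell/mistral-7b-instruct-v0.2-bigsmall", device="cpu") as loader:
    model = loader.load_model(AutoModelForCausalLM)
```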
| Your GPU | Models you can run |
|---|---|
| 2 GB | Small models, GPT-2, Gemma 270M |
| 4 GB | Mistral 7B, Llama 3.1 8B, Gemma 2B, Llama 3.2 3B |
| 8 GB | Qwen 2.5 14B, Gemma 2 9B |
| 24 GB | Llama 70B, Qwen 72B, DeepSeek V4-Flash |
| CPU only | Everything -- slower but full quality |
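The 4 GB row checks out with back-of-envelope arithmetic: in BF16, a single Mistral 7B decoder layer is well under half a gigabyte, so one layer plus activations fits comfortably. A rough estimate using the public Mistral 7B layer shapes (illustrative only, not BigSmall-specific):

```python
# Per-layer BF16 footprint for Mistral 7B: hidden=4096, GQA kv dim=1024,
# MLP intermediate=14336, 32 decoder layers
hidden, kv, inter = 4096, 1024, 14336
attn = hidden * hidden * 2 + hidden * kv * 2   # q/o projections + k/v projections
mlp = hidden * inter * 3                       # gate, up, down projections
per_layer_params = attn + mlp
per_layer_gb = per_layer_params * 2 / 1e9      # 2 bytes per BF16 weight
print(f"~{per_layer_params / 1e6:.0f}M params/layer, ~{per_layer_gb:.2f} GB in BF16")
# -> ~218M params/layer, ~0.44 GB: streamed one layer at a time, a 4 GB card suffices
```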
BigSmall is the only lossless compression tool with a streaming loader. DFloat11 and ZipNN load the full model into memory.
## Why BigSmall vs DFloat11
| | BigSmall | DFloat11 |
|---|---|---|
| Inference overhead | None | ~2x at batch=1 |
| Hardware | CPU, Apple Silicon, AMD, any GPU | CUDA only |
| FP32 support | Yes | No |
| Fine-tuning safe | Yes | No |
| Streaming loader | Yes -- peak RAM < 2 GB | No |
## Why BigSmall vs quantization
Quantization (INT8/INT4) shrinks models further but permanently alters the weights. BigSmall is lossless: bit-identical weights, no accuracy loss, fine-tuning safe, reproducible outputs.
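Lossless is checkable. If you have the RAM and disk for both checkpoints, you can compare every tensor of the decompressed model against the original bit for bit. A verification sketch using standard `torch` equality (not a BigSmall API):

```python
import bigsmall
import torch
from transformers import AutoModelForCausalLM

bigsmall.install_hook()
compressed = AutoModelForCausalLM.from_pretrained("wpferrell/mistral-7b-instruct-v0.2-bigsmall")
original = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Every tensor must match exactly -- lossless means bit-identical, not "close"
ref = original.state_dict()
for name, tensor in compressed.state_dict().items():
    assert torch.equal(tensor, ref[name]), name
print("all tensors bit-identical")
```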
## Compression stats
| Original | Compressed | Size vs original | Format | Verified |
|---|---|---|---|---|
| 14.5 GB | 8.86 GB | 61.1% | BF16 | MD5 on every tensor |
## Links
- GitHub: wpferrell/Bigsmall
- All models: huggingface.co/wpferrell