# Mistral 7B Instruct v0.2 (BigSmall compressed)

14.5 GB -> 8.86 GB (BF16). Lossless. Zero inference overhead. Any hardware.

Compressed with BigSmall -- decompresses once at load time, runs at full native speed. Every weight is bit-identical to the original.

## Quick start

```bash
pip install bigsmall
```

```python
import bigsmall
bigsmall.install_hook()

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("wpferrell/mistral-7b-instruct-v0.2-bigsmall")
```
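Once the hook is installed, the loaded model should behave like any other `transformers` model. As an illustration, here is a typical generation call (a sketch; it assumes the compressed repo also ships the original tokenizer files, and the prompt simply follows the standard Mistral Instruct `[INST] ... [/INST]` format):

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer is bundled with the compressed repo.
tokenizer = AutoTokenizer.from_pretrained("wpferrell/mistral-7b-instruct-v0.2-bigsmall")

prompt = "[INST] Summarize what lossless model compression means. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```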

## Streaming loader -- run on any hardware

BigSmall's streaming loader decompresses one layer at a time directly into VRAM. Peak memory is one layer -- not the whole model. A 4 GB GPU can run Mistral 7B losslessly.

```python
from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM

with StreamingLoader("wpferrell/mistral-7b-instruct-v0.2-bigsmall", device="cuda") as loader:
    model = loader.load_model(AutoModelForCausalLM)
```
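To illustrate why peak memory stays at roughly one layer, here is a minimal conceptual sketch of a layer-at-a-time loading loop. It is not the BigSmall implementation; `compressed_layers` and `decompress_layer` are hypothetical placeholders for whatever the on-disk format actually stores.

```python
import torch

def stream_weights(model, compressed_layers, device="cuda"):
    # Hypothetical sketch: decode one layer on CPU, copy it to the GPU,
    # then drop the CPU copy before touching the next layer.
    for name, blob in compressed_layers:      # one compressed layer at a time
        tensor = decompress_layer(blob)       # hypothetical lossless decode on CPU
        param = model.get_parameter(name)
        param.data = tensor.to(device, non_blocking=True)
        del tensor                            # only one decoded layer alive at a time
```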

| Your GPU | Models you can run |
|----------|--------------------|
| 2 GB | Small models, GPT-2, Gemma 270M |
| 4 GB | Mistral 7B, Llama 3.1 8B, Gemma 2B, Llama 3.2 3B |
| 8 GB | Qwen 2.5 14B, Gemma 2 9B |
| 24 GB | Llama 70B, Qwen 72B, DeepSeek V4-Flash |
| CPU only | Everything -- slower but full quality |

BigSmall is the only lossless compression tool with a streaming loader. DFloat11 and ZipNN load the full model into memory.

## Why BigSmall vs DFloat11

| | BigSmall | DFloat11 |
|---|----------|----------|
| Inference overhead | None | ~2x at batch=1 |
| Hardware | CPU, Apple Silicon, AMD, any GPU | CUDA only |
| FP32 support | Yes | No |
| Fine-tuning safe | Yes | No |
| Streaming loader | Yes -- peak RAM < 2 GB | No |

## Why BigSmall vs quantization

Lossless -- bit-identical weights, no accuracy loss, fine-tuning safe, reproducible outputs.
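If you want to check the bit-identical claim yourself, one way is to compare the decompressed weights against the upstream `mistralai/Mistral-7B-Instruct-v0.2` checkpoint tensor by tensor. This is a sketch under the assumption that you can download the original weights and have enough RAM to hold both state dicts:

```python
import bigsmall
import torch
from transformers import AutoModelForCausalLM

bigsmall.install_hook()

compressed = AutoModelForCausalLM.from_pretrained(
    "wpferrell/mistral-7b-instruct-v0.2-bigsmall"
)
original = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.bfloat16
)

# Lossless means every tensor compares bit-identical after decompression.
compressed_sd = compressed.state_dict()
for name, tensor in original.state_dict().items():
    assert torch.equal(tensor, compressed_sd[name]), f"mismatch in {name}"
print("all tensors bit-identical")
```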

## Compression stats

| Original | Compressed | Ratio | Format | Verified |
|----------|------------|-------|--------|----------|
| 14.5 GB | 8.86 GB | 61.1% | BF16 | MD5 on every tensor |
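Per-tensor checksums like the ones referenced in the "Verified" column can be reproduced along these lines (a sketch; `model` is the object loaded in the quick start, and the reference hashes themselves are not listed in this card):

```python
import hashlib
import torch

def tensor_md5(t: torch.Tensor) -> str:
    # Hash the raw bytes; viewing as uint8 also covers BF16, which NumPy cannot represent.
    data = t.detach().cpu().contiguous().view(torch.uint8)
    return hashlib.md5(data.numpy().tobytes()).hexdigest()

for name, tensor in model.state_dict().items():
    print(name, tensor_md5(tensor))
```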