Gemma 3 270M (BigSmall compressed)

0.5 GB -> 0.33 GB (BF16). Lossless. Zero inference overhead. Any hardware.

Compressed with BigSmall -- decompresses once at load time, runs at full native speed. Every weight is bit-identical to the original.

Quick start

pip install bigsmall

Version compatibility: Models compressed with bigsmall 2.4.0+ may use container format v2 for high-kurtosis tensors and require bigsmall >= 2.4.0 to decompress. Run pip install --upgrade bigsmall to update.
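If you want to confirm the installed version before loading, a minimal sketch along these lines may help. It uses only the standard library and assumes the distribution name is bigsmall, as in the pip command above. The helper names (parse_version, can_read_v2, check_installed) are hypothetical and not part of the bigsmall API.

```python
from importlib import metadata

MIN_VERSION = (2, 4, 0)  # minimum release named in the compatibility note


def parse_version(text):
    """Turn '2.4.0' or '2.4.1rc1' into a 3-tuple of ints; suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad '2.4' to (2, 4, 0) so tuple comparison behaves
        parts.append(0)
    return tuple(parts[:3])


def can_read_v2(installed, minimum=MIN_VERSION):
    """True if the installed version string is at least the minimum."""
    return parse_version(installed) >= minimum


def check_installed(package="bigsmall"):
    """Return (installed_version_or_None, ok) for the given distribution name."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, can_read_v2(installed)
```

Calling check_installed() returns the installed version string and whether it meets the 2.4.0 minimum. If the second value is False, run the upgrade command above.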

import bigsmall
# Install the BigSmall hook first so from_pretrained can load the compressed checkpoint
bigsmall.install_hook()
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("wpferrell/gemma-3-270m-bigsmall")

Streaming loader -- run on any hardware

BigSmall's streaming loader decompresses one layer at a time directly into VRAM. Peak memory is one layer -- not the whole model. A 4 GB GPU can run Mistral 7B losslessly.

from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM

with StreamingLoader("wpferrell/gemma-3-270m-bigsmall", device="cuda") as loader:
    model = loader.load_model(AutoModelForCausalLM)
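To illustrate the layer-at-a-time idea in isolation, here is a toy sketch. It is not BigSmall's codec or loader. zlib stands in for the compression scheme, the "layers" are made-up byte strings, and all function names are hypothetical. It only shows why the uncompressed working buffer never needs to exceed one layer when layers are compressed independently.

```python
import zlib


def make_compressed_layers(num_layers=4, layer_bytes=1024):
    """Build toy 'layers' of raw bytes and compress each one independently."""
    layers = [bytes([i]) * layer_bytes for i in range(num_layers)]
    return layers, [zlib.compress(layer) for layer in layers]


def stream_layers(compressed_layers):
    """Yield one decompressed layer at a time (zlib is a stand-in codec)."""
    for blob in compressed_layers:
        yield zlib.decompress(blob)


def peak_uncompressed_bytes(compressed_layers):
    """Consume the stream and report the largest single buffer seen."""
    peak = 0
    for layer in stream_layers(compressed_layers):
        # A real loader would copy `layer` to the device here, then release it.
        peak = max(peak, len(layer))
    return peak


originals, blobs = make_compressed_layers()
total = sum(len(layer) for layer in originals)  # size if everything were expanded at once
peak = peak_uncompressed_bytes(blobs)           # size of the largest single layer
round_trip_ok = all(a == b for a, b in zip(originals, stream_layers(blobs)))
print(total, peak, round_trip_ok)  # prints: 4096 1024 True
```

In this toy setup the scratch buffer peaks at one layer (1024 bytes) rather than the full 4096 bytes, and every layer round-trips byte for byte.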

Your GPU	Models you can run
2 GB	GPT-2, Gemma 270M
4 GB	Llama 3.2 3B, Mistral 7B, Gemma 2B, Llama 3.1 8B
8 GB	Qwen 2.5 14B, Gemma 2 9B, Phi-3.5 Mini
12 GB	Qwen 2.5 32B, Gemma 3 12B
24 GB	Llama 70B, Qwen 72B, Gemma 3 27B, DeepSeek V4-Flash
CPU only	Everything -- slower but full quality

BigSmall is the only lossless compression tool with a streaming loader. DFloat11 and ZipNN load the full model into memory.

Why BigSmall vs DFloat11

	BigSmall	DFloat11
Inference overhead	None	~2x at batch=1
Hardware	CPU, Apple Silicon, AMD, any GPU	CUDA only
FP32 support	Yes	No
Fine-tuning safe	Yes	No
Streaming loader	Yes -- peak RAM < 2 GB	No

Compression stats

Original	Compressed	Ratio	Format	Verified
0.5 GB	0.33 GB	65%	BF16	MD5 of every tensor
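The "Verified" column refers to a per-tensor MD5 comparison. A minimal sketch of that kind of check is shown below, assuming each tensor is available as raw bytes keyed by name. The tensor names and helper functions are hypothetical, and this is not BigSmall's own verification code.

```python
import hashlib
import struct


def tensor_md5(raw):
    """MD5 hex digest of a tensor's raw bytes."""
    return hashlib.md5(raw).hexdigest()


def verify_tensors(original, restored):
    """Return the sorted names of tensors that are missing, extra, or differ."""
    mismatched = []
    for name, raw in original.items():
        other = restored.get(name)
        if other is None or tensor_md5(raw) != tensor_md5(other):
            mismatched.append(name)
    mismatched.extend(name for name in restored if name not in original)
    return sorted(mismatched)


# Hypothetical tensors, packed as little-endian float32 for illustration only.
original = {
    "embed.weight": struct.pack("<4f", 0.1, 0.2, 0.3, 0.4),
    "lm_head.weight": struct.pack("<2f", 1.0, -1.0),
}
restored = dict(original)  # a lossless round trip returns identical bytes

damaged = dict(original)
flipped = bytearray(damaged["embed.weight"])
flipped[0] ^= 1  # flip a single bit to simulate corruption
damaged["embed.weight"] = bytes(flipped)

clean_report = verify_tensors(original, restored)
damaged_report = verify_tensors(original, damaged)
```

Here clean_report is an empty list, while damaged_report names the one tensor with a flipped bit. A single-bit difference is enough to change the digest.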

GitHub: wpferrell/Bigsmall
All models: huggingface.co/wpferrell
