Gemma 3 270M (BigSmall compressed)

0.5 GB -> 0.33 GB (BF16). Lossless. Zero inference overhead. Any hardware.

Compressed with BigSmall -- decompresses once at load time, runs at full native speed. Every weight is bit-identical to the original.

Quick start

pip install bigsmall

Version compatibility: Models compressed with bigsmall 2.4.0+ may use container format v2 for high-kurtosis tensors and require bigsmall >= 2.4.0 to decompress. Run pip install --upgrade bigsmall to update.
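
The version requirement above can be checked before loading. The following is a minimal sketch, assuming the installed distribution is named bigsmall (as in the pip command) and that its version string starts with dotted integers. The helper names parse_version, meets_minimum, and check_bigsmall are hypothetical and not part of BigSmall.

```python
from importlib import metadata

MIN_VERSION = "2.4.0"  # minimum stated in the compatibility note


def parse_version(text):
    """Turn '2.4.0' into (2, 4, 0). Suffixes such as 'rc1' are ignored (a simplification)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '2.4' compares equal to '2.4.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_VERSION):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_bigsmall():
    """Describe whether the installed bigsmall satisfies the stated minimum."""
    try:
        installed = metadata.version("bigsmall")
    except metadata.PackageNotFoundError:
        return "bigsmall is not installed; run: pip install bigsmall"
    if meets_minimum(installed):
        return f"bigsmall {installed} is new enough"
    return f"bigsmall {installed} is too old; run: pip install --upgrade bigsmall"


print(check_bigsmall())
```

Comparing integer tuples rather than raw strings avoids ranking "2.10.0" below "2.4.0".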

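import bigsmall
# Install the BigSmall hook before calling from_pretrained.
bigsmall.install_hook()
from transformers import AutoModelForCausalLM
# Load the compressed checkpoint from the Hub.
model = AutoModelForCausalLM.from_pretrained("wpferrell/gemma-3-270m-bigsmall")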

Streaming loader -- run on any hardware

BigSmall's streaming loader decompresses one layer at a time directly into VRAM. Peak memory is one layer -- not the whole model. A 4 GB GPU can run Mistral 7B losslessly.

from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM

with StreamingLoader("wpferrell/gemma-3-270m-bigsmall", device="cuda") as loader:
    model = loader.load_model(AutoModelForCausalLM)

| Your GPU | Models you can run |
| --- | --- |
| 2 GB | GPT-2, Gemma 270M |
| 4 GB | Llama 3.2 3B, Mistral 7B, Gemma 2B, Llama 3.1 8B |
| 8 GB | Qwen 2.5 14B, Gemma 2 9B, Phi-3.5 Mini |
| 12 GB | Qwen 2.5 32B, Gemma 3 12B |
| 24 GB | Llama 70B, Qwen 72B, Gemma 3 27B, DeepSeek V4-Flash |
| CPU only | Everything -- slower but full quality |

BigSmall is the only lossless compression tool with a streaming loader. DFloat11 and ZipNN load the full model into memory.
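
The layer-at-a-time idea behind a streaming loader can be illustrated with a toy sketch. It uses zlib as a stand-in codec and a plain dict as the "device", so it does not show BigSmall's format, codec, or API. The names compress_layers, stream_layers, and load_streaming are hypothetical. The point is only that the decompression staging buffer never holds more than one layer.

```python
import zlib


def compress_layers(layers):
    """Compress each layer's bytes independently so each can be restored on its own."""
    return [zlib.compress(raw) for raw in layers]


def stream_layers(compressed_layers):
    """Yield decompressed layers one by one; only one is staged at any moment."""
    for blob in compressed_layers:
        yield zlib.decompress(blob)


def load_streaming(compressed_layers, place_on_device):
    """Hand each restored layer to the device and return the peak staging size in bytes."""
    peak = 0
    for index, raw in enumerate(stream_layers(compressed_layers)):
        peak = max(peak, len(raw))
        place_on_device(index, raw)
    return peak


# Toy "model": four layers of fake weight bytes (1000, 2000, 3000, 4000 bytes).
layers = [bytes([i]) * 1000 * (i + 1) for i in range(4)]
blobs = compress_layers(layers)

device = {}  # stand-in for device memory
peak = load_streaming(blobs, device.__setitem__)
total = sum(len(raw) for raw in layers)

print(f"peak staging buffer: {peak} bytes, full model: {total} bytes")
# prints: peak staging buffer: 4000 bytes, full model: 10000 bytes
```

In this sketch the staging buffer peaks at the largest layer (4000 bytes) rather than the full model (10000 bytes), and every restored layer is byte-identical to its input because zlib is lossless.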

Why BigSmall vs DFloat11

| Feature | BigSmall | DFloat11 |
| --- | --- | --- |
| Inference overhead | None | ~2x at batch=1 |
| Hardware | CPU, Apple Silicon, AMD, any GPU | CUDA only |
| FP32 support | Yes | No |
| Fine-tuning safe | Yes | No |
| Streaming loader | Yes -- peak RAM < 2 GB | No |

Compression stats

| Original | Compressed | Ratio | Format | Verified |
| --- | --- | --- | --- | --- |
| 0.5 GB | 0.33 GB | 65% | BF16 | md5 every tensor |
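
A per-tensor MD5 check of the kind listed under "Verified" can be sketched as follows. This illustrates the idea and is not BigSmall's own verification code: it assumes each tensor is available as raw bytes keyed by name, and the helpers tensor_md5 and find_mismatches are hypothetical.

```python
import hashlib


def tensor_md5(raw):
    """MD5 hex digest of one tensor's raw bytes."""
    return hashlib.md5(raw).hexdigest()


def find_mismatches(original, restored):
    """Compare two {name: raw_bytes} mappings; return names that differ or are missing."""
    mismatched = []
    for name in sorted(set(original) | set(restored)):
        a = original.get(name)
        b = restored.get(name)
        if a is None or b is None or tensor_md5(a) != tensor_md5(b):
            mismatched.append(name)
    return mismatched


# Toy data standing in for BF16 weight bytes.
original = {"embed.weight": b"\x00\x3f\x80\x3f", "norm.weight": b"\x80\x3f\x80\x3f"}
restored = dict(original)
corrupted = {**original, "norm.weight": b"\x80\x3f\x80\x3e"}

print(find_mismatches(original, restored))   # []
print(find_mismatches(original, corrupted))  # ['norm.weight']
```

An empty list means every tensor hashed identically. With PyTorch, one way to obtain the raw bytes of a CPU tensor t is t.contiguous().view(torch.uint8).numpy().tobytes(); that is an assumption about the reader's setup, not a BigSmall call.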