OLMo-3-7B-Think-VAC

A structurally compressed version of OLMo-3-7B-Think using Variable Allocation Compression (VAC).

This model has the same architecture as OLMo-3-7B-Think, but each linear layer is factorized into two smaller matrices, reducing both storage and inference FLOPs by roughly 1.8x.

| Property | Value |
| --- | --- |
| Base model | allenai/OLMo-3-7B-Think |
| Compression method | VAC (Variable Allocation Compression) |
| Compression ratio | 1.8x |
| Download size | ~8.9 GB (vs 14.6 GB original) |
| VRAM (bf16) | ~8.9 GB (fits 12 GB GPUs) |
| VRAM (INT8) | ~4.5 GB (fits 8 GB GPUs) |
| Inference speed | ~1.8x faster than original |
| C4 PPL | 26.97 (original: 21.05) |

Usage

Requires transformers and trust_remote_code=True:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# bf16: requires a 12+ GB GPU (RTX 3080 12 GB, RTX 4070, A10G, etc.)
model = AutoModelForCausalLM.from_pretrained(
    "asystemoffields/OLMo-3-7B-Think-VAC",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

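# INT8: requires an 8+ GB GPU (RTX 3060, 4060, etc.) and the bitsandbytes package
# from transformers import BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     "asystemoffields/OLMo-3-7B-Think-VAC",
#     trust_remote_code=True,
#     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
#     device_map="auto",
# )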

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-3-7B-Think")

messages = [{"role": "user", "content": "What is 38 + 47? Show your work."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # returns input_ids and attention_mask together
).to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=False))

The model generates <think>...reasoning...</think> before its answer, just like the original OLMo-3-7B-Think. Set max_new_tokens to at least 1024 for complete responses (the thinking block can be long).
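To use only the final answer, the reasoning block can be separated from the decoded text. The helper below is a minimal sketch, not part of this repo: `split_thinking` is a hypothetical name, and it assumes the tags appear literally as `<think>` and `</think>` in the decoded string.

```python
import re


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from decoded model output.

    Assumes the reasoning is wrapped in literal <think>...</think> tags.
    If no closed think block is found, reasoning is empty and the whole
    text is returned as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


demo = "<think>38 + 47 = 30 + 40 + 8 + 7 = 85</think>The answer is 85."
reasoning, answer = split_thinking(demo)
print(answer)
```

Because the usage example decodes with skip_special_tokens=False, the answer part may still carry the prompt or end-of-turn tokens; stripping those is left out of this sketch.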

What is VAC?

Variable Allocation Compression replaces each dense linear layer with two smaller factor matrices (down and up), where W ≈ up @ down. The rank of each factorization is allocated per matrix using Fisher information and a knapsack solver: important matrices get more rank, redundant ones get less.
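The savings from a factorization follow directly from the shapes: a dense d_out x d_in matrix stores d_out * d_in weights, while rank-r factors store r * (d_out + d_in). The sketch below works through that arithmetic. The 4096 x 4096 shape is an illustrative assumption, not a statement about this model's layers, and the function names are hypothetical.

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Weights in a dense d_out x d_in linear layer (bias ignored)."""
    return d_out * d_in


def factored_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights in up (d_out x rank) plus down (rank x d_in)."""
    return rank * (d_out + d_in)


def max_rank_for_ratio(d_out: int, d_in: int, ratio: float) -> int:
    """Largest rank whose factored size is at most dense size / ratio."""
    return int(d_out * d_in / (ratio * (d_out + d_in)))


# Illustrative shape only (an assumption, not taken from the model config).
d_out, d_in = 4096, 4096
for ratio in (1.8, 4.0):
    rank = max_rank_for_ratio(d_out, d_in, ratio)
    achieved = dense_params(d_out, d_in) / factored_params(d_out, d_in, rank)
    print(f"target {ratio}x -> rank {rank}, achieved {achieved:.3f}x")
```

The multiply-accumulate count per token scales the same way (d_out * d_in for the dense layer versus r * (d_out + d_in) for the two factors), which is why the parameter and FLOP reductions track each other for the factorized layers.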

The compression strategy was discovered by evolutionary search over compression order, Fisher scaling exponent, and per-component allocation. Key findings:

  • Middle-out compression order: compress easy middle layers first
  • Cube-root Fisher exponent: gentler than sqrt, avoids over-trusting the Fisher approximation
  • Attention-heavy allocation: attention tolerates 4x compression, while the MLP is far more sensitive and is compressed less
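The first two findings can be illustrated with a toy sketch. It is not the VAC implementation: the alternating middle-out order is one plausible reading of "middle layers first", and a simple proportional split of the parameter budget stands in for the knapsack solver. The matrix names, shapes, Fisher scores, and budget are all made up for illustration.

```python
def middle_out_order(num_layers: int) -> list[int]:
    """Layer indices starting at the middle and alternating outward."""
    if num_layers <= 0:
        return []
    mid = num_layers // 2
    order = [mid]
    for step in range(1, num_layers):
        for idx in (mid - step, mid + step):
            if 0 <= idx < num_layers:
                order.append(idx)
    return order


def allocate_ranks(matrices: dict, param_budget: int, exponent: float = 1 / 3) -> dict:
    """Split a parameter budget across matrices in proportion to fisher ** exponent.

    `matrices` maps name -> (d_out, d_in, fisher_score), with positive scores.
    This proportional split is a stand-in for a real knapsack solver.
    """
    weights = {name: fisher ** exponent for name, (_, _, fisher) in matrices.items()}
    total = sum(weights.values())
    ranks = {}
    for name, (d_out, d_in, _) in matrices.items():
        share = param_budget * weights[name] / total
        rank = int(share // (d_out + d_in))  # each unit of rank costs d_out + d_in weights
        ranks[name] = max(1, min(rank, min(d_out, d_in)))
    return ranks


# Hypothetical inputs: two equally sized matrices, one with 8x the Fisher score.
toy = {"A": (1024, 1024, 8.0), "B": (1024, 1024, 1.0)}
budget = 1_200_000
cube_root = allocate_ranks(toy, budget, exponent=1 / 3)
square_root = allocate_ranks(toy, budget, exponent=1 / 2)
print(middle_out_order(6))
print(cube_root, square_root)
```

With the cube-root exponent the gap in rank between the two matrices is smaller than with the square root, which is the sense in which it is "gentler": a noisy Fisher estimate moves the allocation less.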

How It Differs from Quantization

| | Quantization (GPTQ, AWQ) | VAC |
| --- | --- | --- |
| What it reduces | Bits per weight | Number of weights |
| FLOPs | Same as original | ~1.8x fewer |
| Inference speed | Same (or slight bandwidth win) | ~1.8x faster |
| Stacks with quant? | N/A | Yes (INT8 on factored weights) |

VAC and quantization are orthogonal. You can quantize the factored matrices for additional savings.
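As a rough check on how the two savings compound, the weight footprint can be scaled by bit width. This sketch uses only the sizes quoted in this card and assumes the footprint scales linearly with bits per weight; it ignores quantization metadata, activations, and the KV cache, so real VRAM use will be somewhat higher.

```python
ORIGINAL_BF16_GB = 14.6  # original model size quoted in this card
VAC_BF16_GB = 8.9        # VAC model size quoted in this card


def quantized_size_gb(bf16_size_gb: float, bits: int) -> float:
    """Scale a bf16 (16-bit) weight footprint to a different bit width."""
    return bf16_size_gb * bits / 16


vac_int8_gb = quantized_size_gb(VAC_BF16_GB, 8)
combined_ratio = ORIGINAL_BF16_GB / vac_int8_gb

print(f"VAC + INT8 weights: ~{vac_int8_gb:.2f} GB")
print(f"Reduction vs original bf16: ~{combined_ratio:.2f}x")
```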

Limitations

  • No GGUF/Ollama/LM Studio support. The factorized layer format is not supported by llama.cpp. This model runs via HuggingFace Transformers only.
  • Requires trust_remote_code=True, because the factorized layer class is defined in modeling_pmre_olmo.py, which ships with this repo.
  • ~16 GB system RAM required for loading (model loads to CPU first, then moves to GPU).
  • ~6 PPL gap from the original on C4 (26.97 vs 21.05, about 28% higher). In interactive use the difference is generally hard to notice, but it may show up on precise benchmarks.

Method Details

  • Compression: Sequential Fisher-weighted SVD with evolved middle-out order and cube-root exponent
  • Recovery: Knowledge distillation on DOLMA (OLMo's training data) with 20% Think-completion interleave
  • Post-training: Dolci-Think-SFT replay (instruction tuning with <think> traces)
  • Attention tuning: Differential learning rate KD (attention at 10x higher LR than MLP) to recover routing quality
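The differential learning rate in the last step can be expressed as optimizer parameter groups. This is a minimal sketch rather than the training code used for this model: `build_param_groups` is a hypothetical helper, and matching attention parameters by the substring "self_attn" is an assumption about how the modules are named.

```python
def build_param_groups(named_params, base_lr, attn_multiplier=10.0, attn_marker="self_attn"):
    """Split parameters into MLP/other and attention groups with different LRs.

    `named_params` is an iterable of (name, parameter) pairs, such as the
    output of model.named_parameters(). The "self_attn" marker is an
    assumption about module naming, not something confirmed by this repo.
    """
    attn, other = [], []
    for name, param in named_params:
        (attn if attn_marker in name else other).append(param)
    return [
        {"params": other, "lr": base_lr},
        {"params": attn, "lr": base_lr * attn_multiplier},
    ]


# Placeholder strings stand in for real tensors so the sketch runs without torch.
fake_params = [
    ("model.layers.0.self_attn.q_proj.up.weight", "p0"),
    ("model.layers.0.mlp.up_proj.down.weight", "p1"),
]
groups = build_param_groups(fake_params, base_lr=1e-5)
for group in groups:
    print(group["lr"], group["params"])
```

With a real model, the returned list can be passed to a PyTorch optimizer such as torch.optim.AdamW, which accepts a list of dicts with "params" and "lr" keys.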

Full technical details: github.com/asystemoffields/v-a-c

Acknowledgments

  • Allen AI for OLMo-3-7B-Think and their commitment to open science: full training data (DOLMA), post-training data (Dolci), evaluation infrastructure (OLMES), and every intermediate checkpoint published openly.
  • Method: VAC (Variable Allocation Compression)

License

Apache 2.0 (same as the base model).
