OLMoE-1B-7B-0125 — QTIP 2-Bit (W2A16)

2-bit weight-only quantization of allenai/OLMoE-1B-7B-0125 via per-expert trellis-coded quantization (QTIP BlockLDLQ with per-expert Hessian calibration).

Key Numbers

| Metric | Value |
|---|---|
| Model size on disk | 2.47 GB |
| GPU VRAM (including KV cache) | 2.7 GB |
| Generation speed | 13 tok/s (RTX 4080 Laptop) |
| Prompt processing | 32 tok/s |
| WikiText-2 PPL | 9.09 (fp16: 6.65, ratio 1.367x) |
| C4 PPL | 14.16 (fp16: 12.24, ratio 1.157x) |
| HellaSwag acc_norm | 71.15% (fp16: 78.26%, retention 90.9%) |
| PIQA acc_norm | 77.97% (fp16: 79.71%, retention 97.8%) |
| ARC-Challenge acc_norm | 44.28% (fp16: 49.06%, retention 90.3%) |
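
The ratio and retention figures above are plain quotients of the quantized and fp16 scores. As a quick sanity check, the sketch below recomputes them from the numbers in the table; the function names are illustrative and not part of any released tooling.

```python
# (name, quantized, fp16) values copied from the Key Numbers table.
PERPLEXITY = [("WikiText-2", 9.09, 6.65), ("C4", 14.16, 12.24)]
ACCURACY = [
    ("HellaSwag", 71.15, 78.26),
    ("PIQA", 77.97, 79.71),
    ("ARC-Challenge", 44.28, 49.06),
]


def ppl_ratio(quantized: float, fp16: float) -> float:
    """Perplexity ratio: values above 1.0 mean the quantized model is worse."""
    return quantized / fp16


def retention_pct(quantized: float, fp16: float) -> float:
    """Share of the fp16 accuracy kept after quantization, in percent."""
    return 100.0 * quantized / fp16


for name, q, f in PERPLEXITY:
    print(f"{name}: PPL ratio {ppl_ratio(q, f):.3f}x")
for name, q, f in ACCURACY:
    print(f"{name}: retention {retention_pct(q, f):.1f}%")
```

Lower is better for the perplexity ratio and higher is better for retention, so the two columns should not be compared directly.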

What This Is

A Mixture-of-Experts model with 7 billion total parameters (about 1 billion active per token), compressed to roughly 2 bits per weight (~2.125 including trellis overhead) using QTIP's trellis-coded quantization with routing-conditioned per-expert Hessian calibration. At 2.7 GB of GPU memory including the KV cache, the model fits entirely on devices with as little as 4 GB of VRAM and generates at 13 tok/s on an RTX 4080 Laptop GPU.

This is a base model (not instruction-tuned). It performs text completion, not chat.

How to Run

Requires our llama.cpp fork with QTIP 2-bit support. Important: use the qtip-olmoe-2bit branch.

With CUDA (GPU inference):

git clone -b qtip-olmoe-2bit https://github.com/Venugopalan2610/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
  -p "Mixture of Experts models use a routing mechanism to" \
  -n 100 --temp 0.7 --repeat-penalty 1.1

CPU-only (no CUDA required):

git clone -b qtip-olmoe-2bit https://github.com/Venugopalan2610/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
  -p "Your prompt here" -n 100

Expert offload (ultra-low-VRAM devices, ~4 tok/s):

./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
  --qtip-expert-offload \
  -p "Your prompt here" -n 100
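
For a rough sense of what the quoted rates mean for the `-n 100` runs above, the sketch below converts tokens per second into generation time. It assumes a steady decode rate and ignores model loading and prompt processing, so treat the results as lower bounds; the helper name is hypothetical.

```python
# Decode rates quoted in this card, in tokens per second.
GPU_TOK_PER_S = 13.0      # full GPU inference (RTX 4080 Laptop)
OFFLOAD_TOK_PER_S = 4.0   # expert offload mode (approximate)


def generation_seconds(new_tokens: int, tok_per_s: float) -> float:
    """Time to decode new_tokens at a constant rate (assumed steady state)."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return new_tokens / tok_per_s


for label, rate in [("GPU", GPU_TOK_PER_S), ("expert offload", OFFLOAD_TOK_PER_S)]:
    print(f"{label}: {generation_seconds(100, rate):.1f} s for 100 tokens")
```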

Method

We collect routing-conditioned per-expert input Hessians from only the tokens each expert actually receives during a calibration pass, producing 2048 distinct expert Hessians for the full model. These feed into an unmodified QTIP BlockLDLQ pipeline (HYB bitshift code, L=16, V=2, Tx=Ty=16, Q=9) with random Hadamard transform preprocessing. No LUT fine-tuning, no codebook modifications.
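
To make "routing-conditioned" concrete, here is a minimal pure-Python sketch of the accumulation step: each expert's Hessian sums the outer products x xᵀ of only the tokens routed to it. This is an illustration under simplifying assumptions (unnormalized sums, tiny dimensions, no damping), not the calibration code used for this model, and `routed_hessians` is a hypothetical name. The final lines show one decomposition consistent with the 2048 figure, assuming OLMoE's published layout of 16 layers and 64 experts per layer, and assuming two distinct input Hessians per expert (one shared by the gate and up projections, one for the down projection).

```python
def routed_hessians(tokens, routes, num_experts):
    """Accumulate H_e = sum of x x^T over the tokens routed to expert e.

    tokens: list of input vectors (lists of floats), one per token.
    routes: list of expert-index lists, one per token (the router's picks).
    Returns (hessians, counts), where counts[e] is the number of tokens seen.
    """
    dim = len(tokens[0])
    hessians = [[[0.0] * dim for _ in range(dim)] for _ in range(num_experts)]
    counts = [0] * num_experts
    for x, experts in zip(tokens, routes):
        for e in experts:
            counts[e] += 1
            for i in range(dim):
                row = hessians[e][i]
                for j in range(dim):
                    row[j] += x[i] * x[j]
    return hessians, counts


# Toy example: 3 tokens, 2 experts, the last token is routed to both.
tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
routes = [[0], [1], [0, 1]]
hessians, counts = routed_hessians(tokens, routes, num_experts=2)
print(hessians[0], counts[0])
print(hessians[1], counts[1])

# Assumed layout (not stated in this card): 16 layers x 64 experts x 2 inputs.
LAYERS, EXPERTS_PER_LAYER, INPUTS_PER_EXPERT = 16, 64, 2
print(LAYERS * EXPERTS_PER_LAYER * INPUTS_PER_EXPERT)
```

An expert that receives few tokens gets a Hessian built from few samples, which is why the size of the calibration set matters more for routed experts than for the shared attention weights.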

Quantization Details

  • Quantization method: QTIP BlockLDLQ with per-expert Hessian calibration
  • Bits per weight: ~2.125 (2 bits + trellis overhead)
  • Calibration data: 2048 sequences x 1024 tokens from C4 English train
  • Attention quantization: Same 2-bit method, shared Hessian (not routing-conditioned)
  • Router and embeddings: Kept in f32
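
As a back-of-the-envelope check on the calibration budget, the sketch below computes the total number of calibration tokens and the average number each expert would see per layer. The routing figures (64 experts per layer, 8 active per token) are assumptions taken from OLMoE's published configuration rather than from this card, and uniform routing is an idealization; real routers are skewed, so some experts see far fewer tokens.

```python
SEQUENCES = 2048            # from the calibration data bullet
TOKENS_PER_SEQUENCE = 1024  # from the calibration data bullet
EXPERTS_PER_LAYER = 64      # assumption: OLMoE published configuration
ACTIVE_EXPERTS = 8          # assumption: top-8 routing per token


def calibration_tokens(sequences: int, tokens_per_sequence: int) -> int:
    """Total tokens in the calibration set."""
    return sequences * tokens_per_sequence


def mean_tokens_per_expert(total_tokens: int, experts: int, active: int) -> float:
    """Average tokens per expert in one layer, assuming uniform routing."""
    return total_tokens * active / experts


total = calibration_tokens(SEQUENCES, TOKENS_PER_SEQUENCE)
per_expert = mean_tokens_per_expert(total, EXPERTS_PER_LAYER, ACTIVE_EXPERTS)
print(f"total calibration tokens: {total:,}")
print(f"mean tokens per expert per layer: {per_expert:,.0f}")
```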

Limitations

  • This is a base model. For chat/instruction-following, use an instruction-tuned variant (not yet available at 2-bit).
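  • Generation quality is measurably degraded compared to fp16, most visibly on reasoning and commonsense benchmarks: the model retains 90.3% of fp16 accuracy on ARC-Challenge and 90.9% on HellaSwag, and WikiText-2 perplexity rises by a factor of 1.367.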
  • Expert offload mode runs at ~4 tok/s due to CPU-GPU transfer overhead.

Citation

A technical report is forthcoming on arXiv.

License

Apache 2.0 (same as the base OLMoE model).

Acknowledgments

Built on QTIP (Tseng et al., NeurIPS 2024) and OLMoE (Muennighoff et al., 2024).
