OLMoE-1B-7B-0125 — QTIP 2-Bit (W2A16)

2-bit weight-only quantization of allenai/OLMoE-1B-7B-0125 via per-expert trellis-coded quantization (QTIP BlockLDLQ with per-expert Hessian calibration).

Key Numbers

Metric	Value
Model size on disk	2.47 GB
GPU VRAM (including KV cache)	2.7 GB
Generation speed	13 tok/s (RTX 4080 Laptop)
Prompt processing	32 tok/s
WikiText-2 PPL	9.09 (fp16: 6.65, ratio 1.367x)
C4 PPL	14.16 (fp16: 12.24, ratio 1.157x)
HellaSwag acc_norm	71.15% (fp16: 78.26%, retention 90.9%)
PIQA acc_norm	77.97% (fp16: 79.71%, retention 97.8%)
ARC-Challenge acc_norm	44.28% (fp16: 49.06%, retention 90.3%)
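
The ratio and retention figures in the table can be recomputed from the raw scores. The sketch below uses only the values listed above; the helper names `ppl_ratio` and `retention` are illustrative and not part of any released tooling.

```python
# Values copied from the Key Numbers table; helper names are illustrative.
PERPLEXITY = {  # dataset: (2-bit PPL, fp16 PPL)
    "WikiText-2": (9.09, 6.65),
    "C4": (14.16, 12.24),
}
ACCURACY = {  # task: (2-bit acc_norm %, fp16 acc_norm %)
    "HellaSwag": (71.15, 78.26),
    "PIQA": (77.97, 79.71),
    "ARC-Challenge": (44.28, 49.06),
}


def ppl_ratio(quantized, fp16):
    """Perplexity ratio; 1.0 would mean no degradation, higher is worse."""
    return quantized / fp16


def retention(quantized, fp16):
    """Share of the fp16 accuracy kept after quantization, in percent."""
    return 100.0 * quantized / fp16


for name, (q, f) in PERPLEXITY.items():
    print(f"{name}: PPL ratio {ppl_ratio(q, f):.3f}x")
for name, (q, f) in ACCURACY.items():
    print(f"{name}: retention {retention(q, f):.1f}%")
```

Perplexity is lower-is-better, so it is reported as a ratio above 1, while accuracy is reported as the fraction of fp16 performance retained.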

What This Is

A 7-billion-parameter Mixture-of-Experts model compressed to about 2 bits per weight using QTIP's trellis-coded quantization with routing-conditioned per-expert Hessian calibration. It needs about 2.7 GB of VRAM including the KV cache, so it fits entirely in GPU memory on devices with as little as 4 GB of VRAM, and it generates at 13 tok/s on an RTX 4080 Laptop GPU.

This is a base model (not instruction-tuned). It performs text completion, not chat.

How to Run

Running this model requires our llama.cpp fork with QTIP 2-bit support. Important: use the qtip-olmoe-2bit branch.

With CUDA (GPU inference):

# Clone the fork on the qtip-olmoe-2bit branch and build with CUDA enabled.
git clone -b qtip-olmoe-2bit https://github.com/Venugopalan2610/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Generate 100 tokens; -m must point to the downloaded olmoe-qtip-2b-v2.gguf.
./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
  -p "Mixture of Experts models use a routing mechanism to" \
  -n 100 --temp 0.7 --repeat-penalty 1.1

CPU-only (no CUDA required):

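# Clone the fork on the qtip-olmoe-2bit branch and build without CUDA.
git clone -b qtip-olmoe-2bit https://github.com/Venugopalan2610/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Generate 100 tokens; -m must point to the downloaded olmoe-qtip-2b-v2.gguf.
./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
  -p "Your prompt here" -n 100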

Expert offload (ultra-low-VRAM devices, ~4 tok/s):

# --qtip-expert-offload trades speed (~4 tok/s) for lower VRAM use.
./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
  --qtip-expert-offload \
  -p "Your prompt here" -n 100
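
For scripted use, the same invocations can be assembled from Python. This is a minimal sketch that uses only the flags shown in the commands above; the wrapper functions `build_completion_command` and `run_completion` are hypothetical helpers and not part of the fork, and the default binary path assumes the script runs from the llama.cpp checkout.

```python
import shlex
import subprocess


def build_completion_command(model_path, prompt, n_tokens=100,
                             temp=None, repeat_penalty=None,
                             expert_offload=False,
                             binary="./build/bin/llama-completion"):
    """Assemble the argument list for the fork's llama-completion binary."""
    cmd = [binary, "-m", model_path]
    if expert_offload:
        # Flag taken from the expert offload command in this card.
        cmd.append("--qtip-expert-offload")
    cmd += ["-p", prompt, "-n", str(n_tokens)]
    if temp is not None:
        cmd += ["--temp", str(temp)]
    if repeat_penalty is not None:
        cmd += ["--repeat-penalty", str(repeat_penalty)]
    return cmd


def run_completion(**kwargs):
    """Run the binary and return its standard output as text."""
    result = subprocess.run(build_completion_command(**kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_completion(...) once the build exists.
    cmd = build_completion_command(
        "olmoe-qtip-2b-v2.gguf",
        "Mixture of Experts models use a routing mechanism to",
        temp=0.7, repeat_penalty=1.1,
    )
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain quotes or spaces.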

Method

We collect routing-conditioned per-expert input Hessians using only the tokens each expert actually receives during a calibration pass, which yields 2048 distinct expert Hessians for the full model. These feed into an unmodified QTIP BlockLDLQ pipeline (HYB bitshift code, L=16, V=2, Tx=Ty=16, Q=9) with random Hadamard transform preprocessing. We apply no LUT fine-tuning and no codebook modifications.
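
The routing-conditioned collection step can be illustrated with a toy example. This is a sketch of the general idea only, not the calibration code used for this model: it assumes the usual proxy Hessian (a sum of outer products x xᵀ of layer inputs), uses random scores as a stand-in for the router, and picks small illustrative dimensions. All function and variable names are hypothetical.

```python
import random


def top_k_experts(scores, k):
    """Indices of the k largest router scores (toy stand-in for the MoE router)."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def accumulate_expert_hessians(activations, routes, num_experts):
    """Sum x x^T per expert over only the tokens routed to that expert."""
    dim = len(activations[0])
    hessians = [[[0.0] * dim for _ in range(dim)] for _ in range(num_experts)]
    counts = [0] * num_experts
    for x, experts in zip(activations, routes):
        for e in experts:
            counts[e] += 1
            for i in range(dim):
                for j in range(dim):
                    hessians[e][i][j] += x[i] * x[j]
    return hessians, counts


# Toy sizes, chosen for illustration only (not the model's real dimensions).
rng = random.Random(0)
NUM_TOKENS, DIM, NUM_EXPERTS, TOP_K = 200, 4, 8, 2
activations = [[rng.gauss(0.0, 1.0) for _ in range(DIM)]
               for _ in range(NUM_TOKENS)]
routes = [top_k_experts([rng.random() for _ in range(NUM_EXPERTS)], TOP_K)
          for _ in range(NUM_TOKENS)]

hessians, counts = accumulate_expert_hessians(activations, routes, NUM_EXPERTS)
print("tokens seen per expert:", counts)
```

Each expert's matrix reflects only the inputs it was actually routed, in contrast to a single Hessian shared across all experts of a layer. Whether the sums are later normalized by the token counts is not specified here, so the sketch returns both.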

Quantization Details

Quantization method: QTIP BlockLDLQ with per-expert Hessian calibration
Bits per weight: ~2.125 (2 bits + trellis overhead)
Calibration data: 2048 sequences x 1024 tokens from C4 English train
Attention quantization: Same 2-bit method, shared Hessian (not routing-conditioned)
Router and embeddings: Kept in f32
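
The calibration figures above imply a token budget that can be worked out directly. The sketch below takes the sequence counts from the list; the architecture values (16 layers, 64 experts per layer, 8 active experts per token) and the two-Hessians-per-expert accounting are assumptions about OLMoE-1B-7B that this card does not state. Under those assumptions the count happens to match the 2048 expert Hessians quoted in the Method section, but the actual breakdown may differ.

```python
# Figures from the Quantization Details list.
NUM_SEQUENCES = 2048
TOKENS_PER_SEQUENCE = 1024

# Assumed OLMoE-1B-7B architecture values (not stated in this card).
NUM_LAYERS = 16
EXPERTS_PER_LAYER = 64
ACTIVE_EXPERTS_PER_TOKEN = 8
# Assumption: each expert needs two input Hessians, one shared by its
# gate/up projections and one for its down projection.
HESSIANS_PER_EXPERT = 2

calibration_tokens = NUM_SEQUENCES * TOKENS_PER_SEQUENCE
expert_hessians = NUM_LAYERS * EXPERTS_PER_LAYER * HESSIANS_PER_EXPERT
# Average tokens reaching one expert if routing were perfectly balanced.
avg_tokens_per_expert = (calibration_tokens * ACTIVE_EXPERTS_PER_TOKEN
                         // EXPERTS_PER_LAYER)

print("calibration tokens:", calibration_tokens)
print("expert Hessians:", expert_hessians)
print("avg tokens per expert (balanced routing):", avg_tokens_per_expert)
```

Real routing is rarely perfectly balanced, so some experts will see fewer calibration tokens than this average.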

Limitations

This is a base model. For chat/instruction-following, use an instruction-tuned variant (not yet available at 2-bit).
Generation quality is noticeably degraded compared to fp16 on complex reasoning tasks (see ARC-Challenge retention of 90.3%).
Expert offload mode runs at ~4 tok/s due to CPU-GPU transfer overhead.

Citation

Technical report forthcoming on arXiv.

License

Apache 2.0 (same as the base OLMoE model).

Acknowledgments

Built on QTIP (Tseng et al., NeurIPS 2024) and OLMoE (Muennighoff et al., 2024).

Papers

OLMoE: Open Mixture-of-Experts Language Models. arXiv:2409.02060, published Sep 3, 2024.

QTIP: Quantization with Trellises and Incoherence Processing. arXiv:2406.11235, published Jun 17, 2024.