OLMoE-1B-7B-0125 — QTIP 2-Bit (W2A16)
2-bit weight-only quantization of allenai/OLMoE-1B-7B-0125 via per-expert trellis-coded quantization (QTIP BlockLDLQ with per-expert Hessian calibration).
Key Numbers
| Metric | Value |
|---|---|
| Model size on disk | 2.47 GB |
| GPU VRAM (including KV cache) | 2.7 GB |
| Generation speed | 13 tok/s (RTX 4080 Laptop) |
| Prompt processing | 32 tok/s |
| WikiText-2 PPL | 9.09 (fp16: 6.65, ratio 1.367x) |
| C4 PPL | 14.16 (fp16: 12.24, ratio 1.157x) |
| HellaSwag acc_norm | 71.15% (fp16: 78.26%, retention 90.9%) |
| PIQA acc_norm | 77.97% (fp16: 79.71%, retention 97.8%) |
| ARC-Challenge acc_norm | 44.28% (fp16: 49.06%, retention 90.3%) |
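The ratio and retention columns can be reproduced from the raw scores. The following is a minimal sketch that recomputes them from the numbers in the table; the function and dictionary names are illustrative and not part of any released tooling.

```python
# (quantized, fp16) pairs copied from the Key Numbers table.
PERPLEXITY = {"WikiText-2": (9.09, 6.65), "C4": (14.16, 12.24)}
ACCURACY = {
    "HellaSwag": (71.15, 78.26),
    "PIQA": (77.97, 79.71),
    "ARC-Challenge": (44.28, 49.06),
}


def ppl_ratio(quantized, fp16):
    """Perplexity ratio; values above 1.0 mean the quantized model is worse."""
    return quantized / fp16


def retention(quantized, fp16):
    """Percentage of the fp16 accuracy that the quantized model keeps."""
    return 100.0 * quantized / fp16


for name, (q, ref) in PERPLEXITY.items():
    print(f"{name}: ratio {ppl_ratio(q, ref):.3f}x")
for name, (q, ref) in ACCURACY.items():
    print(f"{name}: retention {retention(q, ref):.1f}%")
```

Lower is better for perplexity, so the ratio is quantized over fp16; higher is better for accuracy, so retention uses the same quotient expressed as a percentage.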
What This Is
A 7-billion-parameter Mixture-of-Experts model compressed to roughly 2 bits per weight using QTIP's trellis-coded quantization with routing-conditioned per-expert Hessian calibration. At 2.7 GB of VRAM including the KV cache, the model fits entirely in GPU memory on devices with as little as 4 GB VRAM and generates at 13 tok/s on an RTX 4080 Laptop GPU.
This is a base model (not instruction-tuned). It performs text completion, not chat.
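For a rough sense of scale, the sketch below estimates weight storage at fp16 and at the ~2.125 bits per weight listed under Quantization Details. It assumes exactly 7e9 weights (taking "7-billion-parameter" literally) and treats every weight as quantized, so it is a back-of-the-envelope figure rather than a measured one.

```python
def weights_size_gb(num_weights, bits_per_weight):
    """Storage for num_weights values at bits_per_weight, in decimal gigabytes."""
    return num_weights * bits_per_weight / 8 / 1e9


NUM_WEIGHTS = 7e9  # assumption: "7-billion-parameter" taken literally

fp16_gb = weights_size_gb(NUM_WEIGHTS, 16)
qtip_gb = weights_size_gb(NUM_WEIGHTS, 2.125)
print(f"fp16: {fp16_gb:.2f} GB, 2.125 bpw: {qtip_gb:.2f} GB")
```

The reported 2.47 GB on disk is larger than this estimate. Tensors kept at higher precision, such as the f32 router and embeddings, would account for at least part of the gap, though the exact breakdown is not given here.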
How to Run
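Requires our llama.cpp fork with QTIP 2-bit support. Important: use the qtip-olmoe-2bit branch. The commands below assume olmoe-qtip-2b-v2.gguf has been downloaded from this repository into the llama.cpp directory.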
With CUDA (GPU inference):
git clone -b qtip-olmoe-2bit https://github.com/Venugopalan2610/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
-p "Mixture of Experts models use a routing mechanism to" \
-n 100 --temp 0.7 --repeat-penalty 1.1
CPU-only (no CUDA required):
git clone -b qtip-olmoe-2bit https://github.com/Venugopalan2610/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
-p "Your prompt here" -n 100
Expert offload (ultra-low-VRAM devices, ~4 tok/s):
./build/bin/llama-completion -m olmoe-qtip-2b-v2.gguf \
--qtip-expert-offload \
-p "Your prompt here" -n 100
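The commands above can also be driven from Python. The sketch below is one possible wrapper, not part of the fork: `fetch_model`, `build_command`, and `run_completion` are hypothetical helper names, the download step assumes the third-party `huggingface_hub` package is installed, and only flags that appear in the commands above are used.

```python
import subprocess

REPO_ID = "Venugopalan2610/OLMoE-1B-7B-0125-QTIP-2bit"
GGUF_NAME = "olmoe-qtip-2b-v2.gguf"


def fetch_model():
    """Download the GGUF from the Hub and return its local cache path."""
    # Imported lazily so the rest of the module works without the package.
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub
    return hf_hub_download(repo_id=REPO_ID, filename=GGUF_NAME)


def build_command(model_path, prompt, n_tokens=100, expert_offload=False,
                  binary="./build/bin/llama-completion"):
    """Assemble the llama-completion argument list used in the examples above."""
    cmd = [binary, "-m", model_path]
    if expert_offload:
        # Flag provided by the qtip-olmoe-2bit branch for low-VRAM devices.
        cmd.append("--qtip-expert-offload")
    cmd += ["-p", prompt, "-n", str(n_tokens)]
    return cmd


def run_completion(prompt, **kwargs):
    """Download the model if needed, then run one completion from the fork's directory."""
    return subprocess.run(build_command(fetch_model(), prompt, **kwargs), check=True)
```

Calling `run_completion("Your prompt here")` from inside the built llama.cpp checkout would run the same command as the CPU-only example, with the model path pointing at the Hub cache instead of the working directory.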
Method
We collect routing-conditioned per-expert input Hessians: each expert's Hessian is accumulated only from the tokens that expert actually receives during a calibration pass, which yields 2048 distinct expert Hessians for the full model. These feed into an unmodified QTIP BlockLDLQ pipeline (HYB bitshift code, L=16, V=2, Tx=Ty=16, Q=9) with random Hadamard transform preprocessing. We apply no LUT fine-tuning and no codebook modifications.
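The routing-conditioned collection step can be illustrated with a minimal NumPy sketch. It is not the calibration code used for this model: the function names, the plain top-k routing, and the choice to return unnormalized sums with per-expert token counts are all assumptions made for illustration.

```python
import numpy as np


def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts per token, shape (tokens, k)."""
    return np.argsort(-router_scores, axis=1)[:, :k]


def accumulate_expert_hessians(x, expert_idx, num_experts):
    """Sum x xT per expert over only the tokens routed to that expert.

    x:          (tokens, d) inputs to the MoE layer
    expert_idx: (tokens, k) expert ids chosen for each token
    Returns (num_experts, d, d) unnormalized Hessians and per-expert token counts.
    """
    d = x.shape[1]
    hessians = np.zeros((num_experts, d, d), dtype=np.float64)
    counts = np.zeros(num_experts, dtype=np.int64)
    for e in range(num_experts):
        routed = (expert_idx == e).any(axis=1)  # tokens that reached expert e
        xe = x[routed]
        hessians[e] += xe.T @ xe
        counts[e] += xe.shape[0]
    return hessians, counts


# Toy example: 32 tokens, hidden size 4, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))
idx = top_k_experts(rng.standard_normal((32, 8)), k=2)
hessians, counts = accumulate_expert_hessians(x, idx, num_experts=8)
print(counts)
```

Each expert's Hessian reflects only the inputs it actually sees, so experts that receive different token distributions get different calibration statistics. This is the contrast with the shared Hessian used for attention.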
Quantization Details
- Quantization method: QTIP BlockLDLQ with per-expert Hessian calibration
- Bits per weight: ~2.125 (2 bits + trellis overhead)
- Calibration data: 2048 sequences × 1024 tokens from C4 English train
- Attention quantization: Same 2-bit method, shared Hessian (not routing-conditioned)
- Router and embeddings: Kept in f32
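To show where the 2-bit base rate comes from, the sketch below walks a generic bitshift trellis with L=16 and V=2 at k=2 bits per weight: every step shifts k·V = 4 fresh bits into a 16-bit state, and each state would decode to V = 2 weights. This is an illustrative walk only, not the fork's kernel. It assumes MSB-first bit order, omits the HYB codebook lookup that maps a state to weight values, and does not model the ~0.125 bits of overhead.

```python
import random


def bitshift_states(bits, L=16, k=2, V=2):
    """Return the sequence of L-bit trellis states for a list of 0/1 bits.

    The first state is the first L bits; each later state drops the oldest
    k*V bits and shifts in the next k*V bits from the stream.
    """
    step = k * V
    if len(bits) < L:
        raise ValueError("need at least L bits for the first state")
    mask = (1 << L) - 1
    state = 0
    for b in bits[:L]:
        state = (state << 1) | b
    states = [state]
    for i in range(L, len(bits) - step + 1, step):
        for b in bits[i:i + step]:
            state = ((state << 1) | b) & mask
        states.append(state)
    return states


# 16 initial bits plus three 4-bit steps give four states, i.e. eight weights.
stream = [random.Random(0).randint(0, 1) for _ in range(28)]
states = bitshift_states(stream)
print(len(states), [f"{s:016b}" for s in states])
```

Consecutive states overlap in L − k·V = 12 bits, which is why storing only the fresh bits per step is enough to reconstruct every state.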
Limitations
- This is a base model. For chat/instruction-following, use an instruction-tuned variant (not yet available at 2-bit).
- Generation quality is noticeably degraded compared to fp16 on complex reasoning tasks (see ARC-Challenge retention of 90.3%).
- Expert offload mode runs at ~4 tok/s due to CPU-GPU transfer overhead.
Citation
Technical report forthcoming on arXiv.
License
Apache 2.0 (same as the base OLMoE model).
Acknowledgments
Built on QTIP (Tseng et al., NeurIPS 2024) and OLMoE (Muennighoff et al., 2024).