SmartMoE-1B-7B-TQ3 🧠⚡
The smartest AI model that fits in 1GB VRAM + 6GB RAM.
Built on OLMoE-1B-7B-0125-Instruct — a Mixture of Experts model with 7B total parameters but only 1.3B active per token, quantized with TurboQuant-inspired techniques to fit extreme memory budgets.
🎯 Why This Model?
| Feature | Value |
|---|---|
| Total Parameters | 6.92B |
| Active Parameters/Token | 1.3B |
| Architecture | 64 experts, top-8 routing |
| Expert Combinations | 4.4 billion possible per token |
| Context Length | 4,096 tokens |
| License | Apache 2.0 |
The Key Insight: MoE + Aggressive Quantization
A dense model can only draw on the parameters it runs for every token. OLMoE instead packs 7B parameters into a MoE architecture where only 1.3B are active per token. Research also indicates that MoE expert layers are more robust to low-bit quantization than dense layers (MoQE, arxiv:2310.02410), which makes this architecture a good fit for extreme compression.
Result: A model that runs with 1GB of VRAM plus system RAM while accessing 7B parameters of learned knowledge.
📦 Available Quantizations
| File | Size | Bits | Best For | RAM Needed |
|---|---|---|---|---|
| SmartMoE-1B-7B-IQ2_M.gguf | ~2.3 GB | 2-bit | Ultra-low memory, mobile devices | ~3 GB |
| SmartMoE-1B-7B-IQ3_M.gguf | ~3.0 GB | 3-bit | Recommended: best quality-to-size | ~4 GB |
| SmartMoE-1B-7B-Q3_K_M.gguf | ~3.2 GB | 3-bit | K-quant alternative, good balance | ~4 GB |
| SmartMoE-1B-7B-Q4_K_M.gguf | ~4.0 GB | 4-bit | Highest quality, needs more memory | ~5 GB |
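As a rough sanity check on the "Bits" column, the effective bits per weight can be estimated from the file sizes above and the 6.92B total parameter count. This is only a sketch: it assumes 1 GB = 10^9 bytes, ignores GGUF metadata, and uses illustrative variable names that are not part of this repo.

```python
# Approximate file sizes in GB, copied from the table above.
FILE_SIZES_GB = {
    "SmartMoE-1B-7B-IQ2_M.gguf": 2.3,
    "SmartMoE-1B-7B-IQ3_M.gguf": 3.0,
    "SmartMoE-1B-7B-Q3_K_M.gguf": 3.2,
    "SmartMoE-1B-7B-Q4_K_M.gguf": 4.0,
}
TOTAL_PARAMS = 6.92e9  # total parameters from the feature table


def effective_bits_per_weight(size_gb, params=TOTAL_PARAMS):
    """Average bits stored per parameter (assumes 1 GB = 1e9 bytes, no metadata)."""
    return size_gb * 1e9 * 8 / params


bits_per_weight = {
    name: effective_bits_per_weight(size) for name, size in FILE_SIZES_GB.items()
}

for name, bpw in bits_per_weight.items():
    print(f"{name}: ~{bpw:.2f} bits/weight")
```

The estimates come out above the nominal bit width, which is expected: llama.cpp quantization formats store per-block scales and keep some tensors at higher precision.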
Memory Layout for 1GB VRAM + 6GB RAM
┌─────────────────────────────────┐
│ GPU VRAM (1 GB) │
│ ├─ Active expert weights (~400MB)│
│ └─ KV cache (~200MB) │
├─────────────────────────────────┤
│ System RAM (6 GB) │
│ ├─ Full model (~2.3-4.0 GB) │
│ └─ OS + overhead (~2 GB) │
└─────────────────────────────────┘
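The layout above can be checked with simple arithmetic. The sketch below uses only the approximate figures from the diagram and the quantization table; the constant and function names are illustrative, and real usage will vary with context length and runtime settings.

```python
VRAM_BUDGET_MB = 1024   # 1 GB of VRAM
RAM_BUDGET_GB = 6.0     # 6 GB of system RAM

# Approximate figures from the diagram above.
ACTIVE_WEIGHTS_MB = 400
KV_CACHE_MB = 200
OS_OVERHEAD_GB = 2.0

# Approximate GGUF file sizes from the quantization table.
MODEL_SIZES_GB = {"IQ2_M": 2.3, "IQ3_M": 3.0, "Q3_K_M": 3.2, "Q4_K_M": 4.0}

# VRAM left over after active expert weights and the KV cache.
vram_headroom_mb = VRAM_BUDGET_MB - (ACTIVE_WEIGHTS_MB + KV_CACHE_MB)


def fits_in_ram(model_gb):
    """True if the full model plus OS overhead stays within the RAM budget."""
    return model_gb + OS_OVERHEAD_GB <= RAM_BUDGET_GB


ram_fit = {name: fits_in_ram(size) for name, size in MODEL_SIZES_GB.items()}

print(f"VRAM headroom: {vram_headroom_mb} MB")
for name, ok in ram_fit.items():
    print(f"{name}: {'fits' if ok else 'does not fit'} in {RAM_BUDGET_GB:.0f} GB RAM")
```

With these numbers the Q4_K_M file sits exactly at the RAM limit, so the smaller quantizations leave more room for everything else running on the machine.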
🚀 Quick Start
With llama.cpp
# Download the model
huggingface-cli download Abasgames/SmartMoE-1B-7B-TQ3 SmartMoE-1B-7B-IQ3_M.gguf --local-dir .
# Run with llama-cli (CPU only)
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -p "Explain quantum computing in simple terms:" -n 256
# Run with GPU offload (1GB VRAM)
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -ngl 5 -p "Explain quantum computing in simple terms:" -n 256
# Interactive chat mode
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -cnv
With Ollama
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./SmartMoE-1B-7B-IQ3_M.gguf
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}
{{ end }}<|user|>
{{ .Prompt }}
<|assistant|>
"""
PARAMETER stop "<|endoftext|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF
ollama create smartmoe -f Modelfile
ollama run smartmoe "What is the meaning of life?"
With LM Studio
- Download any GGUF file from this repo
- Open LM Studio → Load Model → Select the GGUF file
- Start chatting!
📊 Benchmarks (Base Model)
From the OLMoE paper:
| Benchmark | OLMoE-1B-7B-Instruct | Llama-3.2-1B-Instruct | Qwen2.5-1.5B-Instruct |
|---|---|---|---|
| MMLU | 56.3 | 49.3 | 60.9 |
| HellaSwag | 81.7 | 60.7 | 67.0 |
| ARC-Challenge | 67.5 | 53.2 | 55.0 |
| WinoGrande | 73.5 | 60.5 | 66.2 |
| GSM8K | 28.4 | 44.4 | 73.2 |
Key takeaway: At a similar active compute cost, OLMoE leads both dense baselines on HellaSwag, ARC-Challenge, and WinoGrande, while Qwen2.5-1.5B-Instruct leads on MMLU. The main weakness is math (GSM8K), where OLMoE trails both baselines; the planned SFT training with OpenR1-Math data is designed to address this.
🔬 Technical Details
Architecture: OLMoE-1B-7B
Layers: 16
Hidden size: 2048
Attention heads: 16 (no GQA, full MHA)
Experts per layer: 64
Active experts per token: 8
Expert FFN size: 1024 (intermediate)
Vocabulary: 50,304
Context: 4,096 tokens
RoPE θ: 10,000
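The total and active parameter counts can be roughly reproduced from the values listed above. This is a back-of-the-envelope sketch, not an official accounting. It assumes three projection matrices per expert (SwiGLU-style), four square attention projections per layer, and untied input and output embeddings, and it ignores normalization parameters.

```python
# Architecture values from the list above.
LAYERS = 16
HIDDEN = 2048
EXPERTS = 64
ACTIVE_EXPERTS = 8
EXPERT_FFN = 1024
VOCAB = 50_304

# Assumptions (not stated above): 3 matrices per expert, 4 attention
# projections per layer, untied embeddings, norm parameters ignored.
expert_params = 3 * HIDDEN * EXPERT_FFN
attention_params = 4 * HIDDEN * HIDDEN
router_params = HIDDEN * EXPERTS
embedding_params = 2 * VOCAB * HIDDEN


def model_params(experts_per_layer):
    """Parameter count when `experts_per_layer` experts are counted in each layer."""
    per_layer = experts_per_layer * expert_params + attention_params + router_params
    return LAYERS * per_layer + embedding_params


total_params = model_params(EXPERTS)          # every expert stored
active_params = model_params(ACTIVE_EXPERTS)  # only the routed experts

print(f"total:  {total_params / 1e9:.2f}B")
print(f"active: {active_params / 1e9:.2f}B")
```

Under these assumptions the totals come to about 6.92B and 1.28B, in line with the 6.92B total and 1.3B active figures quoted in this card.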
Why MoE is Perfect for Low Memory
- Sparse Activation: Only 8/64 experts active = 12.5% of FFN weights used per token
- Quantization Robustness: MoE expert layers tolerate low-bit quantization better than dense layers (MoQE, 2310.02410)
- Natural Weight Offloading: Inactive expert weights can stay in RAM while active ones move to VRAM
- 4.4B Expert Combinations: The routing network selects from C(64,8) = 4,426,165,368 possible expert combinations per token, enabling massive representational capacity
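The routing arithmetic in the list above can be sketched in a few lines. This is a simplified illustration of top-8 selection over 64 router scores, not the model's actual router code. The random logits stand in for a learned router, and the selected weights are left unnormalized, which is an assumption of this sketch.

```python
import math
import random

NUM_EXPERTS = 64
TOP_K = 8


def route(logits, k=TOP_K):
    """Return (expert_index, probability) pairs for the k highest-scoring experts."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(i, probs[i]) for i in top]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # placeholder scores
chosen = route(logits)

active_fraction = TOP_K / NUM_EXPERTS          # share of expert FFN weights used
combinations = math.comb(NUM_EXPERTS, TOP_K)   # distinct sets of 8 out of 64

print("selected experts:", sorted(i for i, _ in chosen))
print(f"active fraction: {active_fraction:.1%}")
print(f"possible expert sets: {combinations:,}")
```

Note that the count of possible expert sets applies per MoE layer, since each layer routes independently.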
TurboQuant-Inspired Quantization
Based on TurboQuant (arxiv:2504.19874), accepted at ICLR 2026:
- Walsh-Hadamard Rotation: Weight matrices rotated before quantization for more uniform distribution
- Importance-Weighted Quantization: Critical weights (identified via importance matrix) get higher precision
- 3-bit Sweet Spot: TQ3 achieves MSE within 2.7× of Shannon lower bound
The GGUF files are produced with llama.cpp's quantization tools: k-quants for Q3_K_M and Q4_K_M, and I-quants with importance matrix calibration for IQ2_M and IQ3_M.
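To make the rotation idea concrete, here is a toy sketch of why a Walsh-Hadamard rotation can help low-bit quantization: it spreads a single outlier across all coordinates, so a uniform quantizer wastes less of its range. This is not the TurboQuant algorithm and not what llama.cpp does internally. The weight vector and the simple symmetric round-to-nearest quantizer are assumptions chosen for illustration.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in out]


def quantize_uniform(values, bits):
    """Symmetric round-to-nearest quantization, returned in dequantized form."""
    levels = 2 ** (bits - 1) - 1  # 3 for 3-bit, so codes run from -3 to 3
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    step = max_abs / levels
    return [max(-levels, min(levels, round(v / step))) * step for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights with one large outlier.
weights = [8.0, 0.5, -0.5, 0.25, -0.25, 0.5, -0.5, 0.25]

plain = quantize_uniform(weights, bits=3)
rotated = fwht(weights)
restored = fwht(quantize_uniform(rotated, bits=3))  # rotate back after quantizing

mse_plain = mse(weights, plain)
mse_rotated = mse(weights, restored)
print(f"MSE without rotation: {mse_plain:.4f}")
print(f"MSE with rotation:    {mse_rotated:.4f}")
```

On this particular vector the rotated path gives a lower reconstruction error because the outlier no longer dictates the quantization step for the small weights. Results on real weight matrices depend on their distribution.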
🏋️ Training Scripts (For Custom Fine-tuning)
This repo includes training scripts for a 3-stage pipeline based on the Tulu-3 recipe:
| Script | Purpose | Data |
|---|---|---|
| train_sft.py | Stage 1: SFT with QLoRA | 50K Tulu-3 + 30K OpenR1-Math (long CoT) + 15K short CoT |
| train_dpo.py | Stage 2: DPO alignment | 30K from orpo-dpo-mix-40k |
| quantize.py | Stage 3: Quantization | GGUF conversion + multi-quant |
Training Recipe Highlights
- Mix Distillation (arxiv:2502.12143): 50/50 long-CoT/short-CoT training improves sub-3B active parameter models
- QLoRA: r=64, alpha=128, target_modules="all-linear" (hits all 64 expert FFN layers)
- MoE Load Balancing: output_router_logits=True for balanced expert utilization
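For a sense of scale, the number of trainable LoRA parameters implied by r=64 can be estimated from the architecture. This is a rough sketch, not output from the training scripts. It assumes each expert exposes three separate linear projections that the adapter wraps, that each layer has four attention projections, and it leaves the router and output head out of the estimate.

```python
R = 64        # LoRA rank from the recipe above
ALPHA = 128   # LoRA alpha from the recipe above
LAYERS, HIDDEN, EXPERTS, EXPERT_FFN = 16, 2048, 64, 1024


def lora_params(d_in, d_out, r=R):
    """LoRA adds an (r x d_in) and a (d_out x r) matrix to one linear layer."""
    return r * (d_in + d_out)


# Assumed module layout: two hidden->FFN projections and one FFN->hidden
# projection per expert, plus four hidden->hidden attention projections.
per_expert = 2 * lora_params(HIDDEN, EXPERT_FFN) + lora_params(EXPERT_FFN, HIDDEN)
per_layer = EXPERTS * per_expert + 4 * lora_params(HIDDEN, HIDDEN)
trainable_params = LAYERS * per_layer

scaling = ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: ~{trainable_params / 1e6:.0f}M")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions nearly all adapter parameters sit in the expert FFN layers, which is why targeting every linear layer matters for a MoE model.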
# Run SFT (needs GPU, ~24GB VRAM)
pip install transformers trl peft bitsandbytes datasets trackio accelerate
python train_sft.py
# Run DPO (after SFT)
python train_dpo.py
# Quantize (CPU only)
python quantize.py
📝 Research Papers
This model builds on insights from:
| Paper | Contribution |
|---|---|
| OLMoE (2409.02060) | Base architecture: 64-expert MoE with top-8 routing |
| TurboQuant (2504.19874) | Walsh-Hadamard rotation for low-bit quantization |
| MoQE (2310.02410) | MoE expert layers are robust to quantization |
| Tulu-3 (2411.15124) | SFT + DPO training recipe |
| Mix Distillation (2502.12143) | 50/50 long/short CoT for small models |
| QLoRA (2305.14314) | 4-bit quantized LoRA fine-tuning |
| DPO (2305.18290) | Direct Preference Optimization |
⚠️ Limitations
- Math Performance: Base model scores 28.4 on GSM8K — weakest area. The SFT training with OpenR1-Math data is designed to improve this.
- Context Length: 4,096 tokens (not suitable for long-document tasks)
- Knowledge Cutoff: Training data has a knowledge cutoff; may not know recent events
- Quantization Quality: 2-bit (IQ2_M) shows noticeable quality degradation; 3-bit (IQ3_M) recommended minimum
- English-centric: Primarily trained on English data
🙏 Credits
- AllenAI for OLMoE model and Tulu-3 training recipe
- Google Research for TurboQuant research
- llama.cpp for GGUF format and quantization tools
- bartowski for pioneering OLMoE GGUF quantization
Built with ❤️ for edge AI — because intelligence shouldn't require a data center.