SmartMoE-1B-7B-TQ3 🧠⚡

A Mixture of Experts model sized to run in 1GB VRAM + 6GB RAM.

Built on OLMoE-1B-7B-0125-Instruct — a Mixture of Experts model with 7B total parameters but only 1.3B active per token, quantized with TurboQuant-inspired techniques to fit extreme memory budgets.

🎯 Why This Model?

Feature	Value
Total Parameters	6.92B
Active Parameters/Token	1.3B
Architecture	64 experts, top-8 routing
Expert Combinations	4.4 billion possible per token, per MoE layer
Context Length	4,096 tokens
License	Apache 2.0

The Key Insight: MoE + Aggressive Quantization

Many dense 1B-class models score only around 25-35 on MMLU, not far above the 25% random-guess baseline for its four-choice questions. OLMoE instead packs 7B parameters into a MoE architecture where only 1.3B are active per token. The MoQE paper (arxiv:2310.02410) reports that MoE expert layers are more robust to low-bit quantization than dense FFN layers, which makes this architecture a good fit for extreme compression.

Result: A quantized model that fits the 1GB VRAM + 6GB RAM budget while still drawing on 7B parameters of learned knowledge.

📦 Available Quantizations

File	Size	Bits	Best For	RAM Needed
`SmartMoE-1B-7B-IQ2_M.gguf`	~2.3 GB	2-bit	Ultra-low memory, mobile devices	~3 GB
`SmartMoE-1B-7B-IQ3_M.gguf`	~3.0 GB	3-bit	Recommended — best quality-to-size	~4 GB
`SmartMoE-1B-7B-Q3_K_M.gguf`	~3.2 GB	3-bit	K-quant alternative, good balance	~4 GB
`SmartMoE-1B-7B-Q4_K_M.gguf`	~4.0 GB	4-bit	Highest quality, needs more memory	~5 GB
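
To compare the files on a common scale, the sizes above can be converted into average bits stored per parameter. This is only a rough sketch: it assumes 1 GB means 1e9 bytes and that the whole file is weights (GGUF metadata is ignored), and it uses the 6.92B total from the feature table.

```python
# Rough effective bits per weight for each GGUF file, from the sizes in the table.
# Assumptions: 1 GB = 1e9 bytes, and the whole file is weights (metadata ignored).
TOTAL_PARAMS = 6.92e9  # total parameters, from the feature table

FILE_SIZES_GB = {
    "IQ2_M": 2.3,
    "IQ3_M": 3.0,
    "Q3_K_M": 3.2,
    "Q4_K_M": 4.0,
}


def bits_per_weight(size_gb, total_params=TOTAL_PARAMS):
    """Convert a file size in GB to average bits stored per parameter."""
    return size_gb * 1e9 * 8 / total_params


BPW = {name: bits_per_weight(size) for name, size in FILE_SIZES_GB.items()}

for name, bpw in BPW.items():
    print(f"{name}: ~{bpw:.2f} bits per weight")
```

The averages land above the nominal bit width in the "Bits" column, which is expected because mixed-precision quant types keep some tensors at higher precision.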

Memory Layout for 1GB VRAM + 6GB RAM

┌───────────────────────────────────┐
│ GPU VRAM (1 GB)                   │
│ ├─ Active expert weights (~400MB) │
│ └─ KV cache (~200MB)              │
├───────────────────────────────────┤
│ System RAM (6 GB)                 │
│ ├─ Full model (~2.3-4.0 GB)       │
│ └─ OS + overhead (~2 GB)          │
└───────────────────────────────────┘
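
The KV cache line in the diagram can be sanity-checked from the architecture values listed under Technical Details (16 layers, hidden size 2048, full MHA). The sketch below assumes an fp16 cache at 2 bytes per value, which is an assumption and not something this README states; a quantized cache would be smaller.

```python
# KV cache size from the architecture values in Technical Details.
# Assumption: fp16 cache entries (2 bytes each); a quantized cache would be smaller.
N_LAYERS = 16
HIDDEN_SIZE = 2048   # full MHA, so K and V each store hidden_size values per token
BYTES_PER_VALUE = 2  # fp16 (assumption)


def kv_cache_bytes(n_tokens):
    """Bytes for keys plus values across all layers for n_tokens of context."""
    return 2 * N_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE * n_tokens


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(4096)
tokens_in_200mb = 200_000_000 // per_token

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"4,096 tokens: {full_context / 2**20:.0f} MiB")
print(f"tokens that fit in 200 MB: {tokens_in_200mb}")
```

Under these assumptions, the ~200MB budget corresponds to roughly 1,500 tokens of fp16 cache. Using the full 4,096-token context within that budget would likely need a quantized KV cache or a larger share of memory.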

🚀 Quick Start

With llama.cpp

# Download the model
huggingface-cli download Abasgames/SmartMoE-1B-7B-TQ3 SmartMoE-1B-7B-IQ3_M.gguf --local-dir .

# Run with llama-cli (CPU only)
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -p "Explain quantum computing in simple terms:" -n 256

# Run with GPU offload (1GB VRAM)
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -ngl 5 -p "Explain quantum computing in simple terms:" -n 256

# Interactive chat mode
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -cnv

With Ollama

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./SmartMoE-1B-7B-IQ3_M.gguf
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}
{{ end }}<|user|>
{{ .Prompt }}
<|assistant|>
"""
PARAMETER stop "<|endoftext|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

ollama create smartmoe -f Modelfile
ollama run smartmoe "What is the meaning of life?"
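
For callers that build prompts by hand instead of going through Ollama, the TEMPLATE above can be mirrored with a small helper. This is a minimal sketch; `render_prompt` is a hypothetical name and not part of this repo or of Ollama.

```python
def render_prompt(prompt, system=None):
    """Mirror the Modelfile TEMPLATE: optional system turn, user turn, assistant header."""
    parts = []
    if system:
        # The system block is emitted only when a system message is given,
        # matching the {{ if .System }} ... {{ end }} guard in the template.
        parts.append(f"<|system|>\n{system}\n")
    parts.append(f"<|user|>\n{prompt}\n<|assistant|>\n")
    return "".join(parts)


print(render_prompt("What is the meaning of life?", system="You are concise."))
```

Generation should stop at `<|endoftext|>`, as set by the `PARAMETER stop` line in the Modelfile.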

With LM Studio

Download any GGUF file from this repo
Open LM Studio → Load Model → Select the GGUF file
Start chatting!

📊 Benchmarks (Base Model)

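Scores are for the unquantized base model, as reported in the OLMoE paper, not for the quantized GGUF files in this repo: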

Benchmark	OLMoE-1B-7B-Instruct	Llama-3.2-1B-Instruct	Qwen2.5-1.5B-Instruct
MMLU	56.3	49.3	60.9
HellaSwag	81.7	60.7	67.0
ARC-Challenge	67.5	53.2	55.0
WinoGrande	73.5	60.5	66.2
GSM8K	28.4	44.4	73.2

Key takeaway: OLMoE beats Llama-3.2-1B-Instruct on four of the five benchmarks and Qwen2.5-1.5B-Instruct on three, at a similar active compute cost. Its main weakness is math (GSM8K), which the planned SFT training with OpenR1-Math data is designed to address.

🔬 Technical Details

Architecture: OLMoE-1B-7B

Layers: 16
Hidden size: 2048
Attention heads: 16 (no GQA, full MHA)
Experts per layer: 64
Active experts per token: 8
Expert FFN size: 1024 (intermediate)
Vocabulary: 50,304
Context: 4,096 tokens
RoPE θ: 10,000
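
The headline parameter counts can be roughly reproduced from these dimensions. The sketch below relies on assumptions that this README does not state: each expert is a SwiGLU FFN with three weight matrices, the input and output embeddings are untied, and the small normalization weights are ignored.

```python
# Parameter count from the dimensions listed above.
# Assumptions (not stated in this README): each expert is a SwiGLU FFN with three
# weight matrices, input and output embeddings are untied, and norm weights are ignored.
LAYERS, HIDDEN, EXPERT_FFN = 16, 2048, 1024
EXPERTS, ACTIVE_EXPERTS, VOCAB = 64, 8, 50_304

attention = 4 * HIDDEN * HIDDEN       # Q, K, V, O projections (full MHA)
one_expert = 3 * HIDDEN * EXPERT_FFN  # gate, up, down (assumed SwiGLU)
router = HIDDEN * EXPERTS             # linear router over 64 experts
embeddings = 2 * VOCAB * HIDDEN       # input + output embeddings (assumed untied)

total = LAYERS * (attention + EXPERTS * one_expert + router) + embeddings
active = LAYERS * (attention + ACTIVE_EXPERTS * one_expert + router) + embeddings

print(f"total:  {total / 1e9:.2f}B")
print(f"active: {active / 1e9:.2f}B")
```

Under these assumptions the totals come out at about 6.92B and 1.28B, in line with the 6.92B total and 1.3B active figures quoted in the feature table.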

Why MoE is Perfect for Low Memory

Sparse Activation: Only 8/64 experts active = 12.5% of FFN weights used per token
Quantization Robustness: MoE expert layers tolerate low-bit quantization better than dense layers (MoQE, 2310.02410)
Natural Weight Offloading: Inactive expert weights can stay in RAM while active ones move to VRAM
4.4B Expert Combinations: In each MoE layer the router picks 8 of 64 experts, so there are C(64,8) = 4,426,165,368 possible expert subsets per token, which gives the model many distinct computation paths
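
The two numbers in this list (12.5% sparse activation and C(64,8) subsets) follow directly from the 64-expert, top-8 routing and can be checked with the standard library:

```python
import math

EXPERTS, TOP_K = 64, 8

# Number of distinct 8-expert subsets the router can choose in one MoE layer.
combinations = math.comb(EXPERTS, TOP_K)

# Share of expert FFN weights touched for a single token.
active_fraction = TOP_K / EXPERTS

print(f"{combinations:,} expert subsets per token per layer")
print(f"{active_fraction:.1%} of expert FFN weights active")
```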

TurboQuant-Inspired Quantization

Based on TurboQuant (arxiv:2504.19874), accepted at ICLR 2026:

Walsh-Hadamard Rotation: Weight matrices rotated before quantization for more uniform distribution
Importance-Weighted Quantization: Critical weights (identified via importance matrix) get higher precision
3-bit Sweet Spot: TQ3 achieves MSE within 2.7× of Shannon lower bound

The GGUF files themselves use llama.cpp's standard quantization types: k-quants for Q3_K_M and Q4_K_M, and I-quants with importance matrix calibration for IQ2_M and IQ3_M.
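
The rotation idea from the first bullet can be illustrated with a normalized fast Walsh-Hadamard transform. This is a toy sketch in pure Python with made-up weight values; it is not the TurboQuant algorithm itself, nor how the GGUF files in this repo were produced. It only shows that the rotation is reversible and spreads a single outlier across all coordinates.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; applying it twice restores the input."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in out]


def peak_to_rms(xs):
    """Largest magnitude relative to RMS; lower means a flatter distribution."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs))
    return max(abs(x) for x in xs) / rms


# Hypothetical weight row with one large outlier.
weights = [8.0, 0.5, -0.5, 0.25, 0.0, 0.0, 0.25, -0.5]
rotated = fwht(weights)
restored = fwht(rotated)

print(f"peak/RMS before rotation: {peak_to_rms(weights):.2f}")
print(f"peak/RMS after rotation:  {peak_to_rms(rotated):.2f}")
```

A lower peak-to-RMS ratio means a uniform low-bit grid spends fewer of its levels covering a single outlier, which is the motivation for rotating before quantizing.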

🏋️ Training Scripts (For Custom Fine-tuning)

This repo includes training scripts for a 3-stage pipeline based on the Tulu-3 recipe:

Script	Purpose	Data
`train_sft.py`	Stage 1: SFT with QLoRA	50K Tulu-3 + 30K OpenR1-Math (long CoT) + 15K short CoT
`train_dpo.py`	Stage 2: DPO alignment	30K from orpo-dpo-mix-40k
`quantize.py`	Stage 3: Quantization	GGUF conversion + multi-quant

Training Recipe Highlights

Mix Distillation (arxiv:2502.12143): 50/50 long-CoT/short-CoT training improves sub-3B active parameter models
QLoRA: r=64, alpha=128, target_modules="all-linear" (hits all 64 expert FFN layers)
MoE Load Balancing: output_router_logits=True, so the router's auxiliary load-balancing loss is added to the training loss and expert utilization stays balanced
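
As a rough guide to how large the r=64 adapter is, the sketch below counts LoRA parameters over every linear layer. It assumes three projections per expert (SwiGLU), that the router is a linear layer which "all-linear" also targets, and that the output head is excluded. None of these are confirmed by this README, so treat the result as an estimate.

```python
# Rough count of trainable LoRA parameters for r=64 on every linear layer.
# Assumptions (not stated in this README): three projections per expert (SwiGLU),
# the router is a linear layer that "all-linear" also targets, and the output
# head is excluded.
R = 64
LAYERS, HIDDEN, EXPERT_FFN, EXPERTS = 16, 2048, 1024, 64


def lora_params(in_features, out_features, r=R):
    """LoRA adds A (r x in) and B (out x r) next to each adapted linear layer."""
    return r * (in_features + out_features)


attention = 4 * lora_params(HIDDEN, HIDDEN)              # Q, K, V, O
router = lora_params(HIDDEN, EXPERTS)                    # expert router
experts = EXPERTS * 3 * lora_params(HIDDEN, EXPERT_FFN)  # gate, up, down per expert
trainable = LAYERS * (attention + router + experts)

print(f"trainable LoRA parameters: ~{trainable / 1e6:.0f}M")
print(f"share in expert FFNs: {LAYERS * experts / trainable:.1%}")
```

Under these assumptions almost all adapter parameters sit in the expert FFNs, so restricting `target_modules` to the attention projections would give a much smaller adapter if memory is tight.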

# Run SFT (needs GPU, ~24GB VRAM)
pip install transformers trl peft bitsandbytes datasets trackio accelerate
python train_sft.py

# Run DPO (after SFT)
python train_dpo.py

# Quantize (CPU only)
python quantize.py

📝 Research Papers

This model builds on insights from:

Paper	Contribution
OLMoE (2409.02060)	Base architecture: 64-expert MoE with top-8 routing
TurboQuant (2504.19874)	Walsh-Hadamard rotation for low-bit quantization
MoQE (2310.02410)	MoE expert layers are robust to quantization
Tulu-3 (2411.15124)	SFT + DPO training recipe
Mix Distillation (2502.12143)	50/50 long/short CoT for small models
QLoRA (2305.14314)	4-bit quantized LoRA fine-tuning
DPO (2305.18290)	Direct Preference Optimization

⚠️ Limitations

Math Performance: Base model scores 28.4 on GSM8K — weakest area. The SFT training with OpenR1-Math data is designed to improve this.
Context Length: 4,096 tokens (not suitable for long-document tasks)
Knowledge Cutoff: The model only knows what was in its training data and may not know recent events
Quantization Quality: 2-bit (IQ2_M) shows noticeable quality degradation; 3-bit (IQ3_M) recommended minimum
English-centric: Primarily trained on English data