SmartMoE-1B-7B-TQ3 🧠⚡

The smartest AI model that fits in 1GB VRAM + 6GB RAM.

Built on OLMoE-1B-7B-0125-Instruct — a Mixture of Experts model with 7B total parameters but only 1.3B active per token, quantized with TurboQuant-inspired techniques to fit extreme memory budgets.

🎯 Why This Model?

| Feature | Value |
| --- | --- |
| Total Parameters | 6.92B |
| Active Parameters/Token | 1.3B |
| Architecture | 64 experts, top-8 routing |
| Expert Combinations | 4.4 billion possible per token |
| Context Length | 4,096 tokens |
| License | Apache 2.0 |

The Key Insight: MoE + Aggressive Quantization

Dense models with about 1B parameters have limited capacity for stored knowledge (Llama-3.2-1B-Instruct scores 49.3 on MMLU in the benchmark table below). OLMoE instead packs 7B parameters into a MoE architecture where only 1.3B are active per token. The MoQE paper (arxiv:2310.02410) reports that MoE expert layers are more robust to low-bit quantization than dense layers, which makes this architecture a good candidate for extreme compression.

Result: A model that runs with 1GB of VRAM, with the full weights held in system RAM, while drawing on 7B parameters of learned knowledge.

📦 Available Quantizations

| File | Size | Bits | Best For | RAM Needed |
| --- | --- | --- | --- | --- |
| SmartMoE-1B-7B-IQ2_M.gguf | ~2.3 GB | 2-bit | Ultra-low memory, mobile devices | ~3 GB |
| SmartMoE-1B-7B-IQ3_M.gguf | ~3.0 GB | 3-bit | Recommended: best quality-to-size | ~4 GB |
| SmartMoE-1B-7B-Q3_K_M.gguf | ~3.2 GB | 3-bit | K-quant alternative, good balance | ~4 GB |
| SmartMoE-1B-7B-Q4_K_M.gguf | ~4.0 GB | 4-bit | Highest quality, needs more memory | ~5 GB |
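
The "Bits" column is a nominal label. As a rough sanity check, the sketch below divides each approximate file size by the 6.92B total parameter count to estimate the effective bits per weight. It assumes 1 GB = 10^9 bytes and ignores GGUF metadata, so treat the numbers as estimates only.

```python
# Rough estimate of effective bits per weight for each GGUF file.
# Assumptions (not from the repo): 1 GB = 1e9 bytes, metadata overhead ignored.
TOTAL_PARAMS = 6.92e9  # total parameters, from the feature table

FILE_SIZES_GB = {
    "IQ2_M": 2.3,
    "IQ3_M": 3.0,
    "Q3_K_M": 3.2,
    "Q4_K_M": 4.0,
}

def bits_per_weight(size_gb, n_params=TOTAL_PARAMS):
    """Convert an approximate file size in GB to average bits per parameter."""
    return size_gb * 1e9 * 8 / n_params

bpw = {name: bits_per_weight(size) for name, size in FILE_SIZES_GB.items()}

for name, value in bpw.items():
    print(f"{name}: ~{value:.2f} bits/weight")
```

Under these assumptions the "2-bit" and "3-bit" files average somewhat more than their nominal width, which is expected because the block formats also store scales and some tensors are usually kept at higher precision.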

Memory Layout for 1GB VRAM + 6GB RAM

┌────────────────────────────────────┐
│ GPU VRAM (1 GB)                    │
│ ├─ Active expert weights (~400MB)  │
│ └─ KV cache (~200MB)               │
├────────────────────────────────────┤
│ System RAM (6 GB)                  │
│ ├─ Full model (~2.3-4.0 GB)        │
│ └─ OS + overhead (~2 GB)           │
└────────────────────────────────────┘
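
To see how far the ~200MB KV cache budget stretches, here is a back-of-the-envelope estimate using the layer count and hidden size listed under Technical Details. It assumes an unquantized fp16 cache and full multi-head attention (one key and one value vector of hidden size per layer per token). The actual footprint depends on the runtime's cache settings.

```python
# KV cache size estimate from the architecture figures in this card.
# Assumption: fp16 cache (2 bytes per value), no cache quantization.
N_LAYERS = 16
HIDDEN_SIZE = 2048
BYTES_FP16 = 2
MIB = 1024 ** 2

def kv_cache_bytes(n_tokens, bytes_per_value=BYTES_FP16):
    """Bytes for keys plus values across all layers for n_tokens of context."""
    return 2 * N_LAYERS * HIDDEN_SIZE * bytes_per_value * n_tokens

per_token = kv_cache_bytes(1)
full_context_mib = kv_cache_bytes(4096) / MIB
tokens_in_200_mib = (200 * MIB) // per_token

print(f"per token: {per_token // 1024} KiB")
print(f"4,096 tokens: {full_context_mib:.0f} MiB")
print(f"tokens that fit in 200 MiB: {tokens_in_200_mib}")
```

Under these assumptions a 200 MiB budget covers about 1,600 tokens of fp16 cache, so using the full 4,096-token context inside 1GB of VRAM would likely need a quantized KV cache or a smaller context setting.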

🚀 Quick Start

With llama.cpp

# Download the model
huggingface-cli download Abasgames/SmartMoE-1B-7B-TQ3 SmartMoE-1B-7B-IQ3_M.gguf --local-dir .

# Run with llama-cli (CPU only)
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -p "Explain quantum computing in simple terms:" -n 256

# Run with GPU offload (1GB VRAM)
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -ngl 5 -p "Explain quantum computing in simple terms:" -n 256

# Interactive chat mode
./llama-cli -m SmartMoE-1B-7B-IQ3_M.gguf -cnv

With Ollama

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./SmartMoE-1B-7B-IQ3_M.gguf
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}
{{ end }}<|user|>
{{ .Prompt }}
<|assistant|>
"""
PARAMETER stop "<|endoftext|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

ollama create smartmoe -f Modelfile
ollama run smartmoe "What is the meaning of life?"

With LM Studio

  1. Download any GGUF file from this repo
  2. Open LM Studio → Load Model → Select the GGUF file
  3. Start chatting!

📊 Benchmarks (Base Model)

From the OLMoE paper:

| Benchmark | OLMoE-1B-7B-Instruct | Llama-3.2-1B-Instruct | Qwen2.5-1.5B-Instruct |
| --- | --- | --- | --- |
| MMLU | 56.3 | 49.3 | 60.9 |
| HellaSwag | 81.7 | 60.7 | 67.0 |
| ARC-Challenge | 67.5 | 53.2 | 55.0 |
| WinoGrande | 73.5 | 60.5 | 66.2 |
| GSM8K | 28.4 | 44.4 | 73.2 |

Key takeaway: At a similar active compute cost, OLMoE beats Llama-3.2-1B-Instruct on every benchmark above except GSM8K, and beats Qwen2.5-1.5B-Instruct on HellaSwag, ARC-Challenge, and WinoGrande. The main weakness is math (GSM8K), which the planned SFT training with OpenR1-Math data is designed to address.

🔬 Technical Details

Architecture: OLMoE-1B-7B

Layers: 16
Hidden size: 2048
Attention heads: 16 (no GQA, full MHA)
Experts per layer: 64
Active experts per token: 8
Expert FFN size: 1024 (intermediate)
Vocabulary: 50,304
Context: 4,096 tokens
RoPE θ: 10,000
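
The 6.92B total and 1.3B active figures can be roughly reproduced from the numbers above. This sketch assumes each expert is a SwiGLU-style FFN with three weight matrices (gate, up, down), that input and output embeddings are untied, and it ignores the small normalization weights. These are assumptions about the architecture, not values read from the checkpoint.

```python
# Approximate parameter count from the architecture figures listed above.
# Assumptions: 3 matrices per expert (gate/up/down), untied embeddings,
# normalization weights ignored.
LAYERS = 16
HIDDEN = 2048
EXPERTS = 64
ACTIVE_EXPERTS = 8
EXPERT_FFN = 1024
VOCAB = 50_304

expert_params = 3 * HIDDEN * EXPERT_FFN   # one expert FFN
attention_params = 4 * HIDDEN * HIDDEN    # Q, K, V, and output projections
router_params = HIDDEN * EXPERTS          # router linear layer
embedding_params = 2 * VOCAB * HIDDEN     # input embedding + output head

def model_params(experts_counted):
    """Parameter count when `experts_counted` experts per layer are included."""
    per_layer = experts_counted * expert_params + attention_params + router_params
    return LAYERS * per_layer + embedding_params

total = model_params(EXPERTS)
active = model_params(ACTIVE_EXPERTS)

print(f"total:  {total:,} (~{total / 1e9:.2f}B)")
print(f"active: {active:,} (~{active / 1e9:.2f}B)")
```

The expert FFNs account for about 6.44B of the total, which is why keeping inactive experts in system RAM matters so much for the memory budget.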

Why MoE is Perfect for Low Memory

  1. Sparse Activation: Only 8/64 experts active = 12.5% of FFN weights used per token
  2. Quantization Robustness: MoE expert layers tolerate low-bit quantization better than dense layers (MoQE, 2310.02410)
  3. Natural Weight Offloading: Inactive expert weights can stay in RAM while active ones move to VRAM
  4. 4.4B Expert Combinations: The routing network selects from C(64,8) = 4,426,165,368 possible expert combinations per token, enabling massive representational capacity
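
A minimal sketch of top-8 routing for a single token in one layer, plus a check of the combination count quoted above. The router logits are random placeholders, and the softmax-then-top-k order with no renormalization is an assumption for illustration rather than a statement about the OLMoE implementation.

```python
import math
import random

N_EXPERTS = 64
TOP_K = 8

def route_top_k(router_logits, k=TOP_K):
    """Softmax over expert logits, then keep the k most probable experts."""
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]  # numerically stable softmax
    norm = sum(exps)
    probs = [e / norm for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

# Placeholder logits for one token; a real router computes these from the hidden state.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
selected = route_top_k(logits)

n_combinations = math.comb(N_EXPERTS, TOP_K)
active_fraction = len(selected) / N_EXPERTS

print("selected experts:", [i for i, _ in selected])
print("possible expert sets per layer:", n_combinations)
print("fraction of expert FFNs used:", active_fraction)
```

Only the selected experts' weights are needed for this token, which is what makes keeping the rest in system RAM workable.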

TurboQuant-Inspired Quantization

Based on TurboQuant (arxiv:2504.19874), accepted at ICLR 2026:

  • Walsh-Hadamard Rotation: Weight matrices rotated before quantization for more uniform distribution
  • Importance-Weighted Quantization: Critical weights (identified via importance matrix) get higher precision
  • 3-bit Sweet Spot: TQ3 achieves MSE within 2.7× of Shannon lower bound
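
To illustrate why rotating before quantizing can help, the toy sketch below compares 3-bit round-to-nearest quantization of a weight vector with one large outlier, with and without a Walsh-Hadamard rotation. It is a simplified stand-in: the uniform quantizer, the synthetic weights, and the outlier value are all assumptions for illustration. It is not the TurboQuant algorithm, llama.cpp's quantizer, or this repo's quantize.py.

```python
import math
import random

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]

def quantize_dequantize(values, bits=3):
    """Symmetric uniform quantizer with a single scale for the whole vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic weights: unit Gaussian noise plus one hypothetical outlier.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]
weights[17] = 20.0

# Without rotation, the outlier sets the scale and small weights collapse to zero.
direct = quantize_dequantize(weights)

# With rotation: transform, quantize, then transform back.
# The normalized transform is its own inverse.
rotated = fwht(quantize_dequantize(fwht(weights)))

mse_direct = mse(weights, direct)
mse_rotated = mse(weights, rotated)
print(f"MSE without rotation: {mse_direct:.3f}")
print(f"MSE with rotation:    {mse_rotated:.3f}")
```

The rotation spreads the outlier's energy across all coordinates, so the quantization grid is no longer stretched to cover a single extreme value. The benefit depends on the weight distribution; with no outliers the two errors are much closer.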

The GGUF files use llama.cpp's quantization formats: K-quants (Q3_K_M, Q4_K_M) and I-quants (IQ2_M, IQ3_M). The I-quant variants are calibrated with an importance matrix.

🏋️ Training Scripts (For Custom Fine-tuning)

This repo includes training scripts for a 3-stage pipeline based on the Tulu-3 recipe:

| Script | Purpose | Data |
| --- | --- | --- |
| train_sft.py | Stage 1: SFT with QLoRA | 50K Tulu-3 + 30K OpenR1-Math (long CoT) + 15K short CoT |
| train_dpo.py | Stage 2: DPO alignment | 30K from orpo-dpo-mix-40k |
| quantize.py | Stage 3: Quantization | GGUF conversion + multi-quant |

Training Recipe Highlights

  • Mix Distillation (arxiv:2502.12143): 50/50 long-CoT/short-CoT training improves sub-3B active parameter models
  • QLoRA: r=64, alpha=128, target_modules="all-linear" (hits all 64 expert FFN layers)
  • MoE Load Balancing: output_router_logits=True for balanced expert utilization

# Run SFT (needs GPU, ~24GB VRAM)
pip install transformers trl peft bitsandbytes datasets trackio accelerate
python train_sft.py

# Run DPO (after SFT)
python train_dpo.py

# Quantize (CPU only)
python quantize.py
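
For a sense of scale, the QLoRA settings listed under Training Recipe Highlights (r=64, alpha=128, all linear layers) imply a fairly large adapter on a 64-expert model. The estimate below assumes each expert holds three separate linear projections, that the router is a linear layer, and that the output head is excluded from "all-linear" targeting. It is arithmetic on the architecture figures in this card, not a readout from train_sft.py.

```python
# Estimated trainable LoRA parameters for r=64 applied to every linear layer.
# Assumptions: 3 linear projections per expert, a linear router, and the
# output head excluded from adapter targeting.
LAYERS = 16
HIDDEN = 2048
EXPERTS = 64
EXPERT_FFN = 1024
RANK = 64
ALPHA = 128

def lora_params(in_features, out_features, rank=RANK):
    """LoRA adds A (rank x in) and B (out x rank) for each adapted linear layer."""
    return rank * (in_features + out_features)

per_layer = (
    EXPERTS * 3 * lora_params(HIDDEN, EXPERT_FFN)  # gate, up, down in each expert
    + 4 * lora_params(HIDDEN, HIDDEN)              # Q, K, V, output projections
    + lora_params(HIDDEN, EXPERTS)                 # router
)
trainable = LAYERS * per_layer
scaling = ALPHA / RANK  # LoRA update is scaled by alpha / r

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Almost all of these adapter weights sit on the expert FFNs, which helps explain the ~24GB VRAM note above even though the base weights are loaded in 4-bit.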

📝 Research Papers

This model builds on insights from:

| Paper | Contribution |
| --- | --- |
| OLMoE (2409.02060) | Base architecture: 64-expert MoE with top-8 routing |
| TurboQuant (2504.19874) | Walsh-Hadamard rotation for low-bit quantization |
| MoQE (2310.02410) | MoE expert layers are robust to quantization |
| Tulu-3 (2411.15124) | SFT + DPO training recipe |
| Mix Distillation (2502.12143) | 50/50 long/short CoT for small models |
| QLoRA (2305.14314) | 4-bit quantized LoRA fine-tuning |
| DPO (2305.18290) | Direct Preference Optimization |

⚠️ Limitations

  • Math Performance: Base model scores 28.4 on GSM8K — weakest area. The SFT training with OpenR1-Math data is designed to improve this.
  • Context Length: 4,096 tokens (not suitable for long-document tasks)
  • Knowledge Cutoff: Training data has a knowledge cutoff; may not know recent events
  • Quantization Quality: 2-bit (IQ2_M) shows noticeable quality degradation; 3-bit (IQ3_M) recommended minimum
  • English-centric: Primarily trained on English data

🙏 Credits


Built with ❤️ for edge AI — because intelligence shouldn't require a data center.
