---
title: README
emoji: 📊
colorFrom: blue
colorTo: blue
sdk: static
pinned: true
license: apache-2.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/Nwp5bcZfu_D51MUNCN3oO.png
short_description: 'MoM: Specialized Models for Intelligent Routing'
---

**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models)—a family of specialized routing models that power vLLM-SR's intelligent decision-making.
+ vLLM Semantic Router 👉 [project link](https://github.com/vllm-project/semantic-router)
<!-- truncate -->
## Why MoM?
vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."
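To make the idea concrete, here is a minimal routing sketch. The `classify_intent` helper and the way queries map to models are hypothetical placeholders for illustration, not the vLLM-SR API:

```python
# Minimal sketch of cost-aware routing.
# The classifier logic and model choices below are hypothetical placeholders.

def classify_intent(query: str) -> str:
    """Stand-in for a fast intent classifier such as mom-brain-flash."""
    return "legal" if "contract" in query.lower() else "general"

def route(query: str) -> str:
    """Send everyday questions to a cheap generalist, specialist work to an expert."""
    if classify_intent(query) == "legal":
        return "mom-expert-law-flash"        # domain-specific backend solver
    return "mom-expert-generalist-flash"     # low-cost generalist backend

print(route("What's the weather?"))           # -> mom-expert-generalist-flash
print(route("Analyze this legal contract."))  # -> mom-expert-law-flash
```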
## MoM System Card
A quick overview of all MoM models:
<div align="center">
| Category | Model | Size | Architecture | Base Model | Purpose |
|----------|-------|------|--------------|------------|---------|
| **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
| **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | BERT | Semantic similarity matching |
| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |
</div>
**Key Insights:**
- **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
- **Flash** models achieve 10,000+ QPS on commodity hardware
- **SLM Experts** are not routers; they are specialized backend models that solve domain-specific problems (see the sketch below)
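As a rough sketch of how the encoder/decoder split could be used with the 🤗 `transformers` pipeline API. The repository IDs, the label set in the comment, and the task each checkpoint exposes are assumptions for illustration, not documented interfaces:

```python
from transformers import pipeline

# Encoder-only (ModernBERT) routers run as single-pass classifiers, which is
# what makes sub-10ms routing feasible. The repo ID and label set here are
# illustrative assumptions, not confirmed model-card values.
router = pipeline("text-classification", model="llm-semantic-router/mom-brain-flash")
print(router("Prove that the sum of two even numbers is even."))
# e.g. [{'label': 'math', 'score': 0.98}]  <- hypothetical label set

# Decoder-only (Qwen3) models generate text, so an expert can solve the problem
# itself (and mom-brain-pro/max can explain their routing decisions).
expert = pipeline("text-generation", model="llm-semantic-router/mom-expert-math-flash")
print(expert("Prove that the sum of two even numbers is even.",
             max_new_tokens=128)[0]["generated_text"])
```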