---
title: README
emoji: 📊
colorFrom: blue
colorTo: blue
sdk: static
pinned: true
license: apache-2.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/Nwp5bcZfu_D51MUNCN3oO.png
short_description: 'MoM: Specialized Models for Intelligent Routing'
---

**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models): a family of specialized routing models that power vLLM-SR's intelligent decision-making.

👉 vLLM Semantic Router: [project link](https://github.com/vllm-project/semantic-router)

<!-- truncate -->

## Why MoM?

vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."
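To make the idea concrete, here is a minimal sketch of the kind of routing decision this enables. The intent labels, confidence threshold, and backend model names below are illustrative placeholders, not vLLM-SR's actual configuration.

```python
# Illustrative sketch only: intent labels, threshold, and backend model
# names are placeholders, not vLLM-SR's actual configuration.
from dataclasses import dataclass


@dataclass
class Route:
    backend_model: str  # which backend LLM should serve the request
    reason: str         # why this route was chosen


def route_request(intent: str, confidence: float) -> Route:
    """Pick a backend based on the intent predicted by a MoM routing model."""
    # Cheap, chatty queries go to a small, inexpensive model.
    if intent == "casual" and confidence > 0.8:
        return Route("small-chat-model", "low-complexity query")
    # Domain-heavy queries go to a specialized expert model.
    if intent in {"math", "law", "science"}:
        return Route(f"{intent}-expert-model", "domain-specific query")
    # Fall back to a general-purpose model when unsure.
    return Route("general-model", "default route")


print(route_request("casual", 0.95))  # e.g. "What's the weather?"
print(route_request("law", 0.90))     # e.g. "Analyze this legal contract."
```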
## MoM System Card

A quick overview of all MoM models:

<div align="center">

| Category | Model | Size | Architecture | Base Model | Purpose |
|----------|-------|------|--------------|------------|---------|
| **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
| **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | BERT | Semantic similarity matching |
| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |

</div>
**Key Insights:**

- **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing (see the usage sketch below)
- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
- **Flash** models achieve 10,000+ QPS on commodity hardware
- **SLM Experts** are not routers: they are specialized backend models that solve domain-specific problems
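As a rough sketch of how a Flash routing model could be called on its own, the snippet below loads an encoder classifier with Hugging Face `transformers` and predicts an intent label for a query. The repo id `<org>/mom-brain-flash` is a placeholder, and the label set depends on how the model was trained; check the individual model cards for the published ids and labels.

```python
# Sketch only: replace the placeholder repo id with the actual Hugging Face
# id of the MoM model you want to use; labels come from the model's config.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "<org>/mom-brain-flash"  # placeholder, not a published id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

query = "Prove that the sum of two even numbers is even."
inputs = tokenizer(query, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# id2label is read from the model config; the exact label names are model-specific.
predicted = model.config.id2label[int(logits.argmax(dim=-1))]
print(f"Predicted routing intent: {predicted}")
```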