---
title: README
emoji: 📊
colorFrom: blue
colorTo: blue
sdk: static
pinned: true
license: apache-2.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/Nwp5bcZfu_D51MUNCN3oO.png
short_description: 'MoM: Specialized Models for Intelligent Routing'
---

**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models): a family of specialized routing models that power vLLM-SR's intelligent decision-making.

👉 vLLM Semantic Router: [project link](https://github.com/vllm-project/semantic-router)

<!-- truncate -->

## Why MoM?

vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."
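To make the idea concrete, here is a minimal sketch of the kind of routing decision this enables. The intent labels, confidence threshold, and backend model names below are illustrative placeholders, not vLLM-SR's actual configuration.

```python
# Illustrative sketch only: intent labels, threshold, and backend model
# names are placeholders, not vLLM-SR's actual configuration.
from dataclasses import dataclass


@dataclass
class Route:
    backend_model: str  # which backend LLM should serve the request
    reason: str         # why this route was chosen


def route_request(intent: str, confidence: float) -> Route:
    """Pick a backend based on the intent predicted by a MoM routing model."""
    # Cheap, chatty queries go to a small, inexpensive model.
    if intent == "casual" and confidence > 0.8:
        return Route("small-chat-model", "low-complexity query")
    # Domain-heavy queries go to a specialized expert model.
    if intent in {"math", "law", "science"}:
        return Route(f"{intent}-expert-model", "domain-specific query")
    # Fall back to a general-purpose model when unsure.
    return Route("general-model", "default route")


print(route_request("casual", 0.95))  # e.g. "What's the weather?"
print(route_request("law", 0.90))     # e.g. "Analyze this legal contract."
```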
## MoM System Card

A quick overview of all MoM models:

<div align="center">

| Category | Model | Size | Architecture | Base Model | Purpose |
|----------|-------|------|--------------|------------|---------|
| **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
| **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | BERT | Semantic similarity matching |
| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |

</div>
**Key Insights:**

- **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing (see the usage sketch below)
- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
- **Flash** models achieve 10,000+ QPS on commodity hardware
- **SLM Experts** are not routers: they are specialized backend models that solve domain-specific problems
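As a rough sketch of how a Flash routing model could be called on its own, the snippet below loads an encoder classifier with Hugging Face `transformers` and predicts an intent label for a query. The repo id `<org>/mom-brain-flash` is a placeholder, and the label set depends on how the model was trained; check the individual model cards for the published ids and labels.

```python
# Sketch only: replace the placeholder repo id with the actual Hugging Face
# id of the MoM model you want to use; labels come from the model's config.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "<org>/mom-brain-flash"  # placeholder, not a published id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

query = "Prove that the sum of two even numbers is even."
inputs = tokenizer(query, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# id2label is read from the model config; the exact label names are model-specific.
predicted = model.config.id2label[int(logits.argmax(dim=-1))]
print(f"Predicted routing intent: {predicted}")
```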