QueryComplexityRouter

A fast, lightweight 3-class classifier that decides how much LLM power a query needs before you spend any tokens on it.

Built on DistilBERT (66M params), fine-tuned to classify any user message into one of three complexity tiers:

Label       Meaning                                             Suggested action
no_llm      Answerable with rules, lookup, or regex             Skip the LLM entirely
small_llm   A 1–3B model (Phi-3, Gemma-2B) is sufficient        Route to a cheap local model
large_llm   Requires a 7B+ or frontier model (GPT-4, Claude)    Route to a powerful model

Why This Exists

Running every query through a frontier LLM is expensive and slow. But you also don't want to under-serve complex queries with a tiny model.

QueryComplexityRouter sits in front of your LLM calls and makes this decision in ~10ms on CPU, before any LLM call is made.

Pair it with AgentIntentRouter for a full 2-stage routing pipeline:

User Message
    │
    ▼
AgentIntentRouter          ← What does the user want? (code, search, chat, ...)
    │
    ▼
QueryComplexityRouter      ← How hard is it? (no_llm / small_llm / large_llm)
    │
    ▼
Route to the right tool/model

Quick Start

from transformers import pipeline

router = pipeline("text-classification", model="tripathyShaswata/QueryComplexityRouter")

# Single prediction
result = router("What is 15% of 4500?")
print(result)
# [{'label': 'no_llm', 'score': 0.98}]

# Batch
messages = [
    "What is the capital of France?",           # no_llm
    "Explain recursion in simple terms.",        # small_llm
    "Write a 1000-word blog post about AI.",     # large_llm
    "Design a distributed caching system.",      # large_llm
    "Fix this bug: def add(a,b): return a-b",   # small_llm
]
results = router(messages)
for msg, res in zip(messages, results):
    print(f"  {res['label']:>12} ({res['score']:.2f}) - {msg}")
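To gauge how much of your traffic could bypass the LLM, you can tally the predicted labels over a sample of real messages. A minimal sketch, assuming `results` is the list of `{'label', 'score'}` dicts that `router(messages)` returns for a batch; the helper name `label_shares` is hypothetical:

```python
from collections import Counter


def label_shares(results: list[dict]) -> dict[str, float]:
    """Return the fraction of predictions that fall into each complexity tier."""
    counts = Counter(r["label"] for r in results)
    total = sum(counts.values())
    # Guard against an empty batch to avoid division by zero.
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hand-written predictions standing in for router(messages) output.
sample = [
    {"label": "no_llm", "score": 0.98},
    {"label": "small_llm", "score": 0.91},
    {"label": "large_llm", "score": 0.95},
    {"label": "no_llm", "score": 0.99},
]
print(label_shares(sample))
# {'no_llm': 0.5, 'small_llm': 0.25, 'large_llm': 0.25}
```

Run it on a few hundred real queries to get a rough idea of what share of traffic each tier would receive.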

2-Stage Routing Pipeline

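from transformers import pipeline

intent_router = pipeline("text-classification", model="tripathyShaswata/AgentIntentRouter")
complexity_router = pipeline("text-classification", model="tripathyShaswata/QueryComplexityRouter")

# Placeholder handlers (hypothetical): replace with your own rule engine and model clients.
def handle_with_rules(message: str, intent: str):
    raise NotImplementedError("plug in your rule/lookup handler")

def call_small_model(message: str):
    raise NotImplementedError("plug in your 1–3B model client")

def call_large_model(message: str):
    raise NotImplementedError("plug in your 7B+ / frontier model client")

def route(user_message: str):
    intent = intent_router(user_message)[0]          # Stage 1: what does the user want?
    complexity = complexity_router(user_message)[0]  # Stage 2: how hard is it?

    print(f"Intent:     {intent['label']} ({intent['score']:.2f})")
    print(f"Complexity: {complexity['label']} ({complexity['score']:.2f})")

    if complexity["label"] == "no_llm":
        return handle_with_rules(user_message, intent["label"])
    elif complexity["label"] == "small_llm":
        return call_small_model(user_message)
    else:
        return call_large_model(user_message)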
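The `route()` function above uses the intent only in the no_llm branch. If you want both stages to influence the choice, one option is a lookup table keyed by `(intent, complexity)` with a per-tier fallback. A minimal sketch: the intent labels (`code`, `search`, `chat`) follow the examples in the diagram and may not match AgentIntentRouter's actual label set, and every backend name is hypothetical:

```python
# Hypothetical backend names; replace with your own tools and model endpoints.
ROUTES = {
    ("code", "small_llm"): "local-code-model",
    ("code", "large_llm"): "frontier-code-model",
    ("search", "no_llm"): "search-index",
    ("chat", "small_llm"): "local-chat-model",
}

# Fallback per complexity tier when no (intent, complexity) pair matches.
DEFAULTS = {
    "no_llm": "rules-engine",
    "small_llm": "local-general-model",
    "large_llm": "frontier-general-model",
}


def pick_backend(intent: str, complexity: str) -> str:
    """Choose a backend from both router outputs, falling back on complexity alone."""
    return ROUTES.get((intent, complexity), DEFAULTS[complexity])


print(pick_backend("code", "large_llm"))  # frontier-code-model
print(pick_backend("chat", "large_llm"))  # frontier-general-model
```

Keeping the table as data rather than nested `if` branches makes it easy to add intent-specific overrides without touching the routing logic.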

Complexity Labels

no_llm: No LLM needed

  • Simple math: "What is 42 * 7?"
  • Unit conversion: "Convert 100km to miles"
  • Factual lookup: "What is the capital of Japan?"
  • Date/time: "What day is March 15 2026?"
  • Simple commands: "Set a timer for 5 minutes"

small_llm: 1–3B model sufficient

  • Short summarization: "Summarize this paragraph..."
  • Basic explanation: "Explain recursion to a 10-year-old"
  • Simple code: "Write a Python function to reverse a string"
  • Short generation: "Write a one-line bio for a software engineer"
  • Simple classification: "Is this email spam?"

large_llm: 7B+ / frontier model required

  • Deep reasoning: "Analyze the ethical implications of AI replacing jobs"
  • Long-form writing: "Write a 1000-word blog post about quantum computing"
  • Complex code: "Build a REST API with auth, error handling, and tests"
  • Multi-doc synthesis: "Given these 5 documents, synthesize an answer..."
  • System design: "Design a distributed caching system with eventual consistency"
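For the no_llm tier, the "rules" side can be as simple as a few regular expressions. Below is a minimal sketch of what the `handle_with_rules` placeholder in the 2-stage example might delegate to for simple-math queries like the ones above; `answer_simple_math` and its patterns are hypothetical and cover only a single percentage or binary operation. Anything it cannot parse returns `None`, so you can fall back to a model:

```python
import re
from typing import Optional

# Matches e.g. "What is 15% of 4500?"
PERCENT_RE = re.compile(r"what is (\d+(?:\.\d+)?)\s*% of (\d+(?:\.\d+)?)", re.IGNORECASE)
# Matches e.g. "What is 42 * 7?" (a single binary operation only)
ARITH_RE = re.compile(r"what is (\d+(?:\.\d+)?)\s*([-+*/])\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def answer_simple_math(message: str) -> Optional[float]:
    """Answer simple percentage or arithmetic questions; return None if no rule matches."""
    m = PERCENT_RE.search(message)
    if m:
        pct, base = float(m.group(1)), float(m.group(2))
        return pct * base / 100
    m = ARITH_RE.search(message)
    if m:
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        if op == "/" and b == 0:
            return None  # Leave division by zero to a fallback path.
        return OPS[op](a, b)
    return None


print(answer_simple_math("What is 42 * 7?"))  # 294.0
```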

Performance

  • Inference speed: ~10ms on CPU, ~2ms on GPU
  • Model size: ~260MB (DistilBERT-base, F32 weights)
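Latency depends on hardware, thread settings, and input length, so it is worth measuring on your own machine. A small sketch that times single-message calls and reports the median; `median_latency_ms` is a hypothetical helper that accepts the `router` pipeline from Quick Start or any other callable:

```python
import statistics
import time
from typing import Callable, Sequence


def median_latency_ms(classify: Callable[[str], object], messages: Sequence[str], warmup: int = 3) -> float:
    """Time one call per message and return the median latency in milliseconds."""
    # Warm-up calls are excluded so one-time setup cost does not skew the result.
    for msg in messages[:warmup]:
        classify(msg)
    timings = []
    for msg in messages:
        start = time.perf_counter()
        classify(msg)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# With the real model: median_latency_ms(router, ["What is 42 * 7?"] * 50)
print(f"{median_latency_ms(lambda m: m.lower(), ['What is 42 * 7?'] * 50):.4f} ms")
```

The median is less sensitive than the mean to occasional slow calls caused by garbage collection or other processes.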

Evaluation Results

Results on held-out test set:

Metric           Score
Accuracy         ~0.99
F1 (weighted)    ~0.99

Per-class performance:

Class        Precision   Recall   F1
no_llm       ~1.00       ~1.00    ~1.00
small_llm    ~0.98       ~0.98    ~0.98
large_llm    ~0.99       ~0.99    ~0.99

Note: These results are on synthetic test data drawn from the same distribution as the training data, so real-world performance will vary. Apply a confidence-score threshold (see Use in Agent Pipelines) to handle ambiguous inputs gracefully.
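Because the reported scores come from synthetic data, it is worth re-measuring on a labeled sample of your own traffic. A dependency-free sketch that computes the same metrics as the tables above from lists of gold and predicted labels; the helper `evaluate` is hypothetical, and `y_pred` would typically be `[r["label"] for r in router(messages)]`:

```python
from collections import Counter

LABELS = ["no_llm", "small_llm", "large_llm"]


def evaluate(y_true: list[str], y_pred: list[str]) -> dict:
    """Compute accuracy, per-class precision/recall/F1, and support-weighted F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need non-empty, equally long label and prediction lists")
    n = len(y_true)
    support = Counter(y_true)
    report = {"accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n}
    weighted_f1 = 0.0
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[label] if support[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
        # Weight each class's F1 by its share of the gold labels.
        weighted_f1 += f1 * support[label] / n
    report["f1_weighted"] = weighted_f1
    return report
```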

Training Details

  • Base model: distilbert-base-uncased
  • Training data: 1,400 synthetic examples per class (4,200 total), template-generated with natural language variation
  • Epochs: 5 (with early stopping, patience=2)
  • Learning rate: 2e-5
  • Batch size: 32
  • Max sequence length: 128
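With patience=2, training halts once the monitored validation metric fails to improve for two consecutive epochs, and never runs past 5 epochs. A minimal sketch of that generic rule; the card does not say which metric was monitored, so validation loss is an assumption here, and `early_stopping_epoch` is a hypothetical helper rather than the actual training code:

```python
def early_stopping_epoch(val_losses: list[float], patience: int = 2, max_epochs: int = 5) -> tuple[int, int]:
    """Return (epochs_run, best_epoch), both 1-indexed, under patience-based early stopping."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            # Stop once validation loss has failed to improve `patience` times in a row.
            if bad_epochs >= patience:
                break
    return epochs_run, best_epoch


print(early_stopping_epoch([0.50, 0.30, 0.31, 0.32, 0.20]))  # (4, 2)
```

In this example epochs 3 and 4 fail to beat epoch 2, so training stops after epoch 4 and the epoch-2 checkpoint is the best one.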

Use in Agent Pipelines

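from transformers import pipeline

router = pipeline("text-classification", model="tripathyShaswata/QueryComplexityRouter")

# Minimum confidence required to trust each predicted label.
COMPLEXITY_THRESHOLDS = {
    "no_llm": 0.7,
    "small_llm": 0.6,
    "large_llm": 0.6,
}

def smart_route(message: str) -> str:
    result = router(message)[0]
    label, score = result["label"], result["score"]

    if score < COMPLEXITY_THRESHOLDS[label]:
        # Low confidence: default to large_llm for safety
        label = "large_llm"

    return label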

Limitations

  • Trained on English text only
  • Template-generated data may not cover all edge cases
  • Borderline queries (e.g., "explain quantum entanglement") may get lower confidence; use the threshold fallback
  • Complexity is query-level only; does not account for context window length or domain expertise needed
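Because the classifier sees only the query text, one way to cover the context-length gap is a cheap post-check that escalates small_llm predictions when the full prompt is long. A sketch under stated assumptions: the 1,500-word budget is a hypothetical placeholder for your small model's context limit, whitespace word counts only approximate token counts, and `adjust_for_length` is a hypothetical helper:

```python
# Hypothetical budget: tune to your small model's context window.
MAX_SMALL_MODEL_WORDS = 1500


def adjust_for_length(label: str, message: str, context: str = "", max_words: int = MAX_SMALL_MODEL_WORDS) -> str:
    """Escalate small_llm to large_llm when message plus context is likely too long."""
    # Whitespace word count is a rough stand-in for a real tokenizer count.
    n_words = len((message + " " + context).split())
    if label == "small_llm" and n_words > max_words:
        return "large_llm"
    # no_llm rules ignore context, and large_llm is already the top tier.
    return label
```

For example, `adjust_for_length(smart_route(msg), msg, retrieved_docs)` would keep the classifier's decision for short prompts and escalate only when the attached context is large.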

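Related Models

  • AgentIntentRouter (tripathyShaswata/AgentIntentRouter): intent classifier used as the first stage of the 2-stage pipeline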

License

Apache 2.0: use it however you want, commercial use included.

Citation

If this helps you, a star is appreciated!
