Hermes Edge — Intent Routing for On-Device LiteRT-LM Models

by bclermo - opened


I built an intent routing layer for on-device LiteRT-LM models. It classifies user input in ~6μs using only keyword and regex matching (no model inference), then dispatches to the best-suited model and generation parameters:

Architecture:

  • Pre-router classifies into chat / reasoning / tools via 17 confidence-weighted regex rules
  • 270M Qwen3 model stays hot (~180 MB RAM) for instant chat
  • Specialist models (DeepSeek-R1 1.5B, Gemma-4-E2B 2.5B) load on demand via LiteRT-LM fast init
  • Per-intent params: temperature, max_tokens, and MTP (multi-token prediction) are set per intent
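
To make the pre-router idea concrete, here is a minimal sketch of confidence-weighted regex routing. The rule patterns, weights, and the `Rule`/`classify` names are hypothetical and only for illustration; they are not the 17 rules shipped in the repo. Each matching rule adds its weight to its intent's score, the highest-scoring intent wins, and anything that matches no rule falls back to chat.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    intent: str
    pattern: re.Pattern
    weight: float  # confidence contributed when the pattern matches


# Hypothetical rules for illustration only (not the project's actual rule set).
RULES = [
    Rule("reasoning", re.compile(r"\b(prove|derive|step by step|why does)\b", re.I), 0.9),
    Rule("reasoning", re.compile(r"\b(solve|calculate)\b", re.I), 0.7),
    Rule("tools", re.compile(r"\b(search|look up|weather|latest news)\b", re.I), 0.8),
    Rule("tools", re.compile(r"\b(set (a |an )?(timer|alarm)|open|call)\b", re.I), 0.7),
]


def classify(text: str, default: str = "chat") -> tuple[str, float]:
    """Return (intent, score); falls back to `default` when no rule matches."""
    scores: dict[str, float] = {}
    for rule in RULES:
        if rule.pattern.search(text):
            scores[rule.intent] = scores.get(rule.intent, 0.0) + rule.weight
    if not scores:
        return default, 0.0
    intent = max(scores, key=scores.get)
    return intent, scores[intent]
```

With these example rules, `classify("hi there")` falls through to chat, while a query containing "prove" is routed to reasoning. Summing weights is just one way to combine rules; taking the maximum weight per intent would work too.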

Per-intent config:

| Intent | Model | Temp | Max Tokens | MTP |
|---|---|---|---|---|
| Chat | 270M Qwen3 (always hot) | 0.7 | 128 | ✅ 2.2× |
| Reasoning | 1.5B DeepSeek-R1-Distill | 0.6 | 384 | |
| Tools | 2.5B Gemma-4-E2B | 0.5 | 256 | ✅ 2.2× |
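
The per-intent config and the "one hot model, specialists on demand" strategy could be sketched roughly like this. The model identifiers, `IntentConfig`, `ModelPool`, and the `loader` callable are all hypothetical placeholders; in particular, `loader` stands in for whatever actually initializes a LiteRT-LM engine, and MTP for the reasoning row is assumed to be off because no value is listed for it.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class IntentConfig:
    model: str          # hypothetical model identifier
    temperature: float
    max_tokens: int
    mtp: bool           # multi-token prediction on/off


# Values taken from the per-intent table; reasoning MTP is assumed off.
CONFIGS = {
    "chat": IntentConfig("qwen3-270m", 0.7, 128, True),
    "reasoning": IntentConfig("deepseek-r1-distill-1.5b", 0.6, 384, False),
    "tools": IntentConfig("gemma-4-e2b", 0.5, 256, True),
}


class ModelPool:
    """Loads the hot model eagerly and every other model on first use."""

    def __init__(self, loader: Callable[[str], Any], hot_intent: str = "chat"):
        self._loader = loader              # placeholder for the real model init
        self._loaded: dict[str, Any] = {}
        self.get(hot_intent)               # keep the small chat model resident

    def get(self, intent: str) -> tuple[Any, IntentConfig]:
        cfg = CONFIGS[intent]
        if cfg.model not in self._loaded:
            self._loaded[cfg.model] = self._loader(cfg.model)
        return self._loaded[cfg.model], cfg
```

This sketch never evicts a loaded specialist. On memory-constrained phones you would probably want to unload the previous specialist before loading the next one, which is part of what I'm asking about below.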

Key results:

  • 17/17 diverse test queries classified correctly
  • 6.2μs per classification in Python; ~0.5μs if compiled
  • Web search without an API key, via DuckDuckGo Lite with User-Agent rotation
  • Dockerfile + CLI + HTTP server included
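
If you want to sanity-check a per-classification latency figure on your own hardware, something like the following `timeit` sketch is one way to do it. The `classify` function here is a trivial stand-in with a single made-up pattern, not the real router, so the number it prints says nothing about Hermes Edge itself.

```python
import re
import timeit

# Stand-in classifier with one hypothetical pattern; swap in the real router.
PATTERN = re.compile(r"\b(search|weather|prove|solve)\b", re.I)


def classify(text: str) -> str:
    return "routed" if PATTERN.search(text) else "chat"


def per_call_microseconds(fn, arg, number: int = 100_000, repeat: int = 5) -> float:
    """Best-of-`repeat` mean latency per call, in microseconds."""
    best = min(timeit.repeat(lambda: fn(arg), number=number, repeat=repeat))
    return best / number * 1e6


if __name__ == "__main__":
    latency = per_call_microseconds(classify, "what is the weather in Paris?")
    print(f"{latency:.2f} µs per classification")
```

Taking the minimum over several repeats reduces noise from other processes. The lambda wrapper adds a small constant overhead to every call, so treat the result as an upper bound.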

All model files are published on HF: https://huggingface.co/bclermo/hermes-edge

Source: https://github.com/simpliibarrii-crypto/hermes-edge

The HF Space demo: https://huggingface.co/spaces/bclermo/hermes-edge

Would love feedback from the community — especially on routing rule coverage and multi-model loading strategies on iOS/Android.
