Hermes Edge — Intent Routing for On-Device LiteRT-LM Models

by bclermo - opened 1 day ago

Owner 1 day ago


I built an intent routing layer for on-device LiteRT-LM models. It classifies user input in ~6μs using only keyword and regex matching (zero model inference), then dispatches the request to the best-suited model and generation params for that intent:

Architecture:

Pre-router classifies into chat / reasoning / tools via 17 confidence-weighted regex rules
270M Qwen3 model stays hot (~180 MB RAM) for instant chat
Specialist models (DeepSeek-R1 1.5B, Gemma-4-E2B 2.5B) load on demand via LiteRT-LM fast init
Per-intent generation params: temperature, max_tokens, and MTP (multi-token prediction) switched on or off per task
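To make the pre-router idea concrete, here is a minimal sketch of a confidence-weighted regex classifier. The rules, weights, threshold, and the fall-back-to-chat behavior are all hypothetical choices for illustration; they are not the 17 rules in the repo, and the real scoring may work differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    intent: str
    pattern: re.Pattern
    weight: float  # confidence assigned when the pattern matches


# Hypothetical rules for illustration only (not the project's actual rule set).
RULES = [
    Rule("tools", re.compile(r"\b(search|look up|weather|set (a )?(timer|alarm))\b", re.I), 0.9),
    Rule("tools", re.compile(r"\b(open|call|send|play)\b", re.I), 0.6),
    Rule("reasoning", re.compile(r"\b(prove|derive|step[- ]by[- ]step|solve)\b", re.I), 0.9),
    Rule("reasoning", re.compile(r"\b(why|explain|compare)\b", re.I), 0.5),
    Rule("reasoning", re.compile(r"\d+\s*[-+*/^]\s*\d+"), 0.7),
]

DEFAULT_INTENT = "chat"  # assumption: anything unmatched goes to the hot chat model
MIN_CONFIDENCE = 0.5     # assumption: weaker matches also fall back to chat


def classify(text: str) -> tuple[str, float]:
    """Return (intent, confidence) using the strongest matching rule per intent."""
    scores: dict[str, float] = {}
    for rule in RULES:
        if rule.pattern.search(text):
            scores[rule.intent] = max(scores.get(rule.intent, 0.0), rule.weight)
    if not scores:
        return DEFAULT_INTENT, 0.0
    intent, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence < MIN_CONFIDENCE:
        return DEFAULT_INTENT, confidence
    return intent, confidence


if __name__ == "__main__":
    for query in ["hey, how are you?", "Prove that sqrt(2) is irrational", "What's the weather in Oslo?"]:
        print(query, "->", classify(query))
```

Precompiling the patterns once at import time is what keeps the per-query cost down to a handful of regex searches.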

Per-intent config:

Intent	Model	Temp	Max Tokens	MTP
Chat	270M Qwen3 (always hot)	0.7	128	✅ 2.2×
Reasoning	1.5B DeepSeek-R1-Distill	0.6	384	❌
Tools	2.5B Gemma-4-E2B	0.5	256	✅ 2.2×
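The table above maps naturally onto a small lookup structure. The sketch below encodes those values and pairs them with a pool that keeps the chat model resident and loads specialists on first use. The model identifier strings, `GenConfig`, `ModelPool`, and the injected `loader` callable are hypothetical names for illustration; this is not the LiteRT-LM API, and falling back to chat for unknown intents is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class GenConfig:
    model: str          # hypothetical identifier, not an actual file name
    temperature: float
    max_tokens: int
    mtp: bool           # multi-token prediction on/off
    always_hot: bool = False


# Values taken from the per-intent table; identifiers are placeholders.
INTENT_CONFIG = {
    "chat": GenConfig("qwen3-270m", 0.7, 128, True, always_hot=True),
    "reasoning": GenConfig("deepseek-r1-distill-1.5b", 0.6, 384, False),
    "tools": GenConfig("gemma-4-e2b", 0.5, 256, True),
}


def config_for(intent: str) -> GenConfig:
    """Look up generation params; unknown intents fall back to chat (assumption)."""
    return INTENT_CONFIG.get(intent, INTENT_CONFIG["chat"])


class ModelPool:
    """Keeps always-hot models loaded and loads the others on first request."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader  # hypothetical: wraps whatever actually loads a model
        self._loaded: dict[str, Any] = {}
        for cfg in INTENT_CONFIG.values():
            if cfg.always_hot:
                self._loaded[cfg.model] = loader(cfg.model)

    def get(self, intent: str) -> tuple[Any, GenConfig]:
        cfg = config_for(intent)
        if cfg.model not in self._loaded:
            self._loaded[cfg.model] = self._loader(cfg.model)
        return self._loaded[cfg.model], cfg
```

Note that this sketch never evicts a specialist once it is loaded; on a memory-constrained phone you would probably want to unload the previous specialist before loading the next one.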

Key results:

17/17 diverse test queries classified correctly
6.2μs per classification in Python; an estimated ~0.5μs if compiled
Web search without an API key, via DuckDuckGo Lite with User-Agent rotation
Dockerfile + CLI + HTTP server included
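For anyone who wants to check a per-classification latency figure on their own hardware, a measurement along these lines should work. This is a generic `timeit` harness, not the benchmark script from the repo; the stand-in `classify` function and the query list are hypothetical, and the numbers will vary by device and Python version.

```python
import re
import timeit

# Stand-in classifier for illustration; swap in the real router to measure it.
PATTERN = re.compile(r"\b(search|weather|prove|solve)\b", re.I)


def classify(text: str) -> str:
    return "routed" if PATTERN.search(text) else "chat"


def per_call_latency_us(fn, queries, repeats=5, loops=10_000):
    """Best-of-N average time per call, in microseconds."""

    def run():
        for q in queries:
            fn(q)

    # Taking the minimum over repeats reduces noise from other processes.
    best = min(timeit.repeat(run, repeat=repeats, number=loops))
    return best / (loops * len(queries)) * 1e6


if __name__ == "__main__":
    sample = ["hi there", "solve 3x + 2 = 11", "what's the weather tomorrow?"]
    print(f"{per_call_latency_us(classify, sample):.2f} us per classification")
```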

All model files are published on HF: https://huggingface.co/bclermo/hermes-edge

Source: https://github.com/simpliibarrii-crypto/hermes-edge

The HF Space demo: https://huggingface.co/spaces/bclermo/hermes-edge

Would love feedback from the community — especially on routing rule coverage and multi-model loading strategies on iOS/Android.
