How to use bclermo/hermes-edge with LiteRT-LM:
```bash
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM)
# and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter).
# For platform-specific integration guides, please refer to the official developer website:
# https://ai.google.dev/edge/litert-lm

# To try LiteRT-LM, the easiest way is to use our CLI tool.
# 1. Install the LiteRT-LM CLI tool:
pip install litert-lm

# 2. Download and run this model locally:
# See: https://ai.google.dev/edge/litert-lm/cli
litert-lm run \
  --from-huggingface-repo=bclermo/hermes-edge \
  model.litertlm \
  --prompt="Write me a poem"
```
Hermes Edge — Intent Routing for On-Device LiteRT-LM Models
#1
by bclermo - opened
Hermes Edge — Intent Routing for On-Device LLMs (LiteRT-LM)
I built an intent routing layer for on-device LiteRT-LM models. It classifies user input in ~6μs using only keyword and regex matching (zero model inference), then dispatches to the model and generation params configured for that intent:
Architecture:
- Pre-router classifies into `chat`/`reasoning`/`tools` via 17 confidence-weighted regex rules
- 270M Qwen3 model stays hot (~180 MB RAM) for instant chat
- Specialist models (DeepSeek-R1 1.5B, Gemma-4-E2B 2.5B) load on demand via LiteRT-LM fast init
- Per-intent params: temperature, max_tokens, and MTP (multi-token prediction) on/off are set per intent
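
To make the pre-router idea concrete, here is a minimal sketch of confidence-weighted regex classification. The rules, weights, and the `chat` fallback are hypothetical and chosen for illustration; they are not the 17 rules from the repo, and the names `Rule`, `RULES`, and `classify` are assumptions.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    intent: str
    pattern: re.Pattern
    weight: float


# Hypothetical rules for illustration only (not the project's actual rule set).
RULES = [
    Rule("tools", re.compile(r"\b(search|look up|weather|timer|alarm)\b", re.I), 0.9),
    Rule("tools", re.compile(r"\b(open|call|play)\b", re.I), 0.6),
    Rule("reasoning", re.compile(r"\b(prove|derive|step by step|why|solve)\b", re.I), 0.8),
    Rule("reasoning", re.compile(r"\d+\s*[-+*/^]\s*\d+"), 0.7),
    Rule("chat", re.compile(r"\b(hi|hello|hey|thanks)\b", re.I), 0.5),
]

DEFAULT_INTENT = "chat"  # assumed fallback when no rule fires


def classify(text: str) -> tuple[str, float]:
    """Sum the weights of all matching rules per intent and return the top intent."""
    scores: dict[str, float] = {}
    for rule in RULES:
        if rule.pattern.search(text):
            scores[rule.intent] = scores.get(rule.intent, 0.0) + rule.weight
    if not scores:
        return DEFAULT_INTENT, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    for query in ["hello there", "Solve 12 * 7 step by step", "search the web for LiteRT docs"]:
        print(query, "->", classify(query)[0])
```

Summing weights per intent lets several weak signals outvote one strong one; taking only the single highest-weight match would be an equally plausible design.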
Per-intent config:
| Intent | Model | Temp | Max Tokens | MTP (speedup) |
|---|---|---|---|---|
| Chat | 270M Qwen3 (always hot) | 0.7 | 128 | ✅ 2.2× |
| Reasoning | 1.5B DeepSeek-R1-Distill | 0.6 | 384 | ❌ |
| Tools | 2.5B Gemma-4-E2B | 0.5 | 256 | ✅ 2.2× |
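
The table maps naturally onto a small lookup structure, and the "always hot" vs. "load on demand" split onto a lazy model pool. The sketch below assumes a generic `loader` callable standing in for whatever LiteRT-LM engine setup the project uses; `IntentConfig`, `ModelPool`, `route`, and the model key strings are hypothetical names, while the numeric values are copied from the table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class IntentConfig:
    model: str
    temperature: float
    max_tokens: int
    use_mtp: bool


# Values from the per-intent table; the model keys are illustrative labels.
INTENT_CONFIG: Dict[str, IntentConfig] = {
    "chat": IntentConfig("qwen3-270m", 0.7, 128, True),
    "reasoning": IntentConfig("deepseek-r1-distill-1.5b", 0.6, 384, False),
    "tools": IntentConfig("gemma-4-e2b", 0.5, 256, True),
}


class ModelPool:
    """Keeps the chat model resident and loads specialists on first use."""

    def __init__(self, loader: Callable[[str], object], hot: str = "qwen3-270m"):
        self._loader = loader  # assumption: wraps the real engine initialisation
        self._models: Dict[str, object] = {hot: loader(hot)}

    def get(self, name: str) -> object:
        if name not in self._models:
            self._models[name] = self._loader(name)
        return self._models[name]


def route(intent: str, pool: ModelPool) -> Tuple[object, IntentConfig]:
    """Return the model handle and generation params for an intent (chat as fallback)."""
    cfg = INTENT_CONFIG.get(intent, INTENT_CONFIG["chat"])
    return pool.get(cfg.model), cfg
```

This sketch never unloads a specialist once it is loaded; on a memory-constrained phone an eviction policy would likely be needed.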
Key results:
- 17/17 classification accuracy on diverse test queries
- 6.2μs per classification (Python) — ~0.5μs if compiled
- Web search without an API key, via DuckDuckGo Lite with User-Agent (UA) rotation
- Dockerfile + CLI + HTTP server included
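
For anyone who wants to sanity-check a per-classification latency figure on their own machine, a rough way to measure it is with `timeit`. This is only a sketch: the stand-in classifier and its pattern are hypothetical, the helper name `mean_latency_us` is an assumption, and the number will depend on hardware, Python version, and the real rule set.

```python
import re
import timeit

# Stand-in classifier (hypothetical pattern), just to have something to time.
TOOLS = re.compile(r"\b(search|weather|timer)\b", re.I)


def classify(text: str) -> str:
    return "tools" if TOOLS.search(text) else "chat"


def mean_latency_us(fn, queries, repeats: int = 10_000) -> float:
    """Average wall-clock time per call in microseconds (includes loop overhead)."""
    total_s = timeit.timeit(lambda: [fn(q) for q in queries], number=repeats)
    return total_s / (repeats * len(queries)) * 1e6


queries = ["hello", "search for flights", "what's the weather"]
print(f"{mean_latency_us(classify, queries):.2f} µs per classification")
```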
All model files are published on HF: https://huggingface.co/bclermo/hermes-edge
Source: https://github.com/simpliibarrii-crypto/hermes-edge
The HF Space demo: https://huggingface.co/spaces/bclermo/hermes-edge
Would love feedback from the community — especially on routing rule coverage and multi-model loading strategies on iOS/Android.