Instructions to use Orionfold/spark-hermes-cost-router with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- HERMES
How to use Orionfold/spark-hermes-cost-router with HERMES:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
Spark Hermes Cost Router β local Spark $0 β OpenRouter cheap β frontier
A deterministic three-tier cost router for the NVIDIA DGX Spark (GB10, 128 GB unified): every prompt routes to the cheapest tier whose deterministic predicate clears, with the local Qwen3-30B-A3B MoE as the always-warm $0 floor and OpenRouter overflow tiers for the prompts local can't reliably answer.
What this harness is
When does local stop being enough? Measure first, then route.
A Spark holds one strong model warm at a time and pays no per-token cost for it. A frontier model on OpenRouter is the per-token-billed ceiling. The interesting decision is when to escalate β and the only honest answer is the measured leak rate, not the public-docs 60-80% cost-savings pitch. This router ships the predicates that decide, plus the snapshot prices that let you reproduce the dollar curve.
Good for:
- Route a Hermes agent prompt to the cheapest tier whose predicate clears (local Spark / OpenRouter cheap / OpenRouter frontier).
- Embed a deterministic, auditable cost router into a Hermes config (no LLM-classifier overhead).
- Reproduce the H6 leak-rate measurement on a custom workload by re-grading the same 12-prompt shape.
For: DGX Spark power users running a local-first agent harness who want to escalate to frontier only when local can't reliably answer β and to know what that fraction is.
Serving lanes
| Lane | Provider | Model | tok/s | Sustained (min) | $/M input | $/M output |
|---|---|---|---|---|---|---|
| Local Spark β Qwen3-30B-A3B MoE Q4_K_M | llama-server | Qwen3-30B-A3B-Q4_K_M.gguf | β | β | $0.00/M | $0.00/M |
| OpenRouter cheap-tier β gpt-4o-mini | openrouter | openai/gpt-4o-mini | β | β | $0.15/M | $0.60/M |
| OpenRouter frontier β claude-opus-4.1 | openrouter | anthropic/claude-opus-4.1 | β | β | $15.00/M | $75.00/M |
The dollar columns are the per-tier price snapshot at measurement time. The cost-routed strategy's per-task spend is a weighted average across tiers β see the article for the measured curve.
Configuration
~/.hermes/config.yaml (model block):
model:
provider: custom
base_url: "http://127.0.0.1:8080/v1"
default: Qwen3-30B-A3B-Q4_K_M.gguf
~/.hermes/.env:
HERMES_STREAM_READ_TIMEOUT=1800
OPENAI_API_KEY=local
OPENAI_BASE_URL=http://127.0.0.1:8080/v1
OPENROUTER_API_KEY=<your-openrouter-key>
router.yaml:
router:
kind: cost
tiers:
- name: simple
endpoint: "http://127.0.0.1:8080/v1"
model: Qwen3-30B-A3B-Q4_K_M.gguf
notes: "Local Spark lane via llama.cpp on :8080; the Step-2 pinned brain. $0 marginal."
- name: standard
endpoint: "https://openrouter.ai/api/v1"
model: openai/gpt-4o-mini
complexity_keywords:
- summarize
- compare
- analyze
min_input_tokens: 600
price_per_m_input_usd: 0.15
price_per_m_output_usd: 0.6
api_key_env: OPENROUTER_API_KEY
notes: OpenRouter cheap-tier value pick (128K ctx).
- name: complex
endpoint: "https://openrouter.ai/api/v1"
model: anthropic/claude-opus-4.1
complexity_keywords:
- prove
- derive
- multi-step
- step through
min_input_tokens: 3000
price_per_m_input_usd: 15.0
price_per_m_output_usd: 75.0
api_key_env: OPENROUTER_API_KEY
notes: OpenRouter frontier-tier hard-reasoning pick (200K ctx).
Doctor checklist
- Local brain warm on
:8080(always-on) -
OPENROUTER_API_KEYset in~/.hermes/.env - Cost-routed strategy pass-rate β₯ 92% on the 12-prompt bench
- Leak rate β€ 33.3% (where local failed but frontier passed)
- Frontier-only spend β€ $1.06 per 36 calls (the ceiling)
- Cost-routed spend β $0.79 per 36 calls (the production target)
Methods
Measured and documented in Cost-Routing the Hermes Harness β When Local Stops Being Enough.
Known drift
- Suite size β 12 prompts Γ N=3 attempts per strategy (108 calls per full run). Not a large-N guarantee; production workloads will exhibit their own leak rates.
- OpenRouter snapshot prices β Captured 2026-05-28T14:32:06.836115+00:00. openai/gpt-4o-mini = $0.15/M input + $0.6/M output; anthropic/claude-opus-4.1 = $15.0/M input + $75.0/M output. Prices change; re-snapshot before reproduction.
- Leak rate β 33.3% measured leak rate. Tuned to this 12-prompt suite's synthetic-but-graded difficulty distribution.
- Token threshold β complex-tier
min_input_tokens=3000was tuned to this suite. A workload with a different long-to-short ratio should re-tune this single integer.
Published by Orionfold LLC Β· orionfold.com Β· Methods documented at ainative.business/field-notes.