Spark Hermes Cost Router β€” local Spark $0 β†’ OpenRouter cheap β†’ frontier

A deterministic three-tier cost router for the NVIDIA DGX Spark (GB10, 128 GB unified): every prompt routes to the cheapest tier whose deterministic predicate clears, with the local Qwen3-30B-A3B MoE as the always-warm $0 floor and OpenRouter overflow tiers for the prompts local can't reliably answer.

What this harness is

When does local stop being enough? Measure first, then route.

A Spark holds one strong model warm at a time and pays no per-token cost for it. A frontier model on OpenRouter is the per-token-billed ceiling. The interesting decision is when to escalate β€” and the only honest answer is the measured leak rate, not the public-docs 60-80% cost-savings pitch. This router ships the predicates that decide, plus the snapshot prices that let you reproduce the dollar curve.

Good for:

  • Route a Hermes agent prompt to the cheapest tier whose predicate clears (local Spark / OpenRouter cheap / OpenRouter frontier).
  • Embed a deterministic, auditable cost router into a Hermes config (no LLM-classifier overhead).
  • Reproduce the H6 leak-rate measurement on a custom workload by re-grading the same 12-prompt shape.

For: DGX Spark power users running a local-first agent harness who want to escalate to frontier only when local can't reliably answer β€” and to know what that fraction is.

Serving lanes

Lane Provider Model tok/s Sustained (min) $/M input $/M output
Local Spark β€” Qwen3-30B-A3B MoE Q4_K_M llama-server Qwen3-30B-A3B-Q4_K_M.gguf β€” β€” $0.00/M $0.00/M
OpenRouter cheap-tier β€” gpt-4o-mini openrouter openai/gpt-4o-mini β€” β€” $0.15/M $0.60/M
OpenRouter frontier β€” claude-opus-4.1 openrouter anthropic/claude-opus-4.1 β€” β€” $15.00/M $75.00/M

The dollar columns are the per-tier price snapshot at measurement time. The cost-routed strategy's per-task spend is a weighted average across tiers β€” see the article for the measured curve.

Configuration

~/.hermes/config.yaml (model block):

model:
  provider: custom
  base_url: "http://127.0.0.1:8080/v1"
  default: Qwen3-30B-A3B-Q4_K_M.gguf

~/.hermes/.env:

HERMES_STREAM_READ_TIMEOUT=1800
OPENAI_API_KEY=local
OPENAI_BASE_URL=http://127.0.0.1:8080/v1
OPENROUTER_API_KEY=<your-openrouter-key>

router.yaml:

router:
  kind: cost
  tiers:
    - name: simple
      endpoint: "http://127.0.0.1:8080/v1"
      model: Qwen3-30B-A3B-Q4_K_M.gguf
      notes: "Local Spark lane via llama.cpp on :8080; the Step-2 pinned brain. $0 marginal."
    - name: standard
      endpoint: "https://openrouter.ai/api/v1"
      model: openai/gpt-4o-mini
      complexity_keywords:
        - summarize
        - compare
        - analyze
      min_input_tokens: 600
      price_per_m_input_usd: 0.15
      price_per_m_output_usd: 0.6
      api_key_env: OPENROUTER_API_KEY
      notes: OpenRouter cheap-tier value pick (128K ctx).
    - name: complex
      endpoint: "https://openrouter.ai/api/v1"
      model: anthropic/claude-opus-4.1
      complexity_keywords:
        - prove
        - derive
        - multi-step
        - step through
      min_input_tokens: 3000
      price_per_m_input_usd: 15.0
      price_per_m_output_usd: 75.0
      api_key_env: OPENROUTER_API_KEY
      notes: OpenRouter frontier-tier hard-reasoning pick (200K ctx).

Doctor checklist

  • Local brain warm on :8080 (always-on)
  • OPENROUTER_API_KEY set in ~/.hermes/.env
  • Cost-routed strategy pass-rate β‰₯ 92% on the 12-prompt bench
  • Leak rate ≀ 33.3% (where local failed but frontier passed)
  • Frontier-only spend ≀ $1.06 per 36 calls (the ceiling)
  • Cost-routed spend β‰ˆ $0.79 per 36 calls (the production target)

Methods

Measured and documented in Cost-Routing the Hermes Harness β€” When Local Stops Being Enough.

Known drift

  • Suite size β€” 12 prompts Γ— N=3 attempts per strategy (108 calls per full run). Not a large-N guarantee; production workloads will exhibit their own leak rates.
  • OpenRouter snapshot prices β€” Captured 2026-05-28T14:32:06.836115+00:00. openai/gpt-4o-mini = $0.15/M input + $0.6/M output; anthropic/claude-opus-4.1 = $15.0/M input + $75.0/M output. Prices change; re-snapshot before reproduction.
  • Leak rate β€” 33.3% measured leak rate. Tuned to this 12-prompt suite's synthetic-but-graded difficulty distribution.
  • Token threshold β€” complex-tier min_input_tokens=3000 was tuned to this suite. A workload with a different long-to-short ratio should re-tune this single integer.

Published by Orionfold LLC Β· orionfold.com Β· Methods documented at ainative.business/field-notes.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support