Spark Hermes Vertical Router — 5 specialists + 1 default brain

A deterministic keyword-classifier router for the NVIDIA DGX Spark (GB10, 128 GB unified memory): dispatch each Hermes prompt to one of five Orionfold vertical GGUFs — patent / legal / finance / cyber / medical — served one-at-a-time, with a strong default brain (Qwen3-30B-A3B MoE Q4_K_M) catching everything else.

What this harness is

One always-on brain, five specialists, zero LLM-classifier overhead.

A Spark holds one strong model warm at a time. The pinned MoE is excellent at general agentic work but is not your domain expert. The five Orionfold vertical GGUFs are domain experts but compete for the same 128 GB envelope. A router picks per prompt: keyword-matched prompts get the right specialist (warm on demand, ~5–10 s), everything else stays with the brain.

Good for:

Route a Hermes agent prompt to a vertical specialist by keyword
Reproduce the 30-prompt router-accuracy + per-vertical quality bench
Embed a deterministic, auditable router into a Hermes config

For: DGX Spark power users running a local, no-API-key agent harness across multiple domains.

Serving lanes

Lane	Provider	Model	tok/s	Sustained (min)	Format-error	Clean-run
Patent prosecution	llama-server	Orionfold/patent-strategist-v3-nemo-GGUF Q5_K_M	—	—	—	80.0%
Legal reasoning	llama-server	Orionfold/Saul-7B-Instruct-v1-GGUF Q5_K_M	—	—	—	80.0%
Financial analysis	llama-server	Orionfold/finance-chat-GGUF Q5_K_M	—	—	—	100.0%
Defensive cyber	llama-server	Orionfold/SecurityLLM-GGUF Q5_K_M	—	—	—	100.0%
Clinical reasoning	llama-server	Orionfold/II-Medical-8B-GGUF Q5_K_M	—	—	—	100.0%
Default brain (MoE) ⭐	llama-server	Qwen/Qwen3-30B-A3B-Q4_K_M	—	—	—	80.0%

Tool-call format-error rate is the agent-critical number: a lane that can't emit well-formed tool calls is disqualified regardless of speed.

Configuration

~/.hermes/config.yaml (model block):

model:
  provider: custom
  base_url: "http://127.0.0.1:8080/v1"
  default: Qwen3-30B-A3B-Q4_K_M.gguf

~/.hermes/.env:

HERMES_STREAM_READ_TIMEOUT=1800
OPENAI_API_KEY=local
OPENAI_BASE_URL=http://127.0.0.1:8080/v1

router.yaml:

router:
  kind: vertical
  default:
    name: brain
    hf_repo: Qwen/Qwen3-30B-A3B-Q4_K_M
    variant: Q4_K_M
    description: "Qwen3-30B-A3B MoE Q4_K_M — the Step-2 pinned default brain (8/8 quality, 83.5 tok/s, 31.8 GB; the always-warm fallback that picks up any prompt no vertical claims). Always served on 127.0.0.1:8080 via llama.cpp."
  routes:
    - name: patent
      hf_repo: Orionfold/patent-strategist-v3-nemo-GGUF
      variant: Q5_K_M
      keywords:
        - patent
        - claim
        - prior art
        - uspto
        - mpep
        - prosecution
        - §102
        - §103
        - provisional
        - patentability
        - patentable
        - infring
      description: "Offline patent-prosecution reasoning (claims, prosecution, MPEP)."
    - name: legal
      hf_repo: Orionfold/Saul-7B-Instruct-v1-GGUF
      variant: Q5_K_M
      keywords:
        - lawsuit
        - contract
        - tort
        - statute
        - plaintiff
        - defendant
        - breach
        - negligence
        - estoppel
        - doctrine
        - res ipsa
        - limitations
        - repose
        - promissory
        - sue
      description: "Legal reasoning over contracts, torts, and statutes (Saul-7B-Instruct)."
    - name: finance
      hf_repo: Orionfold/finance-chat-GGUF
      variant: Q5_K_M
      keywords:
        - portfolio
        - 10-k
        - ebitda
        - dividend
        - fed
        - yield curve
        - sharpe
        - moat
        - valuation
        - earnings
        - p/e
        - stock
        - shareholder
        - balance sheet
        - cash flow
      description: "Financial analysis, 10-K reading, valuation primitives."
    - name: cyber
      hf_repo: Orionfold/SecurityLLM-GGUF
      variant: Q5_K_M
      keywords:
        - cve
        - exploit
        - malware
        - rce
        - owasp
        - siem
        - vulnerab
        - phishing
        - ransomware
        - lpe
        - privilege escalation
        - infection
        - intrusion
        - ddos
      description: "Defensive cybersecurity: CVE triage, OWASP, incident response."
    - name: medical
      hf_repo: Orionfold/II-Medical-8B-GGUF
      variant: Q5_K_M
      keywords:
        - symptom
        - diagnosis
        - icd-10
        - pathology
        - dose
        - mg/kg
        - amoxicillin
        - d-dimer
        - differential
        - infarct
        - ed patient
        - embolism
        - thromb
        - presents with
        - mg/dl
        - myocardial
      description: "Clinical reasoning: differentials, dosing, ICD-10, ED workups."

Doctor checklist

Router accuracy ≥ 100% on the 30-prompt bench
Overall vertical pass-rate ≥ 90%
Default brain warm on :8080 (always-on)
Vertical port :8090 free between verticals
All 5 vertical GGUFs cached at /home/nvidia/data/quants/

Methods

Measured and documented in The Hermes vertical router on a DGX Spark.

Known drift

Router-accuracy sample size — router classification measured over 30 prompts (5 per vertical + 5 default-brain) — not a large-N guarantee.
Keyword-set tuning — vertical keywords were tuned against the bench prompts; out-of-distribution prompts may misroute.
Per-vertical pass-rate basis — 5 prompts per vertical; deterministic substring/regex rubrics — open-ended answers (haiku, drafted claims) marked vibe.
One-at-a-time vertical serving — verticals are served on demand on :8090 (~5–10s warm); the default brain stays warm on :8080 (always-on, ~32 GB).

Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support