Spark Hermes Profile — serving-lane bakeoff
A measured Hermes Agent serving-lane profile for the NVIDIA DGX Spark (GB10, 128 GB unified memory): NIM, vLLM, and llama.cpp lanes benchmarked for throughput, sustained load, and tool-call reliability.
What this harness is
Which local lane should drive your always-on Spark agent?
Hermes runs the same agent loop over any OpenAI-compatible backend, but the lanes are not interchangeable: a fast lane that can't emit well-formed tool calls is useless to an agent. This profile measures the trade-off.
Good for:
- Picking a local serving lane for a Hermes agent on the Spark
- Sizing a MoE vs dense model against the 128 GB unified-memory envelope
- Reproducing the tool-call reliability, tok/s, and sustained-load numbers
For: DGX Spark power users running a local, no-API-key agent harness.
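For the sizing question above, a rough weight-only estimate is often enough for a first pass. The sketch below is an assumption-laden back-of-envelope calculation, not part of the measured profile: `weight_footprint_gb` is a hypothetical helper, the parameter counts are the nominal figures from the model names, and KV cache, activations, and runtime overhead are ignored. Note that a MoE model must keep all experts resident, so its footprint follows total parameters, not active ones.

```python
UNIFIED_MEMORY_GB = 128  # DGX Spark envelope stated in this profile


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only size in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal parameter counts taken from the model names (an assumption).
    for name, params_b in [("Qwen3-30B-A3B FP8", 30), ("Qwen3-32B FP8", 32)]:
        gb = weight_footprint_gb(params_b, 8)  # FP8 = 8 bits per weight
        share = gb / UNIFIED_MEMORY_GB
        print(f"{name}: ~{gb:.0f} GB of weights, {share:.0%} of the envelope")
```

Quantized GGUF lanes can be estimated the same way by passing the effective bits per weight of the chosen quantization.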
Serving lanes
| Lane | Provider | Model | tok/s | Sustained (min) | Format-error | Clean-run |
|---|---|---|---|---|---|---|
| NIM · Nemotron-Nano-9B-v2 | nim | nvidia/nemotron-nano-9b-v2 | 27.7 | — | 0.0% | 100.0% |
| llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M) ⭐ | llama-server | Qwen3-30B-A3B-Q4_K_M | 88 | 3 | 0.0% | 100.0% |
| llama.cpp · Qwen3-32B (dense, Q4_K_M) | llama-server | Qwen3-32B-Q4_K_M | 10.2 | 3.4 | 0.0% | 100.0% |
| vLLM · Qwen3-30B-A3B (MoE, FP8) | vllm | Qwen/Qwen3-30B-A3B-FP8 | 55.9 | 3.1 | 0.0% | 100.0% |
| vLLM · Qwen3-32B (dense, FP8) | vllm | Qwen/Qwen3-32B-FP8 | 6.6 | 3.2 | 0.0% | 100.0% |
Tool-call format-error rate is the agent-critical number: a lane that can't emit well-formed tool calls is disqualified regardless of speed.
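The selection rule above (disqualify on format errors first, then compare speed) can be sketched in a few lines. This is an illustration only: the `Lane` class and `rank_lanes` function are hypothetical names, and the figures are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    name: str
    tok_s: float
    format_error_pct: float


# Figures copied from the serving-lanes table.
LANES = [
    Lane("NIM · Nemotron-Nano-9B-v2", 27.7, 0.0),
    Lane("llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M)", 88.0, 0.0),
    Lane("llama.cpp · Qwen3-32B (dense, Q4_K_M)", 10.2, 0.0),
    Lane("vLLM · Qwen3-30B-A3B (MoE, FP8)", 55.9, 0.0),
    Lane("vLLM · Qwen3-32B (dense, FP8)", 6.6, 0.0),
]


def rank_lanes(lanes, max_format_error_pct=0.0):
    """Drop lanes above the format-error threshold, then sort by tok/s (fastest first)."""
    eligible = [lane for lane in lanes if lane.format_error_pct <= max_format_error_pct]
    return sorted(eligible, key=lambda lane: lane.tok_s, reverse=True)


ranked = rank_lanes(LANES)
for lane in ranked:
    print(f"{lane.tok_s:6.1f} tok/s  {lane.name}")
```

Because every measured lane has a 0.0% format-error rate, the filter removes nothing here and the ranking reduces to throughput.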
Configuration
~/.hermes/config.yaml (model block):
model:
  provider: custom
  base_url: "http://127.0.0.1:8000/v1"
  default: nvidia/nemotron-nano-9b-v2
~/.hermes/.env:
HERMES_STREAM_READ_TIMEOUT=1800
OPENAI_API_KEY=local
OPENAI_BASE_URL=http://127.0.0.1:8000/v1
Doctor checklist
- hermes doctor — all core-section checks green
- Serving lane warm and answering /v1/models
- First hermes -z agent turn runs locally, no API key
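The "warm and answering /v1/models" check can also be scripted. The sketch below is a minimal example using only the standard library. It assumes the lane returns the usual OpenAI-compatible list shape (`{"data": [{"id": ...}]}`) and needs no auth header. `served_model_ids` and `lane_is_serving` are hypothetical helper names, not part of Hermes.

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list:
    """Extract model ids from an OpenAI-compatible /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def lane_is_serving(base_url: str, expected_model: str, timeout: float = 5.0) -> bool:
    """Return True if the lane answers /models and lists the expected model."""
    url = f"{base_url.rstrip('/')}/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return expected_model in served_model_ids(payload)


if __name__ == "__main__":
    # Base URL and model name as given in the configuration section.
    ok = lane_is_serving("http://127.0.0.1:8000/v1", "nvidia/nemotron-nano-9b-v2")
    print("lane ready" if ok else "lane not ready")
```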
Methods
Measured and documented in "The Hermes serving lane on a DGX Spark".
Known drift
- Tool-call reliability sample size — format-error rate measured over 8 agentic tasks per lane; not a large-N guarantee.
- Qwen3 context vs Hermes minimum — Qwen3 lanes serve at native 40,960 tokens; Hermes's 64K floor is bypassed via model.context_length + auxiliary.compression.context_length overrides.
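To put the sample-size caveat in numbers: observing zero format errors in 8 tasks is compatible with a fairly high true error rate. The sketch below computes the exact one-sided upper confidence bound for zero observed failures, assuming each task counts as one independent pass/fail trial. That is a simplifying assumption about how the rate was measured, and `zero_failure_upper_bound` is a hypothetical helper.

```python
def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the failure rate after 0 failures in n_trials.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


bound = zero_failure_upper_bound(8)
print(f"95% upper bound on format-error rate after 0/8 failures: {bound:.1%}")
```

With 8 trials the bound is roughly 31%, so the 0.0% column is best read as "no failures seen" and not as a measured rate near zero.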
Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.