Spark Hermes Profile — serving-lane bakeoff

A measured Hermes Agent serving-lane profile for the NVIDIA DGX Spark (GB10, 128 GB unified memory): NIM, vLLM, and llama.cpp lanes benchmarked for throughput, sustained load, and tool-call reliability.

What this harness is

Which local lane should drive your always-on Spark agent?

Hermes runs the same agent loop over any OpenAI-compatible backend, but the lanes are not interchangeable: a fast lane that can't emit well-formed tool calls is useless to an agent. This profile measures the trade-off.

Good for:

Pick a local serving lane for a Hermes agent on the Spark
Size a MoE vs dense model against the 128 GB unified-memory envelope
Reproduce the tool-call reliability, tok/s, and sustained-load numbers

For: DGX Spark power users running a local, no-API-key agent harness.
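For the sizing question above, a back-of-envelope estimate of weight memory is often enough to see how much of the 128 GB envelope a model claims. The sketch below is an illustration, not part of the measured profile: the parameter counts are the nominal figures from the model names, the roughly 4.8 bits per weight used for Q4_K_M is an assumed average, and KV cache, activations, and runtime overhead are ignored.

```python
GB = 1e9  # decimal gigabytes
ENVELOPE_GB = 128  # unified memory on the Spark, from the profile header


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint in GB."""
    return n_params * bits_per_weight / 8 / GB


# Nominal parameter counts read off the model names (assumption);
# 4.8 bits/weight for Q4_K_M is an assumed average, not a measured value.
candidates = {
    "Qwen3-30B-A3B FP8": (30e9, 8.0),
    "Qwen3-32B FP8": (32e9, 8.0),
    "Qwen3-30B-A3B Q4_K_M": (30e9, 4.8),
    "Qwen3-32B Q4_K_M": (32e9, 4.8),
}

for name, (n_params, bpw) in candidates.items():
    gb = weight_footprint_gb(n_params, bpw)
    print(f"{name}: ~{gb:.0f} GB weights, {gb / ENVELOPE_GB:.0%} of the envelope")
```

Under these assumptions the MoE and dense models occupy a similar share of memory at the same quantization. The MoE activates only about 3B parameters per token (the "A3B" in its name), so it does less work per token despite a similar footprint.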

Serving lanes

Lane	Provider	Model	Throughput (tok/s)	Sustained load (min)	Tool-call format-error rate	Clean-run rate
NIM · Nemotron-Nano-9B-v2	nim	nvidia/nemotron-nano-9b-v2	27.7	—	0.0%	100.0%
llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M) ⭐	llama-server	Qwen3-30B-A3B-Q4_K_M	88	3	0.0%	100.0%
llama.cpp · Qwen3-32B (dense, Q4_K_M)	llama-server	Qwen3-32B-Q4_K_M	10.2	3.4	0.0%	100.0%
vLLM · Qwen3-30B-A3B (MoE, FP8)	vllm	Qwen/Qwen3-30B-A3B-FP8	55.9	3.1	0.0%	100.0%
vLLM · Qwen3-32B (dense, FP8)	vllm	Qwen/Qwen3-32B-FP8	6.6	3.2	0.0%	100.0%

Tool-call format-error rate is the agent-critical number: a lane that can't emit well-formed tool calls is disqualified regardless of speed.
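To make "well-formed" concrete, here is a minimal sketch of one way a format-error rate could be computed. It assumes OpenAI-style tool calls (a `function` object with a `name` and a JSON-encoded `arguments` string). The exact criterion used for the measured numbers is not stated in this profile, so the helper names `is_well_formed` and `format_error_rate` and the checks themselves are hypothetical.

```python
import json


def is_well_formed(tool_call: dict, known_tools: set[str]) -> bool:
    """Return True if the call names a known tool and carries JSON-object arguments."""
    fn = tool_call.get("function")
    if not isinstance(fn, dict):
        return False
    name = fn.get("name")
    if not isinstance(name, str) or name not in known_tools:
        return False
    try:
        args = json.loads(fn.get("arguments", ""))
    except (TypeError, json.JSONDecodeError):
        return False
    return isinstance(args, dict)


def format_error_rate(tool_calls: list[dict], known_tools: set[str]) -> float:
    """Fraction of tool calls that fail the well-formedness check."""
    if not tool_calls:
        return 0.0
    bad = sum(not is_well_formed(call, known_tools) for call in tool_calls)
    return bad / len(tool_calls)


# Hypothetical sample: one valid call, one with arguments that are not valid JSON.
sample_calls = [
    {"function": {"name": "read_file", "arguments": '{"path": "notes.md"}'}},
    {"function": {"name": "read_file", "arguments": "{path: notes.md}"}},
]
print(format_error_rate(sample_calls, {"read_file"}))
```

A stricter check could also validate the arguments against each tool's JSON schema; this sketch stops at "parses as a JSON object".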

Configuration

~/.hermes/config.yaml (model block):

model:
  provider: custom
  base_url: "http://127.0.0.1:8000/v1"
  default: nvidia/nemotron-nano-9b-v2

~/.hermes/.env:

HERMES_STREAM_READ_TIMEOUT=1800
OPENAI_API_KEY=local
OPENAI_BASE_URL=http://127.0.0.1:8000/v1

Doctor checklist

hermes doctor — all core-section checks green
Serving lane warm and answering /v1/models
First hermes -z agent turn runs locally, no API key
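The second checklist item can be scripted. This sketch assumes the lane exposes the standard OpenAI-compatible `GET /v1/models` response (a `data` list of objects with an `id`) at the `base_url` from the configuration above. The function names are hypothetical and this is not part of `hermes doctor`.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", []) if isinstance(m, dict) and "id" in m]


def lane_serves(base_url: str, expected_model: str, timeout: float = 5.0) -> bool:
    """Return True if the lane answers /v1/models and lists the expected model."""
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return expected_model in served_model_ids(payload)
```

With the configuration shown earlier, the call would be `lane_serves("http://127.0.0.1:8000/v1", "nvidia/nemotron-nano-9b-v2")`. A `False` result means the lane is down, still loading, or serving a different model id.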

Methods

Measured and documented in "The Hermes serving lane on a DGX Spark."

Known drift

Tool-call reliability sample size: the format-error rate was measured over 8 agentic tasks per lane, so a 0.0% result is a small-sample observation, not a large-N guarantee.
Qwen3 context vs. Hermes minimum: the Qwen3 lanes serve at their native 40,960-token context, which is below Hermes's 64K floor; the floor is bypassed by overriding model.context_length and auxiliary.compression.context_length.
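As an illustration of the two overrides named above, the sketch below expands the dotted keys into the nested mapping they would presumably occupy in ~/.hermes/config.yaml. Both the nesting and the choice of 40,960 as the value for each key are assumptions; the profile names the keys but does not state the values used.

```python
import json

QWEN3_NATIVE_CONTEXT = 40_960  # native context of the Qwen3 lanes, per the note above


def set_dotted(cfg: dict, dotted_key: str, value) -> dict:
    """Set cfg['a']['b']['c'] = value for a key written as 'a.b.c'."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return cfg


overrides: dict = {}
for key in ("model.context_length", "auxiliary.compression.context_length"):
    # Assumed value: pin both limits to the lane's native context.
    set_dotted(overrides, key, QWEN3_NATIVE_CONTEXT)

print(json.dumps(overrides, indent=2))
```

The printed structure would be merged into the existing config by hand; this sketch does not read or write the config file.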

Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.
