Spark Hermes Profile — serving-lane bakeoff

A measured Hermes Agent serving-lane profile for the NVIDIA DGX Spark (GB10, 128 GB unified memory): NIM, vLLM, and llama.cpp lanes benchmarked for throughput, sustained load, and tool-call reliability.

What this harness is

Which local lane should drive your always-on Spark agent?

Hermes runs the same agent loop over any OpenAI-compatible backend, but the lanes are not interchangeable: a fast lane that can't emit well-formed tool calls is useless to an agent. This profile measures the trade-off.

Good for:

  • Pick a local serving lane for a Hermes agent on the Spark
  • Size a MoE vs dense model against the 128 GB unified-memory envelope
  • Reproduce the tool-call-reliability + tok/s + sustained-load numbers
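For the sizing use case listed above, a back-of-envelope weight-footprint estimate might look like the sketch below. It is a rough illustration, not the profile's own method: the parameter counts are the nominal figures from the model names, the 4.8 bits-per-weight value for Q4_K_M is an assumed average, and KV cache, activations, and runtime overhead are ignored. The helper name `weight_footprint_gb` is hypothetical.

```python
UNIFIED_MEMORY_GB = 128  # DGX Spark unified-memory envelope, as stated in the profile


def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes): params * bits / 8."""
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billions and bits_per_weight must be positive")
    return params_billions * bits_per_weight / 8.0


# Nominal sizes from the model names; bits-per-weight values are assumptions.
candidates = {
    "Qwen3-30B-A3B FP8": (30.0, 8.0),
    "Qwen3-32B FP8": (32.0, 8.0),
    "Qwen3-30B-A3B Q4_K_M": (30.0, 4.8),  # assumed average for Q4_K_M
    "Qwen3-32B Q4_K_M": (32.0, 4.8),      # assumed average for Q4_K_M
}

for name, (params_b, bits) in candidates.items():
    gb = weight_footprint_gb(params_b, bits)
    share = gb / UNIFIED_MEMORY_GB
    print(f"{name}: ~{gb:.1f} GB of weights ({share:.0%} of the envelope)")
```

Note that a MoE model is sized by its total parameter count, not its active parameters per token, so under this estimate the MoE and dense variants occupy a similar share of memory.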

For: DGX Spark power users running a local, no-API-key agent harness.

Serving lanes

| Lane | Provider | Model | tok/s | Sustained (min) | Format-error | Clean-run |
| --- | --- | --- | --- | --- | --- | --- |
| NIM · Nemotron-Nano-9B-v2 | nim | nvidia/nemotron-nano-9b-v2 | 27.7 | | 0.0% | 100.0% |
| llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M) ⭐ | llama-server | Qwen3-30B-A3B-Q4_K_M | 88 | 3 | 0.0% | 100.0% |
| llama.cpp · Qwen3-32B (dense, Q4_K_M) | llama-server | Qwen3-32B-Q4_K_M | 10.2 | 3.4 | 0.0% | 100.0% |
| vLLM · Qwen3-30B-A3B (MoE, FP8) | vllm | Qwen/Qwen3-30B-A3B-FP8 | 55.9 | 3.1 | 0.0% | 100.0% |
| vLLM · Qwen3-32B (dense, FP8) | vllm | Qwen/Qwen3-32B-FP8 | 6.6 | 3.2 | 0.0% | 100.0% |

Tool-call format-error rate is the agent-critical number: a lane that can't emit well-formed tool calls is disqualified regardless of speed.
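As an illustration of what a format-error check could look like, the sketch below validates tool calls in the shape used by OpenAI-compatible chat completions (a `function` object with a `name` and a JSON-string `arguments` field) and computes an error rate over a batch. The definition of "well-formed" used here is an assumption, not the profile's documented criterion, and both function names are hypothetical.

```python
import json


def is_well_formed_tool_call(call) -> bool:
    """True if `call` has a non-empty function name and JSON-object arguments."""
    if not isinstance(call, dict):
        return False
    fn = call.get("function")
    if not isinstance(fn, dict):
        return False
    name = fn.get("name")
    if not isinstance(name, str) or not name:
        return False
    args = fn.get("arguments")
    if not isinstance(args, str):
        return False
    try:
        parsed = json.loads(args)
    except json.JSONDecodeError:
        return False
    # Assumption: arguments must decode to a JSON object, not a bare value.
    return isinstance(parsed, dict)


def format_error_rate(calls) -> float:
    """Fraction of tool calls that fail the well-formedness check."""
    calls = list(calls)
    if not calls:
        return 0.0
    bad = sum(1 for c in calls if not is_well_formed_tool_call(c))
    return bad / len(calls)


sample = [
    {"function": {"name": "read_file", "arguments": "{\"path\": \"notes.txt\"}"}},
    {"function": {"name": "read_file", "arguments": "{path: notes.txt"}},  # broken JSON
]
print(format_error_rate(sample))  # 0.5
```

A stricter check would also validate the decoded arguments against each tool's declared JSON schema; this sketch stops at syntactic validity.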

Configuration

~/.hermes/config.yaml (model block):

model:
  provider: custom
  base_url: "http://127.0.0.1:8000/v1"
  default: nvidia/nemotron-nano-9b-v2

~/.hermes/.env:

HERMES_STREAM_READ_TIMEOUT=1800
OPENAI_API_KEY=local
OPENAI_BASE_URL=http://127.0.0.1:8000/v1

Doctor checklist

  • hermes doctor — all core-section checks green
  • Serving lane warm and answering /v1/models
  • First hermes -z agent turn runs locally, no API key
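The second checklist item can be scripted. The sketch below is one possible check, assuming the lane exposes the standard OpenAI-compatible `GET /v1/models` response (a `data` list of objects with an `id`). It reads `OPENAI_BASE_URL` and `OPENAI_API_KEY` from the environment with the values shown in the configuration above as fallbacks; the function names are hypothetical and this is not part of `hermes doctor`.

```python
import json
import os
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [
        m["id"]
        for m in payload.get("data", [])
        if isinstance(m, dict) and isinstance(m.get("id"), str)
    ]


def lane_is_serving(expected_model: str, base_url=None, timeout: float = 5.0) -> bool:
    """True if the lane answers /v1/models and lists `expected_model`."""
    base = base_url or os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:8000/v1")
    api_key = os.environ.get("OPENAI_API_KEY", "local")
    request = urllib.request.Request(
        base.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and expected_model in model_ids(payload)


if __name__ == "__main__":
    ok = lane_is_serving("nvidia/nemotron-nano-9b-v2")
    print("lane warm" if ok else "lane not answering /v1/models")
```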

Methods

Measured and documented in "The Hermes serving lane on a DGX Spark."

Known drift

  • Tool-call reliability sample size — format-error rate measured over 8 agentic tasks per lane; not a large-N guarantee.
  • Qwen3 context vs Hermes minimum — the Qwen3 lanes serve at their native 40,960-token context, which is below Hermes's 64K minimum; the floor is bypassed by overriding both model.context_length and auxiliary.compression.context_length.
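To put the 8-task sample size in perspective, the sketch below computes the exact one-sided upper confidence bound on a failure rate when zero failures are observed. It assumes each task counts as one independent pass/fail trial, which simplifies how the profile actually scores format errors; the function name is hypothetical.

```python
def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate after n_trials with zero failures.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


bound = zero_failure_upper_bound(8)
print(f"95% upper bound after 0/8 failures: {bound:.1%}")  # 31.2%
```

Under these assumptions, a 0.0% observed rate over 8 tasks is still consistent with a true per-task error rate of up to roughly 31%, which is why the figure should be read as a smoke test rather than a guarantee.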

Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.
