NVFP4-GB10 production feedback + reproducible-bench setup: recipe?

#1
by nooneknows1 - opened

Hi Michael,

Quick context: I'm running your Qwen3-Coder-Next-NVFP4-GB10 in production on a DGX Spark via LocalAI + vLLM 0.23.0 with FlashInfer-Cutlass kernels. Real workload is a Hermes-CLI coding agent doing tool-call sessions, hitting ~62 tok/s steady-state. Solid build, thanks for the work.

I'm currently building a small reproducible benchmark suite (TTFT, throughput, HumanEval-pass-rate, tool-call compliance) that compares NVFP4 builds against each other under realistic streaming-with-tools workloads. The plan is to feed the results back to anyone whose build I include. As context for the benchmark itself: I just landed a vLLM streaming PR for progressive emission with active tool parsers (mudler/LocalAI#10351, with E2E numbers against your build).
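For concreteness, here is a minimal sketch of how the TTFT and throughput numbers could be derived from per-token arrival timestamps of a streamed response. The names `StreamMetrics` and `stream_metrics` are hypothetical and not taken from any existing harness, and the sketch assumes one timestamp per received token; real harnesses may count chunks or re-tokenize the output instead.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float            # time to first token, in seconds
    decode_tok_per_s: float  # tokens per second after the first token


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Compute TTFT and decode throughput for one streamed response.

    request_start: timestamp (seconds) when the request was sent.
    token_times:   arrival timestamp (seconds) of each token, in order.
    Assumption: one timestamp per token.
    """
    if not token_times:
        raise ValueError("no tokens received")

    # TTFT covers queueing plus prefill: request sent until first token arrives.
    ttft = token_times[0] - request_start

    # Decode throughput excludes the first token so prefill time does not skew it.
    decode_window = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1
    rate = decoded / decode_window if decode_window > 0 else 0.0

    return StreamMetrics(ttft_s=ttft, decode_tok_per_s=rate)


# Hypothetical timestamps: first token at 0.5 s, then one token every 0.5 s.
example = stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0])
```

Keeping TTFT separate from the decode rate matters for tool-call sessions, where long prompts inflate prefill time without saying anything about steady-state generation speed.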

Would you be open to sharing your llm-compressor recipe (oneshot/Quant config + recipe yaml/py)? I'd like to make sure I'm reproducing your build exactly as the baseline before testing variant calibrations (code-focused datasets, larger sample counts, ignore-list tweaks). Happy to keep it private if you'd prefer not to publish it yet.

In return I can share the benchmark harness once it's stable and feed quality numbers back to you per build, which is useful signal if you want to publish better quality cards on the models.

If you'd rather not share the recipe, totally fine: I'll reconstruct from the model-card hints. Mostly wanted to ask first since you've got the ground-truth setup.

Cheers,
Philipp

Closed, switching to a private DM instead. Apologies for the noise.

nooneknows1 changed discussion status to closed

Re-opening, apologies for the noise.

nooneknows1 changed discussion status to open
