Ornstein-3.5-9B-V2 ROCmFPX STRIX_LEAN — GGUF

ROCmFPX Q4_0_ROCMFP4_STRIX_LEAN quant of GestaltLabs/Ornstein-3.5-9B-V2 (Qwen3.5 9B + RLVR/GRPO post-training, 9.2 B params, native multimodal with vision tower, single-file MTP head for speculative decoding).

Built with charlie12345/ROCmFPX on a Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11. Quantized 2026-06-28 with build commit 11d76c2.

| File | Size | Quant | BPW |
|---|---|---|---|
| Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf | 4.84 GB | Q4_0_ROCMFP4_STRIX_LEAN (4-bit ROCmFP4 + Strix K/V + Q5_K embed) | 4.42 |

This is not a stock llama.cpp quant; you need a ROCmFPX build of llama-server / llama-cli / llama-quantize to load it. The ROCmFP4 weight format is unknown to stock llama.cpp and will fail with unknown quantization.

Multimodal note: this is a vision-capable model. To use the vision tower, also load the mmproj-ornstein-v2-f16.gguf (~921 MB) companion file via --mmproj <path>. The companion is published separately by GestaltLabs in Ornstein-3.5-9B-V2-GGUF. Vision is verified to work in mesh_eval (1×1 test pixel identified as "Red").

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough ROCmFPX evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

  • Harness scope is bounded. The numbers below come from the mesh's mesh_eval (6 tests, 4 deterministic + throughput + vision) + hermes_loop_eval (5 agent scenarios) + a ctx_scaling_bench run at 4 K → 32 K (64 K+ blocked by harness HTTP timeout, not model capability). That's a regression suite, not a quality benchmark — it answers "does this quant still serve the mesh's agent stack correctly," not "is this the best possible 4-bit ROCmFP4 quant of this model."
  • Sample sizes are small. Throughput numbers are 3 reps on a single GPU; hermes_loop is 5 scenarios with one-shot generation. None are powered for statistical significance on a per-token level.
  • No perplexity / wikitext / MMLU / GSM8K. The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory. If you need a quality signal, charlie's own validation ladder or an lm-eval-harness run is the right tool. (Note: GestaltLabs's own published GPQA / reasoning numbers on the parent V2 model are 1.00 / 1.00 on the GBS-200 suite — those are upstream's numbers, not ours.)
  • Single GPU class. All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is not implied.
  • No human eval. "Faster and same-coherent on the regression tests" is not a quality verdict on this specific quant.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) handles the vision path, (e) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a more rigorous view, see the parent repo GestaltLabs/Ornstein-3.5-9B-V2 (which includes a published GBS-200 benchmark table) and the stock GGUF variants at GestaltLabs/Ornstein-3.5-9B-V2-GGUF and mradermacher/Ornstein-3.5-9B-V1.5-i1-GGUF.

What we measured

  • Hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
  • Software: charlie12345/ROCmFPX main @ 11d76c2
  • Source GGUF: ornstein-v2-f16.gguf (F16, 18.4 GB; includes the single-file MTP draft head baked into the same file), sourced from GestaltLabs/Ornstein-3.5-9B-V2-GGUF
  • Companion file for vision: mmproj-ornstein-v2-f16.gguf (~921 MB); not part of this upload, see the GestaltLabs repo
  • Same-stack comparison: none. The only same-source Q3_0_ROCMFPX quant for Ornstein we have is a baseline reference, not a same-harness A/B (different sampler settings, different time). The headline below is the model itself, not a comparison.

Throughput (mesh_eval, 4 K ctx, MTP-ON, turbo4 KV, rep_pen=1.1)

3 reps of 256-token completion: gen t/s mean 47.2 ± 0.4 (individual reps 45.3, 47.5, 47.4). This matches Vibetuned STRIX_LEAN's 47-48 t/s range on the same Node B card, even though the Ornstein file is about 31 % smaller (4.84 GB vs 7.0 GB).

Agent / loop validation (raw JSON: raw-hermes-loop-ornstein-v2-strix-lean-reppen1.1.json)

mesh_eval.py 4 deterministic + vision (raw-mesh-eval-ornstein-v2-strix-lean.json):

| Test | Result |
|---|---|
| gibberish (no degenerate repetition) | OK (47 words, 0 repeated chars) |
| thinking_leak (no <think> leakage) | CLEAN |
| tool_calling (single call) | PASS: get_weather(location=Tokyo) |
| coding (merge_sorted_lists) | PASS: correct two-pointer impl, tests pass |
| uncensored (no refusal) | PASS: ss -tuln answer |
| throughput (3×256-token gen) | 47.2 t/s mean, ±0.4 stdev |
| vision (1×1 test pixel) | PASS: identifies "Red" |
| overall_status | PASS, 4/4 + vision |

hermes_loop_eval.py 5 scenarios with rep_pen=1.1 (raw-hermes-loop-ornstein-v2-strix-lean-reppen1.1.json):

| Scenario | Result | avg t/s |
|---|---|---|
| single (one tool call) | PASS: final answer correct | 33.9 |
| chained (calc → use result) | PASS: 15 × 37 = 555 | 27.9 |
| multi_step (compare 2 cities) | PASS: table + conclusion | 39.7 |
| search (web search + extract) | PASS: Eiffel Tower height | 25.6 |
| error_recovery (file not found) | PASS: clean | 25.4 |
| overall_status | PASS, 5/5 | mean 30.5 |
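
The mean in the last row is the plain arithmetic mean of the five per-scenario averages. A minimal shell sketch to recompute it from the values listed above; `mean_tps` is an illustrative helper name, not part of hermes_loop_eval.py:

```shell
# Recompute the arithmetic mean of per-scenario throughput values.
# mean_tps is a hypothetical helper, not part of the mesh harness.
mean_tps() {
  # Print one value per line, then let awk sum them and divide by the count.
  printf '%s\n' "$@" |
    awk '{ sum += $1 } END { if (NR > 0) printf "%.1f\n", sum / NR }'
}

# Per-scenario avg t/s from the hermes_loop_eval results above.
mean_tps 33.9 27.9 39.7 25.6 25.4
```

This is an unweighted mean over scenarios, so a long scenario and a short one count equally; it is not a per-token throughput figure.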

rep_pen=1.1 is required for this model (and for Fable5, the other RLVR/GRPO-trained model in the mesh's stack). Without it, the model loops on chained and multi_step scenarios — the same tool-loop pattern the mesh's Fable5 work hit. With rep_pen=1.1 the model passes 5/5. The baseline (default sampler) was 3/5; that raw JSON is included as raw-hermes-loop-ornstein-v2-strix-lean-baseline.json for reference. The 2 missing scenarios are tool-loop failures, not quant defects.
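
If a server is already running without the flag, the penalty can in principle be set per request instead. The sketch below assumes that the ROCmFPX llama-server keeps stock llama.cpp's /completion endpoint and its repeat_penalty field, and that it listens on localhost:8080; neither assumption was checked against this fork. `build_payload` and `ask` are illustrative names.

```shell
# Build a JSON body for llama-server's /completion endpoint.
# Assumption: the fork accepts the same "repeat_penalty" field as stock llama.cpp.
# build_payload is a hypothetical helper. No JSON escaping is done, so the
# prompt must not contain double quotes or backslashes.
build_payload() {
  bp_prompt="$1"
  bp_rep_pen="${2:-1.1}"
  printf '{"prompt": "%s", "n_predict": 256, "repeat_penalty": %s}\n' \
    "$bp_prompt" "$bp_rep_pen"
}

# POST the body to a local server (host, port, and endpoint are assumptions).
ask() {
  build_payload "$1" "${2:-1.1}" |
    curl -s -H 'Content-Type: application/json' -d @- \
      "http://localhost:8080/completion"
}
```

For example, `ask "What is 15 times 37?"` would send the prompt with repeat_penalty 1.1. Setting --repeat-penalty 1.1 at serve time remains the configuration that was actually tested here.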

Context scaling (raw JSON: ctx-scaling-ornstein-v2-strix-lean-20260628-212553.json)

| Ctx target | pp t/s | tg t/s | Result |
|---|---|---|---|
| 4 K | 1081 | (per server logs) | PASS (per SUMMARY.md) |
| 8 K | 701 | (per server logs) | PASS |
| 16 K | 540 | (per server logs) | PASS |
| 32 K | 1140 | 50.0 | PASS, server healthy |
| 64 K | n/a | n/a | harness HTTP timeout (120s), not a model defect |
| 128 K | n/a | n/a | server OOM at 8 GB cache-ram; resolved with cram=24576 |

Findings:

  • 32 K prompt processing holds at 1140 pp t/s — the model handles 32 K comfortably on a 16 GB card with KV offload.
  • Decode throughput holds at 50 t/s at 32 K (matches Ornith 9B and Vibetuned 14B on the same card).
  • 64 K+ ctx scaling is harness-limited, not model-limited. The harness's 120 s urlopen timeout blocks measurement before the model can finish. Server health at 128 K is verified separately (with cram=24576, server is healthy at 71 % VRAM and processes 64 K+ prompts in ~7 min). The ctx_scaling_bench harness needs a longer HTTP timeout for proper 64 K+ measurement — that's a separate follow-up, not a model issue.
  • The 128 K test that initially OOM'd was due to cache-ram=8192 being too small for 128 K; bumping to cache-ram=24576 (24 GB DDR4 budget on Node B's 48 GB) resolves it.
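
Two of the numbers in these findings are easy to sanity-check from a shell. The sketch assumes cache-ram is expressed in MiB (consistent with 8192 being described as 8 GB above) and uses curl's --max-time in place of the harness's 120 s urlopen timeout. The endpoint, port, 2x safety factor, and helper names are all assumptions; this is not the ctx_scaling_bench fix itself.

```shell
# Convert a RAM budget in GiB to the MiB value passed as cache-ram.
gib_to_mib() {
  echo $(( $1 * 1024 ))
}

# Derive a client timeout in seconds from an observed processing time in
# minutes, multiplied by a safety factor (default 2). Illustrative only.
timeout_for() {
  echo $(( $1 * 60 * ${2:-2} ))
}

# Send a prepared request body with a long client-side timeout.
# Assumptions: server on localhost:8080, stock /completion endpoint,
# and "$1" pointing at a file that holds a valid JSON request.
long_ctx_request() {
  curl -s --max-time "$(timeout_for 7)" \
    -H 'Content-Type: application/json' \
    -d @"$1" "http://localhost:8080/completion"
}

gib_to_mib 24   # cache-ram value for a 24 GB budget
timeout_for 7   # seconds to allow for a prompt that takes about 7 min
```

With the ~7 min processing time reported for 64 K+ prompts, any client timeout below roughly 420 s will cut the request off before the first token, which is why the 120 s limit cannot produce a measurement.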

KV cache type (head_dim=128, same as Ornith + Vibetuned)

The mesh's KV-type sweep was run on the head_dim=128 Qwen family. turbo4 is the production default for any head_dim=128 model in the ROCmFPX build: it uses 0.7 to 1.1 GB less VRAM than q8_0 at the same throughput. See the Ornith 9B ROCmFPX STRIX_LEAN repo for the full sweep data. turbo3/4 are TheTom's turboquant types, absorbed into ROCmFPX main via PlunderStruck commits d859c9e + d0141e8.

Quick start

# Build llama.cpp with ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DGGML_CUDA=OFF \
  -DCMAKE_HIP_ARCHITECTURES=gfx1200 ...
cmake --build build --target llama-server llama-cli llama-quantize

# Download the mmproj companion for vision (separately published by GestaltLabs)
# wget https://huggingface.co/GestaltLabs/Ornstein-3.5-9B-V2-GGUF/resolve/main/mmproj-ornstein-v2-f16.gguf

# Serve (131 072 ctx, turbo4 KV for head_dim=128, fa=on, MTP-ON, rep_pen=1.1)
./build/bin/llama-server \
  -m Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf \
  --mmproj mmproj-ornstein-v2-f16.gguf \
  -np 1 -c 131072 \
  -ctk turbo4 -ctv turbo4 \
  -kvo -cram 24576 -fa on \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 --spec-draft-p-min 0.75 \
  --repeat-penalty 1.1

--spec-type draft-mtp is the correct flag, not mtp (the --spec-type mtp form in the upstream GestaltLabs HF model card is a typo; the ROCmFPX llama-server rejects mtp with a list of valid types). This is the same typo pattern that hit the SABER card upstream.

Reproduce the quant

# Source (F16 GGUF with MTP head baked in, from GestaltLabs)
SRC=/mnt/e/llms-models-data/ornstein/ornstein-v2-f16.gguf

# ROCmFPX llama-quantize (preset is built in; see `llama-quantize --help`)
~/ROCmFPX/build-rdna4/bin/llama-quantize \
  "$SRC" \
  Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf \
  Q4_0_ROCMFP4_STRIX_LEAN

Quantize time: ~4 min for 18.4 GB F16 source, CPU-only, no GPU required.
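
After quantizing, a quick container-level check can catch a truncated or mis-named output before it is loaded. The sketch relies only on the fact that GGUF files begin with the four ASCII bytes "GGUF"; it does not inspect the quant type, so it cannot tell a ROCmFP4 file from a stock one. `is_gguf` and `check_quant` are illustrative helper names.

```shell
# Succeed if the file starts with the 4-byte GGUF magic ("GGUF").
# This confirms the container format only, not the quantization type.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Report the size in bytes when the magic check passes.
# check_quant is a hypothetical helper name.
check_quant() {
  if is_gguf "$1"; then
    echo "ok: $1 ($(wc -c < "$1") bytes)"
  else
    echo "not a GGUF file: $1" >&2
    return 1
  fi
}
```

Usage would be `check_quant Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf`; a non-zero exit status means the file should not be uploaded or served.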

Files in this repo

| File | What it is |
|---|---|
| Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf | The quant. Load only with a ROCmFPX llama-server. |
| README.md | This file |
| raw-mesh-eval-ornstein-v2-strix-lean.json | mesh_eval.py output (2026-06-29 01:04 UTC): 4/4 + vision |
| raw-hermes-loop-ornstein-v2-strix-lean-baseline.json | hermes_loop_eval.py output WITHOUT rep_pen (2026-06-29 01:05 UTC): 3/5 |
| raw-hermes-loop-ornstein-v2-strix-lean-reppen1.1.json | hermes_loop_eval.py output WITH rep_pen=1.1 (2026-06-29 01:06 UTC): 5/5 |
| ctx-scaling-ornstein-v2-strix-lean-20260628-212553.json | 4 K → 32 K ctx scaling (32 K pp 1140, tg 50) |
| ctx-scaling-ornstein-v2-strix-lean-20260628-213758.json | 64 K / 128 K attempt (harness timeout) |
| quant-command.sh | The exact llama-quantize invocation used |

Not in this repo (intentionally): mmproj-ornstein-v2-f16.gguf (~921 MB) is a separate file published by GestaltLabs in Ornstein-3.5-9B-V2-GGUF; per that repo's model card, the mmproj ships alongside the GGUF quants there. We don't redistribute it here, to avoid third-party redistribution; download it directly from GestaltLabs.

What's NOT in this repo (caveats)

  • Stock llama.cpp will not load this file. The ROCmFP4 weight format is unique to charlie12345/ROCmFPX. Use that fork's llama-server/llama-cli/llama-quantize.
  • No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200). Vulkan path on RDNA4 has a known upstream regression (charlie12345/rocmfp4-llama issue #6) — we did not test it.
  • 64 K+ ctx scaling is harness-limited, not model-limited. The ctx_scaling_bench.py 120 s HTTP timeout blocks measurement at 64 K. The model itself handles 128 K ctx (verified separately with cram=24576). Proper 64 K+ numbers will require a harness fix (longer timeout or async polling) — that's a separate follow-up.
  • The source GGUF is GestaltLabs-distributed (per general.quantized_by in the F16 source metadata). The actual parent is GestaltLabs/Ornstein-3.5-9B-V2 (the safetensors model), itself a finetune of GestaltLabs/Ornstein-3.5-9B-V1.5, itself a finetune of Qwen/Qwen3.5-9B. The chain is: Qwen3.5-9B → V1.5 (SFT) → V2 (DPO + GRPO/RLVR post-training) → GestaltLabs F16 GGUF → our STRIX_LEAN.
  • 5 GB minimum VRAM for the GGUF alone; 12 GB with KV offload at 128 K. The mesh's 16 GB card runs it with ~3 GB headroom at 128 K ctx.
  • rep_pen=1.1 is mandatory for the agent loop. Without it, the model loops on chained and multi_step (3/5 PASS). This is the same tool-loop pattern seen with Fable5: a property of the RLVR/GRPO post-trained family, not of the quant. The fix is to add --repeat-penalty 1.1 to the serve command. (Note: it is unusual for an Apache-2.0 release to need an undocumented sampler setting, and upstream's HF card does not mention it. A friendly bug report to GestaltLabs is the right next step.)
  • --spec-type mtp is a typo in the upstream HF model card. The correct flag for llama-server is --spec-type draft-mtp. The mtp form is rejected with a list of valid types. This is a separate upstream bug.
  • Vision requires the mmproj-ornstein-v2-f16.gguf companion file. Not bundled in this repo; download from GestaltLabs/Ornstein-3.5-9B-V2-GGUF. The model card there labels the mmproj as mmproj-ornstein-v2-f16.gguf and notes it covers image + video input.
  • No MTP / speculative-decode sweep on this file beyond the default --spec-draft-n-max 3 --spec-draft-p-min 0.75. The mesh's MTP sweep was on Fable5 (Node D CUDA, +81% decode on Fable5 Q8 + MTP). Ornstein MTP settings may have a different optimal; we used the upstream-recommended values.
  • No quality benchmark (perplexity, MMLU, GSM8K). GestaltLabs's own published GPQA / reasoning numbers on the parent V2 model are 1.00 / 1.00 on the GBS-200 suite (per their HF card) — those are upstream's numbers, not ours.

Provenance

  • Source model: GestaltLabs/Ornstein-3.5-9B-V2 — 9.2 B params, Qwen3.5 9B base + DPO + GRPO/RLVR post-training, vision tower + MTP head baked in
  • Source model license: apache-2.0
  • Source GGUF uploader: GestaltLabs (the model authors themselves)
  • Companion file: mmproj-ornstein-v2-f16.gguf (~921 MB) in GestaltLabs/Ornstein-3.5-9B-V2-GGUF (NOT in this repo — see "Files in this repo" above)
  • Quantizer: charlie12345/ROCmFPX main @ 11d76c2 (2026-06-27)
  • Quantizer license: MIT
  • Build hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
  • Build tooling: NixOS 25.11, ROCm store paths dynamic-discovered. See the meshina repo's references/nixos-rocm-external-build-recipe.md for the build env setup.
  • Bench harnesses: scripts/mesh-bench/mesh_eval.py + scripts/mesh-bench/hermes_loop_eval.py + scripts/mesh-bench/ctx_scaling_bench.py from the meshina repo (private)
  • Original bench report: raw/benchmarks/2026-06-28-ornstein-charlie-bench/SUMMARY.md in the meshina repo (177 lines, full session record + cross-model comparison + 6 caveats)
  • Research note on 27B Ornstein feasibility: raw/research/2026-06-28-ornstein-27b-charlie-size-math.md (concludes 27B Ornstein is not feasible on 16 GB at 128 K; defer to 24 GB+ hardware)

License

  • The Ornstein 3.5 9B V2 parent is apache-2.0 (per its HF model card).
  • The charlie12345/ROCmFPX quantizer is MIT.
  • The GGUF in this repo is a derivative of the apache-2.0 parent, produced with the MIT-licensed quantizer. Both upstream licenses are preserved.