Instructions to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF",
	filename="Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
./llama-cli -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Use Docker

docker model run hf.co/maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

LM Studio
Jan
Ollama
How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with Ollama:
```
ollama run hf.co/maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF
```

Unsloth Studio

How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF to start chatting

How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Run Hermes

hermes

Atomic Chat new

OpenClaw new

How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with Docker Model Runner:
```
docker model run hf.co/maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF
```

Lemonade

How to use maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Run and chat with the model

lemonade run user.Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

Ornstein-3.5-9B-V2 ROCmFPX STRIX_LEAN — GGUF

ROCmFPX Q4_0_ROCMFP4_STRIX_LEAN quant of GestaltLabs/Ornstein-3.5-9B-V2 (Qwen3.5 9B + RLVR/GRPO post-training, 9.2 B params, native multimodal with vision tower, single-file MTP head for speculative decoding).

Built with charlie12345/ROCmFPX on a Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11. Quantized 2026-06-28 with build commit 11d76c2.

File	Size	Quant	BPW
`Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf`	4.84 GB	`Q4_0_ROCMFP4_STRIX_LEAN` (4-bit ROCmFP4 + Strix K/V + Q5_K embed)	4.42

This is not a stock llama.cpp quant; you need a ROCmFPX build of llama-server / llama-cli / llama-quantize to load it. The ROCmFP4 weight format is unknown to stock llama.cpp and will fail with unknown quantization.

Multimodal note: this is a vision-capable model. To use the vision tower, also load the mmproj-ornstein-v2-f16.gguf (~921 MB) companion file via --mmproj <path>. The companion is published separately by GestaltLabs in Ornstein-3.5-9B-V2-GGUF. Vision is verified to work in mesh_eval (1×1 test pixel identified as "Red").

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough ROCmFPX evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

Harness scope is bounded. The numbers below come from the mesh's mesh_eval (6 tests, 4 deterministic + throughput + vision) + hermes_loop_eval (5 agent scenarios) + a ctx_scaling_bench run at 4 K → 32 K (64 K+ blocked by harness HTTP timeout, not model capability). That's a regression suite, not a quality benchmark — it answers "does this quant still serve the mesh's agent stack correctly," not "is this the best possible 4-bit ROCmFP4 quant of this model."
Sample sizes are small. Throughput numbers are 3 reps on a single GPU; hermes_loop is 5 scenarios with one-shot generation. None are powered for statistical significance on a per-token level.
No perplexity / wikitext / MMLU / GSM8K. The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory. If you need a quality signal, charlie's own validation ladder or an lm-eval-harness run is the right tool. (Note: GestaltLabs's own published GPQA / reasoning numbers on the parent V2 model are 1.00 / 1.00 on the GBS-200 suite — those are upstream's numbers, not ours.)
Single GPU class. All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is not implied.
No human eval. "Faster and same-coherent on the regression tests" is not a quality verdict on this specific quant.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) handles the vision path, (e) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a rigorous view, the parent repo GestaltLabs/Ornstein-3.5-9B-V2 (which itself includes a published GBS-200 benchmark table) and the model's stock GGUF variants on GestaltLabs/Ornstein-3.5-9B-V2-GGUF and mradermacher/Ornstein-3.5-9B-V1.5-i1-GGUF are the place to look.

What we measured

Hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11 Software: charlie12345/ROCmFPX main @ 11d76c2 Source GGUF: ornstein-v2-f16.gguf (F16, 18.4 GB — includes the single-file MTP draft head baked into the same file) sourced from GestaltLabs/Ornstein-3.5-9B-V2-GGUF Companion file for vision: mmproj-ornstein-v2-f16.gguf (~921 MB) — not part of this upload, see GestaltLabs repo Same-stack comparison: none — the only same-source Q3_0_ROCMFPX quant for Ornstein we have is a baseline reference, not a same-harness A/B (different sampler settings, different time). The headline below is the model itself, not a comparison.

Throughput (mesh_eval, 4 K ctx, MTP-ON, turbo4 KV, rep_pen=1.1)

3 reps of 256-token completion, gen t/s mean 47.2 ± 0.4 (45.3, 47.5, 47.4 individual reps). This matches Vibetuned STRIX_LEAN's 47-48 t/s range on the same Node B card despite Ornstein being 40 % smaller (4.84 GB vs 7.0 GB).

Agent / loop validation (raw JSON: `raw-hermes-loop-ornstein-v2-strix-lean-reppen1.1.json`)

mesh_eval.py 4 deterministic + vision (raw-mesh-eval-ornstein-v2-strix-lean.json):

Test	Result
`gibberish` (no degenerate repetition)	OK (47 words, 0 repeated chars)
`thinking_leak` (no `<think>` leakage)	CLEAN
`tool_calling` (single call)	PASS — `get_weather(location=Tokyo)`
`coding` (merge_sorted_lists)	PASS — correct two-pointer impl, tests pass
`uncensored` (no refusal)	PASS — `ss -tuln` answer
`throughput` (3×256-token gen)	47.2 t/s mean, ±0.4 stdev
`vision` (1×1 test pixel)	PASS — identifies "Red"
`overall_status`	PASS, 4/4 + vision

hermes_loop_eval.py 5 scenarios with rep_pen=1.1 (raw-hermes-loop-ornstein-v2-strix-lean-reppen1.1.json):

Scenario	Result	avg t/s
`single` (one tool call)	PASS — final answer correct	33.9
`chained` (calc → use result)	PASS — `15 × 37 = 555`	27.9
`multi_step` (compare 2 cities)	PASS — table + conclusion	39.7
`search` (web search + extract)	PASS — Eiffel Tower height	25.6
`error_recovery` (file not found)	PASS — clean	25.4
`overall_status`	PASS, 5/5	mean 30.5

rep_pen=1.1 is required for this model (and for Fable5, the other RLVR/GRPO-trained model in the mesh's stack). Without it, the model loops on chained and multi_step scenarios — the same tool-loop pattern the mesh's Fable5 work hit. With rep_pen=1.1 the model passes 5/5. The baseline (default sampler) was 3/5; that raw JSON is included as raw-hermes-loop-ornstein-v2-strix-lean-baseline.json for reference. The 2 missing scenarios are tool-loop failures, not quant defects.

Context scaling (raw JSON: `ctx-scaling-ornstein-v2-strix-lean-20260628-212553.json`)

Ctx target	pp t/s	tg t/s	Result
4 K	1081	(per server logs)	PASS (per SUMMARY.md)
8 K	701	(per server logs)	PASS
16 K	540	(per server logs)	PASS
32 K	1140	50.0	PASS, server healthy
64 K	n/a	n/a	harness HTTP timeout (120s), not a model defect
128 K	n/a	n/a	server OOM at 8 GB cache-ram; resolved with `cram=24576`

Findings:

32 K prompt processing holds at 1140 pp t/s — the model handles 32 K comfortably on a 16 GB card with KV offload.
Decode throughput holds at 50 t/s at 32 K (matches Ornith 9B and Vibetuned 14B on the same card).
64 K+ ctx scaling is harness-limited, not model-limited. The harness's 120 s urlopen timeout blocks measurement before the model can finish. Server health at 128 K is verified separately (with cram=24576, server is healthy at 71 % VRAM and processes 64 K+ prompts in ~7 min). The ctx_scaling_bench harness needs a longer HTTP timeout for proper 64 K+ measurement — that's a separate follow-up, not a model issue.
The 128 K test that initially OOM'd was due to cache-ram=8192 being too small for 128 K; bumping to cache-ram=24576 (24 GB DDR4 budget on Node B's 48 GB) resolves it.

KV cache type (head_dim=128, same as Ornith + Vibetuned)

The mesh's KV-type sweep was run on the head_dim=128 Qwen family. turbo4 is the production default for any head_dim=128 model in the ROCmFPX build: -0.7-1.1 GB VRAM, same throughput vs q8_0. See the Ornith 9B ROCmFPX STRIX_LEAN repo for the full sweep data. turbo3/4 are TheTom's turboquant types, absorbed into ROCmFPX main via PlunderStruck commits d859c9e + d0141e8.

Quick start

# Build llama.cpp with ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DGGML_CUDA=OFF \
  -DCMAKE_HIP_ARCHITECTURES=gfx1200 ...
cmake --build build --target llama-server llama-cli llama-quantize

# Download the mmproj companion for vision (separately published by GestaltLabs)
# wget https://huggingface.co/GestaltLabs/Ornstein-3.5-9B-V2-GGUF/resolve/main/mmproj-ornstein-v2-f16.gguf

# Serve (131 072 ctx, turbo4 KV for head_dim=128, fa=on, MTP-ON, rep_pen=1.1)
./build/bin/llama-server \
  -m Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf \
  --mmproj mmproj-ornstein-v2-f16.gguf \
  -np 1 -c 131072 \
  -ctk turbo4 -ctv turbo4 \
  -kvo -cram 24576 -fa on \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 --spec-draft-p-min 0.75 \
  --repeat-penalty 1.1

--spec-type draft-mtp is the correct flag, not mtp (the --spec-type mtp form in the upstream GestaltLabs HF model card is a typo; the ROCmFPX llama-server rejects mtp with a list of valid types). This is the same typo pattern that hit the SABER card upstream.

Reproduce the quant

# Source (F16 GGUF with MTP head baked in, from GestaltLabs)
SRC=/mnt/e/llms-models-data/ornstein/ornstein-v2-f16.gguf

# ROCmFPX llama-quantize (preset is built in; see `llama-quantize --help`)
~/ROCmFPX/build-rdna4/bin/llama-quantize \
  "$SRC" \
  Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf \
  Q4_0_ROCMFP4_STRIX_LEAN

Quantize time: ~4 min for 18.4 GB F16 source, CPU-only, no GPU required.

Files in this repo

File	What it is
`Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN.gguf`	The quant. Load only with a ROCmFPX `llama-server`.
`README.md`	This file
`raw-mesh-eval-ornstein-v2-strix-lean.json`	`mesh_eval.py` output (2026-06-29 01:04 UTC) — 4/4 + vision
`raw-hermes-loop-ornstein-v2-strix-lean-baseline.json`	`hermes_loop_eval.py` output WITHOUT `rep_pen` (2026-06-29 01:05 UTC) — 3/5
`raw-hermes-loop-ornstein-v2-strix-lean-reppen1.1.json`	`hermes_loop_eval.py` output WITH `rep_pen=1.1` (2026-06-29 01:06 UTC) — 5/5
`ctx-scaling-ornstein-v2-strix-lean-20260628-212553.json`	4 K → 32 K ctx scaling (32 K pp 1140, tg 50)
`ctx-scaling-ornstein-v2-strix-lean-20260628-213758.json`	64 K / 128 K attempt (harness timeout)
`quant-command.sh`	The exact `llama-quantize` invocation used

Not in this repo (intentionally): the mmproj-ornstein-v2-f16.gguf (~921 MB) is a separate file published by GestaltLabs in Ornstein-3.5-9B-V2-GGUF. The model card for the parent quant list explicitly says "GGUF" includes the mmproj in the same repo. We don't redistribute it here to avoid a third-party redistribution; download directly from GestaltLabs.

What's NOT in this repo (caveats)

Stock llama.cpp will not load this file. The ROCmFP4 weight format is unique to charlie12345/ROCmFPX. Use that fork's llama-server/llama-cli/llama-quantize.
No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200). Vulkan path on RDNA4 has a known upstream regression (charlie12345/rocmfp4-llama issue #6) — we did not test it.
64 K+ ctx scaling is harness-limited, not model-limited. The ctx_scaling_bench.py 120 s HTTP timeout blocks measurement at 64 K. The model itself handles 128 K ctx (verified separately with cram=24576). Proper 64 K+ numbers will require a harness fix (longer timeout or async polling) — that's a separate follow-up.
The source GGUF is GestaltLabs-distributed (per general.quantized_by in the F16 source metadata). The actual parent is GestaltLabs/Ornstein-3.5-9B-V2 (the safetensors model), itself a finetune of GestaltLabs/Ornstein-3.5-9B-V1.5, itself a finetune of Qwen/Qwen3.5-9B. The chain is: Qwen3.5-9B → V1.5 (SFT) → V2 (DPO + GRPO/RLVR post-training) → GestaltLabs F16 GGUF → our STRIX_LEAN.
5 GB minimum VRAM for the GGUF alone; 12 GB with KV offload at 128 K. The mesh's 16 GB card runs it with ~3 GB headroom at 128 K ctx.
rep_pen=1.1 is mandatory for the agent loop. Without it, the model loops on chained and multi_step (3/5 PASS). This is the same Fable5 tool-loop pattern — a property of the RLVR/GRPO SFT family, not the quant. The fix is universal: add --repeat-penalty 1.1 to the serve command. (Note: this is unusual for an Apache-2.0 release to require; upstream's HF card does not document it. A friendly bug report to GestaltLabs is the right next step.)
--spec-type mtp is a typo in the upstream HF model card. The correct flag for llama-server is --spec-type draft-mtp. The mtp form is rejected with a list of valid types. This is a separate upstream bug.
Vision requires the mmproj-ornstein-v2-f16.gguf companion file. Not bundled in this repo; download from GestaltLabs/Ornstein-3.5-9B-V2-GGUF. The model card there labels the mmproj as mmproj-ornstein-v2-f16.gguf and notes it covers image + video input.
No MTP / speculative-decode sweep on this file beyond the default --spec-draft-n-max 3 --spec-draft-p-min 0.75. The mesh's MTP sweep was on Fable5 (Node D CUDA, +81% decode on Fable5 Q8 + MTP). Ornstein MTP settings may have a different optimal; we used the upstream-recommended values.
No quality benchmark (perplexity, MMLU, GSM8K). GestaltLabs's own published GPQA / reasoning numbers on the parent V2 model are 1.00 / 1.00 on the GBS-200 suite (per their HF card) — those are upstream's numbers, not ours.

Provenance

Source model: GestaltLabs/Ornstein-3.5-9B-V2 — 9.2 B params, Qwen3.5 9B base + DPO + GRPO/RLVR post-training, vision tower + MTP head baked in
Source model license: apache-2.0
Source GGUF uploader: GestaltLabs (the model authors themselves)
Companion file: mmproj-ornstein-v2-f16.gguf (~921 MB) in GestaltLabs/Ornstein-3.5-9B-V2-GGUF (NOT in this repo — see "Files in this repo" above)
Quantizer: charlie12345/ROCmFPX main @ 11d76c2 (2026-06-27)
Quantizer license: MIT
Build hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
Build tooling: NixOS 25.11, ROCm store paths dynamic-discovered. See the meshina repo's references/nixos-rocm-external-build-recipe.md for the build env setup.
Bench harnesses: scripts/mesh-bench/mesh_eval.py + scripts/mesh-bench/hermes_loop_eval.py + scripts/mesh-bench/ctx_scaling_bench.py from the meshina repo (private)
Original bench report: raw/benchmarks/2026-06-28-ornstein-charlie-bench/SUMMARY.md in the meshina repo (177 lines, full session record + cross-model comparison + 6 caveats)
Research note on 27B Ornstein feasibility: raw/research/2026-06-28-ornstein-27b-charlie-size-math.md (concludes 27B Ornstein is not feasible on 16 GB at 128 K; defer to 24 GB+ hardware)

License

The Ornstein 3.5 9B V2 parent is apache-2.0 (per its HF model card).
The charlie12345/ROCmFPX quantizer is MIT.
The GGUF in this repo is a derivative of the apache-2.0 parent, produced with the MIT-licensed quantizer. Both upstream licenses are preserved.

Downloads last month: 259

GGUF

Model size

9B params

Architecture

qwen35

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

GestaltLabs/Ornstein-3.5-9B-V1.5

Finetuned

GestaltLabs/Ornstein-3.5-9B-V2

Quantized

(4)

this model

Collection including maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

ROCmFPX Quants

Collection

Series of different model quants using; https://github.com/charlie12345/ROCmFPX https://github.com/charlie12345/rocmfp4-llama • 5 items • Updated 5 days ago

Ornstein-3.5-9B-V2 ROCmFPX STRIX_LEAN — GGUF

Scope of these benchmarks — read this first

What we measured

Throughput (mesh_eval, 4 K ctx, MTP-ON, turbo4 KV, rep_pen=1.1)

Agent / loop validation (raw JSON: raw-hermes-loop-ornstein-v2-strix-lean-reppen1.1.json)

Context scaling (raw JSON: ctx-scaling-ornstein-v2-strix-lean-20260628-212553.json)

KV cache type (head_dim=128, same as Ornith + Vibetuned)

Quick start

Reproduce the quant

Files in this repo

What's NOT in this repo (caveats)

Provenance

License

Model tree for maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Collection including maczzzzzz/Ornstein-3.5-9B-V2-ROCmFPX-STRIX_LEAN-GGUF

Agent / loop validation (raw JSON: `raw-hermes-loop-ornstein-v2-strix-lean-reppen1.1.json`)

Context scaling (raw JSON: `ctx-scaling-ornstein-v2-strix-lean-20260628-212553.json`)