II Medical 8B GGUF

GGUF quantizations of Intelligent-Internet/II-Medical-8B, verified end-to-end on the NVIDIA DGX Spark (GB10, 128 GB unified memory).

Spark-tested

Every Orionfold quant ships with a measurement quad taken on the NVIDIA DGX Spark (GB10, 128 GB unified memory): perplexity, sustained tok/s, thermal envelope, and MedMCQA (n=50, mcq_letter) accuracy. The numbers below come from the actual run, not a wishlist.

Variant   Size      Perplexity (wikitext-2)   tok/s on Spark   MedMCQA accuracy (n=50, mcq_letter)
Q4_K_M    4.7 GB    16.550                    43.6             42.0%
Q5_K_M    5.4 GB    16.242                    36.4             52.0%
Q6_K      6.3 GB    16.014                    32.8             46.0%
Q8_0      8.1 GB    16.296                    28.4             48.0%
F16       15.3 GB   16.268                    15.9             48.0%

Thermal envelope: 18 minutes of sustained load on a single GB10 before thermal throttling sets in. Beyond this, expect tok/s to degrade; the duty-cycle disclosure follows Orionfold's quant-card standard.
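With n=50, each MedMCQA question moves accuracy by 2 percentage points, so it can help to read the accuracy column with an interval around each value. The sketch below is not part of Orionfold's published protocol: it assumes each percentage in the table corresponds to a whole number of correct answers out of 50 and applies a standard 95% Wilson score interval. The name `wilson_interval` is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Accuracy values copied from the table above; n=50 as stated in the card.
N = 50
ACCURACY_PCT = {"Q4_K_M": 42.0, "Q5_K_M": 52.0, "Q6_K": 46.0, "Q8_0": 48.0, "F16": 48.0}

for variant, pct in ACCURACY_PCT.items():
    # Assumption: the percentage maps to a whole count of correct answers.
    correct = round(pct / 100 * N)
    low, high = wilson_interval(correct, N)
    print(f"{variant:7s} {pct:5.1f}%  95% CI: {low:.1%} to {high:.1%}")
```

Under these assumptions the intervals are roughly 26 points wide and overlap for all five variants, so the accuracy ordering between quants is better read as indicative than as a ranking.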

Variants

Variant   Recommended use
Q4_K_M    Best balance: the 4.7 GB file fits comfortably in Spark unified memory; default pick.
Q5_K_M    Higher quality than Q4_K_M with a modest size bump.
Q6_K      Near-lossless; recommended if memory headroom allows.
Q8_0      Effectively lossless; reach for this when quality matters more than throughput.
F16       Reference: no quantization. Use only for measurement / baseline.
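For machines with less headroom than the Spark, the "if memory headroom allows" guidance can be turned into a quick check. This is a rough sketch, not an Orionfold tool: `pick_variant` and its `overhead_gb` allowance (a placeholder for KV cache and runtime buffers) are hypothetical, and the file sizes are the ones listed in this card. F16 is left out because the card reserves it for baseline measurement.

```python
from typing import Optional

# File sizes in GB, copied from the measurement table in this card (F16 excluded).
QUANT_SIZE_GB = {"Q4_K_M": 4.7, "Q5_K_M": 5.4, "Q6_K": 6.3, "Q8_0": 8.1}


def pick_variant(budget_gb: float, overhead_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant whose file size plus overhead fits the budget.

    overhead_gb is an assumed allowance, not a measured value.
    """
    fitting = [
        (size, name)
        for name, size in QUANT_SIZE_GB.items()
        if size + overhead_gb <= budget_gb
    ]
    # max() compares by size first, so the largest fitting file wins.
    return max(fitting)[1] if fitting else None


print(pick_variant(8.0))
print(pick_variant(128.0))
```

Picking the largest file that fits favors quality over throughput; where tok/s matters more, the card's default of Q4_K_M remains the simpler choice.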

How to run

Pull a variant:

huggingface-cli download Orionfold/II-Medical-8B-GGUF model-Q5_K_M.gguf \
  --local-dir ./models/ii-medical-8b

Serve it via llama-server (OpenAI-compatible API):

llama-server -m ./models/ii-medical-8b/model-Q5_K_M.gguf \
  -c 4096 -ngl 99 -t 8 \
  --host 0.0.0.0 --port 8080
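Once the server is up, any OpenAI-style client can talk to it. The following is a minimal sketch using only the Python standard library; it assumes the server is reachable at `http://localhost:8080` (matching the `--port 8080` flag above) and that the chat endpoint is `/v1/chat/completions`. The helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server from the command above is running on this machine.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(ask("Summarize the key idea in one paragraph."))
```

Keeping request construction separate from the network call makes the payload easy to inspect without a live server.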

Or run in-process via llama-cpp-python:

from llama_cpp import Llama

# Load the GGUF; n_gpu_layers=99 requests that every layer be offloaded to the GPU.
llm = Llama(
    model_path="./models/ii-medical-8b/model-Q5_K_M.gguf",
    n_ctx=4096, n_gpu_layers=99, chat_format="chatml",
)
# temperature=0.0 selects greedy decoding for repeatable output.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the key idea in one paragraph."}],
    temperature=0.0,
)
print(out["choices"][0]["message"]["content"])

LM Studio loads the GGUF directly with no additional setup; Ollama loads it via a Modelfile.

Methods

Full methodology and Spark-side measurement protocol: "Vertical-curator quants on Spark: II-Medical-8B-GGUF + MedMCQA mini-eval".

Other Orionfold vertical curators

Same Spark-tested recipe across the curator-on-Spark series:

  • finance-chat-GGUF: AdaptLLM finance-chat (Llama-2-7B lineage) for FinanceBench-shaped queries
  • Saul-7B-Instruct-v1-GGUF: Equall Saul-7B legal-instruct for LegalBench-shaped queries
  • SecurityLLM-GGUF: Mistral-based cyber-tuned model with CyberMetric mini-eval gating

Each card lists its own measurement quad; the headline numbers are recorded exactly as the sweep produced them, never pre-corrected.


Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.

Want to know when the next Orionfold vertical curator drops? Join the launch list at orionfold.com.

Model size: 8B params. Architecture: qwen3.