Saul 7B Instruct v1 GGUF

GGUF quantizations of Equall/Saul-7B-Instruct-v1, verified end-to-end on the NVIDIA DGX Spark (GB10, 128 GB unified memory).

Spark-tested

Every Orionfold quant ships with a measurement quad on the NVIDIA DGX Spark (GB10, 128 GB unified memory): perplexity, sustained tok/s, thermal envelope, and LegalBench (n=50, contains) accuracy. The numbers below are the actual run, not a wishlist.

Variant   Size      Perplexity (wikitext-2)   tok/s on Spark   LegalBench (n=50, contains)
Q4_K_M    4.1 GB    5.986                     29.4             62.0%
Q5_K_M    4.8 GB    5.938                     20.2             72.0%
Q6_K      5.5 GB    5.925                     22.4             68.0%
Q8_0      7.2 GB    5.914                      7.3             66.0%
F16       13.5 GB   5.917                     10.9             68.0%

Thermal envelope: roughly 2 minutes of sustained load before thermal throttling on a single GB10. Beyond that point, expect tok/s degradation; the duty-cycle disclosure follows Orionfold's quant-card standard.
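
The "contains" column is a substring check: an answer counts as correct if the gold label shows up in the model's output. The exact harness lives in the Methods write-up linked below; the sketch here is an assumption about that scoring rule (case-insensitive matching, toy examples), not the shipped eval code.

def contains_score(predictions, gold_labels):
    # Fraction of answers that contain the gold label (case-insensitive).
    hits = sum(
        gold.strip().lower() in pred.lower()
        for pred, gold in zip(predictions, gold_labels)
    )
    return hits / len(gold_labels)

# Toy pair, not the real n=50 set:
preds = ["Yes, the sentence expressly overrules Curtman.",
         "Yes, it overrules the earlier case."]
golds = ["Yes", "No"]
print(f"{contains_score(preds, golds):.1%}")  # 50.0%: the second answer misses the gold "No"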

Variants

Variant   Recommended use
Q4_K_M    Best balance; fits comfortably in Spark unified memory; default pick.
Q5_K_M    Higher quality than Q4_K_M with a modest size bump.
Q6_K      Near-lossless; recommended if memory headroom allows.
Q8_0      Effectively lossless; reach for this when quality matters more than throughput.
F16       Reference; no quantization. Use only for measurement / baseline.

How to run

Pull a variant:

huggingface-cli download Orionfold/Saul-7B-Instruct-v1-GGUF model-Q5_K_M.gguf \
  --local-dir ./models/saul-7b-instruct-v1
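
If you would rather stay in Python, huggingface_hub can fetch the same file; a quick sketch reusing the repo and filename from the command above:

from huggingface_hub import hf_hub_download

# Downloads (or reuses a cached copy of) the Q5_K_M file into the local dir.
path = hf_hub_download(
    repo_id="Orionfold/Saul-7B-Instruct-v1-GGUF",
    filename="model-Q5_K_M.gguf",
    local_dir="./models/saul-7b-instruct-v1",
)
print(path)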

Serve it via llama-server (OpenAI-compatible API):

llama-server -m ./models/saul-7b-instruct-v1/model-Q5_K_M.gguf \
  -c 4096 -ngl 99 -t 8 \
  --host 0.0.0.0 --port 8080
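
Once the server is up, anything that speaks the OpenAI chat-completions API can query it. A minimal sketch with requests, assuming the host/port from the command above and a client on the same machine:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user",
             "content": "In one sentence, what does it mean for a case to be overruled?"}
        ],
        "temperature": 0.0,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])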

Or run in-process via llama-cpp-python:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/saul-7b-instruct-v1/model-Q5_K_M.gguf",
    n_ctx=4096,                       # context window
    n_gpu_layers=99,                  # offload every layer to the GB10
    chat_format="mistral-instruct",   # Saul is Mistral-7B-based ([INST] template)
)
out = llm.create_chat_completion(
    messages=[
        {"role": "user",
         "content": "Does the following sentence overrule a previous case? "
                    "Sentence: 'curtman is overruled to the extent it conflicts with evans.'"}
    ],
    temperature=0.0,  # greedy decoding for a stable yes/no-style answer
)
print(out["choices"][0]["message"]["content"])

LM Studio and Ollama (via a Modelfile) load the GGUF directly with no additional setup.

Methods

Full methodology and Spark-side measurement protocol: Vertical-curator quants on Spark – Saul-7B-Instruct-v1-GGUF + LegalBench mini-eval.

Other Orionfold vertical curators

Same Spark-tested recipe across the curator-on-Spark series:

  • finance-chat-GGUF – AdaptLLM finance-chat (Llama-2-7B lineage) for FinanceBench-shaped queries
  • SecurityLLM-GGUF – Mistral-based cyber-tuned model with CyberMetric mini-eval gating
  • II-Medical-8B-GGUF – Qwen3-8B + DAPO reasoning for MedMCQA-shaped queries

Each card lists its own measurement quad; the headline numbers are recorded as the actual sweep ran, never pre-corrected.


Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.

Want to know when the next Orionfold vertical curator drops? Join the launch list at orionfold.com.
