finance-chat GGUF

GGUF quantizations of AdaptLLM/finance-chat, verified end-to-end on the NVIDIA DGX Spark (GB10, 128 GB unified memory).

Spark-tested

Every Orionfold quant ships with a measurement quad on the NVIDIA DGX Spark (GB10, 128 GB unified memory): perplexity, sustained tok/s, thermal envelope, and FinanceBench (n=50, numeric_match) accuracy. The numbers below are the actual run, not a wishlist.

| Variant | Size | Perplexity (wikitext-2) | tok/s on Spark | FinanceBench (n=50, numeric_match) |
|---------|---------|--------------------------|----------------|------------------------------------|
| Q4_K_M  | 3.8 GB  | 6.221 | 31.1 | 14.0% |
| Q5_K_M  | 4.5 GB  | 6.164 | 26.9 | 16.0% |
| Q6_K    | 5.1 GB  | 6.147 | 23.9 | 16.0% |
| Q8_0    | 6.7 GB  | 6.137 | 8.9  | 18.0% |
| F16     | 12.6 GB | 6.137 | 11.5 | 18.0% |

Thermal envelope: a single GB10 sustains full load for roughly 2 minutes before thermal throttling. Beyond that point, expect tok/s degradation; this duty-cycle disclosure follows Orionfold's quant-card standard.
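The numeric_match scorer used for the FinanceBench column is not specified on this card; below is a minimal sketch of one plausible reading (extract the last number from the model's answer and compare it to the reference within a relative tolerance). The function name, regex, and 1% tolerance are assumptions, not the measured protocol.

```python
import re

def numeric_match(prediction: str, reference: float, rel_tol: float = 0.01) -> bool:
    # Hypothetical scorer: take the last number in the model's answer
    # (commas stripped) and compare within a relative tolerance.
    nums = re.findall(r"-?\d+(?:\.\d+)?", prediction.replace(",", ""))
    if not nums:
        return False
    value = float(nums[-1])
    return abs(value - reference) <= rel_tol * max(abs(reference), 1e-9)
```

Accuracy over the n=50 set would then be the mean of numeric_match across (prediction, reference) pairs.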

Variants

| Variant | Recommended use |
|---------|-----------------|
| Q4_K_M | Best balance of size and quality; fits comfortably in Spark unified memory; default pick. |
| Q5_K_M | Higher quality than Q4_K_M with a modest size bump. |
| Q6_K | Near-lossless; recommended if memory headroom allows. |
| Q8_0 | Effectively lossless; reach for this when quality matters more than throughput. |
| F16 | Reference (no quantization); use only for measurement / baseline. |
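Picking a variant mechanically from the size column above can be sketched as follows; the 2 GB headroom default (for KV cache and activations) is an illustrative assumption, not a measured figure.

```python
# Sizes in GB from the table above.
SIZES_GB = {"Q4_K_M": 3.8, "Q5_K_M": 4.5, "Q6_K": 5.1, "Q8_0": 6.7, "F16": 12.6}

def pick_variant(budget_gb: float, headroom_gb: float = 2.0) -> str:
    """Return the largest variant that fits the memory budget minus headroom."""
    usable = budget_gb - headroom_gb
    fitting = {name: size for name, size in SIZES_GB.items() if size <= usable}
    if not fitting:
        raise ValueError("no variant fits the budget")
    return max(fitting, key=fitting.get)
```

For example, an 8 GB budget selects Q6_K; the Spark's 128 GB unified memory fits every variant, including F16.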

How to run

Pull a variant:

```shell
huggingface-cli download Orionfold/finance-chat-GGUF model-Q5_K_M.gguf \
  --local-dir ./models/finance-chat
```

Serve it via llama-server (OpenAI-compatible API):

```shell
llama-server -m ./models/finance-chat/model-Q5_K_M.gguf \
  -c 4096 -ngl 99 -t 8 \
  --host 0.0.0.0 --port 8080
```
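Once the server is up, its OpenAI-compatible endpoint can be exercised with a plain curl call (the prompt text is illustrative):

```shell
# Query llama-server's OpenAI-compatible chat endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Define EBITDA in one sentence."}],
        "temperature": 0
      }'
```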

Or run in-process via llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/finance-chat/model-Q5_K_M.gguf",
    n_ctx=4096, n_gpu_layers=99, chat_format="llama-2",
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain working capital."}],
    temperature=0.0,
)
print(out["choices"][0]["message"]["content"])
```

LM Studio loads the GGUF directly with no additional setup; Ollama needs only a short Modelfile.
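For Ollama, a minimal Modelfile sketch (the model name and parameter choice below are illustrative):

```
FROM ./models/finance-chat/model-Q5_K_M.gguf
PARAMETER temperature 0
```

Build and run with `ollama create finance-chat -f Modelfile`, then `ollama run finance-chat`.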

Methods

Full methodology and Spark-side measurement protocol: "Vertical-curator quants on Spark: finance-chat-GGUF + FinanceBench mini-eval".

Other Orionfold vertical curators

Same Spark-tested recipe across the curator-on-Spark series:

Each card lists its own measurement quad; the headline numbers are recorded as the actual sweep ran, never pre-corrected.


Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.

Want to know when the next Orionfold vertical curator drops? Join the launch list at orionfold.com.

Model size: 7B params · Architecture: llama