OCC-RAG-1.7B-GGUF

 GitHub  |  Technical Report  |  Cloud  |  Base model

GGUF quantizations of occ-ai/OCC-RAG-1.7B for native inference with llama.cpp, Ollama, LM Studio, and other GGUF-compatible runtimes.

OCC-RAG-1.7B is a 1.7B-parameter small language model specialized for faithful, context-grounded question answering. Given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context supports an answer, and then either answers from the context or abstains. It is reported to attain the best faithfulness among models at all evaluated scales (up to 32B parameters). See the base model card for training details and benchmarks.

Files

The underlying architecture is Qwen3 (1.7B). The chat template is embedded in the GGUF, so llama.cpp/Ollama apply it automatically.

File                     | Quant  | Size    | Notes
OCC-RAG-1.7B-Q4_0.gguf   | Q4_0   | 1.05 GB | 4-bit, legacy, smallest
OCC-RAG-1.7B-Q4_K_M.gguf | Q4_K_M | 1.11 GB | 4-bit K-quant, recommended balance
OCC-RAG-1.7B-Q5_K_M.gguf | Q5_K_M | 1.26 GB | 5-bit K-quant
OCC-RAG-1.7B-Q6_K.gguf   | Q6_K   | 1.42 GB | 6-bit K-quant
OCC-RAG-1.7B-Q8_0.gguf   | Q8_0   | 1.83 GB | 8-bit, near-lossless
OCC-RAG-1.7B-F16.gguf    | F16    | 3.45 GB | 16-bit float (full precision)
OCC-RAG-1.7B-BF16.gguf   | BF16   | 3.45 GB | 16-bit bfloat (lossless base)

For most uses, pick Q4_K_M (the recommended balance of size and quality) or Q8_0 (near-lossless at under 2 GB). All quants are derived from the BF16 base. For an even smaller footprint or in-browser use, see occ-ai/OCC-RAG-0.6B-GGUF.
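
If the choice is driven by a disk or download budget, the table can be turned into a small lookup. The sketch below is illustrative only: pick_quant is a hypothetical helper, not part of the release, and its "largest file that fits" rule is an assumption rather than the authors' recommendation. The sizes are the file sizes listed above, converted to MB; memory use at runtime will be higher than the file size.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the largest quant file that fits a size budget.
# Sizes are file sizes from the table, in MB (1 GB = 1000 MB).
# F16 is omitted because it has the same size as BF16.
pick_quant() {
  local budget_mb=$1
  local best=""
  local entry size
  # Entries are ordered from smallest to largest, so the last fit wins.
  for entry in Q4_0:1050 Q4_K_M:1110 Q5_K_M:1260 Q6_K:1420 Q8_0:1830 BF16:3450; do
    size=${entry##*:}
    if [ "$size" -le "$budget_mb" ]; then
      best=${entry%%:*}
    fi
  done
  if [ -z "$best" ]; then
    echo "no quant fits in ${budget_mb} MB" >&2
    return 1
  fi
  echo "OCC-RAG-1.7B-${best}.gguf"
}
```

For example, `pick_quant 1500` prints the Q6_K file name, and a budget below 1050 MB returns a non-zero status.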

Usage — llama.cpp

# Run directly from the Hub (downloads the chosen quant)
llama-cli -hf occ-ai/OCC-RAG-1.7B-GGUF:Q4_K_M -p "Hello" -no-cnv

# Or download a file and run it
llama-cli -m OCC-RAG-1.7B-Q4_K_M.gguf -p "Hello" -no-cnv
# (newer llama.cpp: use `llama-completion` for non-interactive runs)

Usage — Ollama

ollama run hf.co/occ-ai/OCC-RAG-1.7B-GGUF:Q4_K_M

Input / output format

OCC-RAG uses a structured RAG prompt with special tokens: the question is wrapped in <|query_start|> … <|query_end|> and each source in <|source_start|><|source_id|>N … <|source_end|>. The response has five sections — query analysis → source analysis → reasoning → status (ANSWERABLE / UNANSWERABLE) → answer — and the final answer is in <|answer_start|> … <|answer_end|>.

The embedded chat template (apply with llama.cpp's --jinja) builds the query/source tokens for you when sources are supplied as documents; alternatively assemble the tokens manually. See the base model card for the full format and a runnable example.
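
For manual assembly, a minimal sketch of the token layout described above might look like the following. build_prompt is a hypothetical helper. Numbering sources from 1, putting a space between the source ID and the source text, and joining segments with newlines are all assumptions; the base model card remains the reference for the exact layout.

```shell
#!/usr/bin/env bash
# Sketch: assemble the query/source tokens by hand.
# Usage: build_prompt "question" "source 1 text" "source 2 text" ...
# ASSUMPTIONS: IDs start at 1, a space follows the ID, and segments are
# newline-separated. Check the base model card before relying on this.
build_prompt() {
  local question=$1
  shift
  printf '<|query_start|>%s<|query_end|>\n' "$question"
  local id=1
  local src
  for src in "$@"; do
    printf '<|source_start|><|source_id|>%d %s<|source_end|>\n' "$id" "$src"
    id=$((id + 1))
  done
}
```

The resulting string would then be supplied as the prompt text. Whether it needs any further wrapping is not covered here and should be checked against the base model card.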

We recommend greedy decoding (--temp 0), the training/evaluation default.
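
Once a response has been generated, the final answer and the status can be pulled out of the text with plain shell string handling. The two helpers below are hypothetical and read the response on stdin. extract_status assumes the status word does not also appear elsewhere in the reasoning trace, which may not always hold.

```shell
#!/usr/bin/env bash
# Sketch: post-process a saved model response (read from stdin).

# Print the text between <|answer_start|> and <|answer_end|>.
# Returns non-zero if the markers are missing or out of order.
extract_answer() {
  local response rest
  response=$(cat)
  case $response in
    *'<|answer_start|>'*'<|answer_end|>'*) ;;
    *) return 1 ;;
  esac
  # Drop everything up to and including the first start marker...
  rest=${response#*'<|answer_start|>'}
  # ...then everything from the first end marker onward.
  printf '%s\n' "${rest%%'<|answer_end|>'*}"
}

# Print UNANSWERABLE or ANSWERABLE, whichever occurs in the response.
# UNANSWERABLE is tested first because it contains ANSWERABLE.
extract_status() {
  local response
  response=$(cat)
  case $response in
    *UNANSWERABLE*) echo UNANSWERABLE ;;
    *ANSWERABLE*) echo ANSWERABLE ;;
    *) return 1 ;;
  esac
}
```

With the response saved to a file, `extract_answer < response.txt` prints only the answer span, where response.txt is a placeholder name. Surrounding whitespace inside the markers is kept as-is.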

Limitations

  • Context-grounded only. Trained to answer from the supplied sources and to ignore parametric knowledge — not a general-purpose chat or knowledge model.
  • Reasoning depth. Training/evaluation are capped at three-hop reasoning; longer chains are out of distribution.
  • Quantization. Lower-bit quants (Q4) trade some quality for size; prefer Q6_K/Q8_0 when accuracy matters most.

License

Released under the MIT License, inherited from the base model.

Citation

@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}