OCC-RAG-0.6B-ONNX

GitHub  |  Technical Report  |  Cloud  |  Base model

ONNX export of occ-ai/OCC-RAG-0.6B for cross-platform inference with ONNX Runtime and in-browser inference with 🤗 Transformers.js / ONNX Runtime Web (WebGPU). It runs the full model locally: no server is required and no data leaves the device.

OCC-RAG-0.6B is a 0.6B-parameter small language model specialized for faithful, context-grounded question answering: given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context supports an answer, and either answers from the context or abstains. See the base model card for training details and benchmarks.

ONNX variants

All variants share the same tokenizer and graph topology (Qwen3 architecture with KV-cache); they differ only in weight precision. In the quantized variants, linear layers are block-quantized with a block size of 32.

| dtype | File | Size | Description |
| --- | --- | --- | --- |
| fp32 | model.onnx (+ model.onnx_data) | ~2.4 GB | Full-precision baseline |
| fp16 | model_fp16.onnx | ~1.2 GB | All weights FP16 |
| q8 | model_quantized.onnx | ~599 MB | Dynamic INT8 (Transformers.js q8 default) |
| (none) | model_q8.onnx | ~1.1 GB | INT8 MatMul (asymmetric, MatMulNBits) + FP32 embedding & lm_head |
| q4 | model_q4.onnx | ~471 MB | INT4 MatMul + INT4 embedding (GatherBlockQuantized) + INT4 lm_head; smallest |
| q4f16 | model_q4f16.onnx | ~560 MB | INT4 MatMul on a pre-fused FP16 graph; recommended for WebGPU |
| q4f32 | model_q4f32.onnx | ~899 MB | INT4 MatMul + FP32 embedding & lm_head |

Notes:

  • q4f16 is the variant used by the in-browser WebGPU demo. Its RMSNorm is pre-fused into (Skip)SimplifiedLayerNormalization so ONNX Runtime Web loads it at the default optimization level. Its INT4 weights are quantized from the FP32 master (identical INT4 blobs to q4f32; only the scales differ: FP16 vs FP32).
  • q4 quantizes the token embedding and (tied) lm_head as well, giving the smallest footprint at a small quality cost.
  • model_q8.onnx (weight-only INT8 via MatMulNBits) and model_q4f32.onnx are addressable by dtype only in newer Transformers.js builds; the dynamic-INT8 model_quantized.onnx is what the bundled dtype: "q8" maps to.

The variant set follows the onnx-community / LiquidAI/LFM2.5-350M-ONNX layout.
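As a rough illustration of the variant table above, the following Python sketch encodes the files and lists those that fit a download budget. The sizes are the approximate figures from the table; `VARIANTS`, `variants_within`, and the label `"q8-matmulnbits"` are hypothetical names chosen for this example and are not part of any library.

```python
# Approximate sizes in MB, copied from the variant table (GB converted at 1 GB = 1000 MB).
# Keys are the table's dtype labels. "q8-matmulnbits" is a made-up label for
# model_q8.onnx, which has no dtype entry in the table.
VARIANTS = {
    "fp32": ("model.onnx", 2400),
    "fp16": ("model_fp16.onnx", 1200),
    "q8": ("model_quantized.onnx", 599),
    "q8-matmulnbits": ("model_q8.onnx", 1100),
    "q4": ("model_q4.onnx", 471),
    "q4f16": ("model_q4f16.onnx", 560),
    "q4f32": ("model_q4f32.onnx", 899),
}


def variants_within(budget_mb):
    """Return (label, repo_path) pairs whose size fits the budget, smallest first."""
    fitting = [
        (size, label, f"onnx/{name}")
        for label, (name, size) in VARIANTS.items()
        if size <= budget_mb
    ]
    return [(label, path) for size, label, path in sorted(fitting)]


print(variants_within(600))
```

The returned paths follow the onnx/ subfolder layout of this repository, so they can be passed to a downloader such as hf_hub_download. Note that model.onnx also needs its companion model.onnx_data file.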

Model files

OCC-RAG-0.6B-ONNX/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json     # chat_template inlined (Transformers.js reads it here)
├── special_tokens_map.json
├── added_tokens.json
├── vocab.json
├── merges.txt
├── quantize_config.json
└── onnx/
    ├── model.onnx            # fp32 (+ model.onnx_data)
    ├── model_fp16.onnx
    ├── model_q4.onnx
    ├── model_q4f16.onnx      # ← WebGPU
    ├── model_q4f32.onnx
    ├── model_q8.onnx
    └── model_quantized.onnx  # dynamic int8 (dtype "q8")

Input / output format

OCC-RAG uses a structured RAG prompt with special tokens. The chat template accepts a documents= kwarg and emits the structural tokens automatically: pass the user message as plain text and the sources as a list of {"text": ...} dicts. The question is wrapped in <|query_start|> … <|query_end|> and each source in <|source_start|><|source_id|>N … <|source_end|>.
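To confirm that the chat template really injected the structural tokens (for example, when a tokenizer version might silently ignore the documents kwarg), a rendered prompt can be inspected with a small helper. This is a minimal sketch: `inspect_prompt` is a hypothetical name, the regular expressions tolerate optional whitespace because the template's exact spacing is not documented here, and the demo string is hand-written rather than real template output.

```python
import re

QUERY_RE = re.compile(r"<\|query_start\|>(.*?)<\|query_end\|>", re.DOTALL)
SOURCE_RE = re.compile(
    r"<\|source_start\|>\s*<\|source_id\|>\s*(\d+)(.*?)<\|source_end\|>", re.DOTALL
)


def inspect_prompt(prompt):
    """Return (question, {source_id: text}) found in a rendered prompt.

    Raises ValueError if no query block is present.
    """
    query = QUERY_RE.search(prompt)
    if query is None:
        raise ValueError("no <|query_start|> ... <|query_end|> block found")
    sources = {
        int(m.group(1)): m.group(2).strip() for m in SOURCE_RE.finditer(prompt)
    }
    return query.group(1).strip(), sources


# Hand-written stand-in for a rendered prompt (assumed layout, for illustration only).
demo = (
    "<|query_start|>Where is Nova Scotia?<|query_end|>"
    "<|source_start|><|source_id|>1 Nova Scotia is a province of Canada.<|source_end|>"
)
print(inspect_prompt(demo))
```

In practice the argument would be the string returned by apply_chat_template with tokenize=False; an empty source dictionary suggests the documents were not rendered.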

The response has five sections, each delimited by special tokens: query analysis → source analysis → reasoning → status (ANSWERABLE / UNANSWERABLE) → answer. Parse the final answer from <|answer_start|> … <|answer_end|>. Keep skip_special_tokens=False if you need to read the structural tokens out of the raw output.
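The parsing step described above could be sketched as follows. Only the <|answer_start|> / <|answer_end|> tokens are taken from this card; the helper name `parse_response` is hypothetical, and the status lookup assumes the literal word ANSWERABLE or UNANSWERABLE appears somewhere before the answer section, since the delimiter tokens of the other sections are not listed here.

```python
import re

ANSWER_RE = re.compile(r"<\|answer_start\|>(.*?)<\|answer_end\|>", re.DOTALL)
# UNANSWERABLE is listed first; the word boundaries also stop ANSWERABLE
# from matching inside UNANSWERABLE.
STATUS_RE = re.compile(r"\b(UNANSWERABLE|ANSWERABLE)\b")


def parse_response(raw):
    """Extract the status label and final answer from raw model output.

    Missing parts are returned as None.
    """
    answer = ANSWER_RE.search(raw)
    # Assumption: the status section precedes the answer section, so only the
    # text before the answer is searched and the last match is used.
    head = raw[: answer.start()] if answer else raw
    statuses = STATUS_RE.findall(head)
    return {
        "status": statuses[-1] if statuses else None,
        "answer": answer.group(1).strip() if answer else None,
    }


# Hand-written example output (not produced by the model).
raw = "Bell is buried in Nova Scotia, Canada. ANSWERABLE <|answer_start|> Canada <|answer_end|>"
print(parse_response(raw))
```

This only works on text decoded with skip_special_tokens=False, because the answer delimiters are special tokens and would otherwise be stripped.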

We recommend greedy decoding (do_sample=False), the training/evaluation default baked into generation_config.json.

Usage: Transformers.js (browser / Node, WebGPU)

import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "occ-ai/OCC-RAG-0.6B-ONNX", {
  dtype: "q4f16",   // WebGPU-friendly; or "q8" / "q4" / "fp16"
  device: "webgpu",
});

const question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?";
const documents = [
  { text: "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone." },
  { text: "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there." },
  { text: "Nova Scotia is a province on the east coast of Canada." },
];

// The chat template injects the <|query_*|> / <|source_*|> structural tokens.
const text = generator.tokenizer.apply_chat_template(
  [{ role: "user", content: question }],
  { documents, add_generation_prompt: true, tokenize: false },
);

const output = await generator(text, {
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: false }),
});
console.log(output[0].generated_text);

A ready-to-run WebGPU chat demo (Vite + Transformers.js) lives in huggingface/transformers.js-examples → occ-rag-webgpu.

Usage: ONNX Runtime (Python)

pip install onnxruntime transformers numpy huggingface_hub
import numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

model_id = "occ-ai/OCC-RAG-0.6B-ONNX"
onnx_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
tok = AutoTokenizer.from_pretrained(model_id)

session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]
prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents, tokenize=False, add_generation_prompt=True,
)
input_ids = np.array([tok.encode(prompt, add_special_tokens=False)], dtype=np.int64)

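import json  # stdlib; only needed to read config.json below

# Cache dimensions live in config.json, not in the ONNX model metadata.
with open(hf_hub_download(model_id, "config.json")) as f:
    cfg = json.load(f)
kv_shape = (1, cfg["num_key_value_heads"], 0, cfg["head_dim"])  # empty cache on the first step

# Greedy decode with KV-cache: feed input_ids + attention_mask + position_ids and the
# past_key_values.{i}.{key,value} inputs, then loop feeding the present.* outputs back in.
# Assumes the graph exposes a "logits" output and present.{i}.{key,value} outputs.
in_names = {i.name for i in session.get_inputs()}
out_names = [o.name for o in session.get_outputs()]
past = {f"past_key_values.{i}.{kv}": np.zeros(kv_shape, dtype=np.float32)  # FP32 cache for model_q4.onnx
        for i in range(cfg["num_hidden_layers"]) for kv in ("key", "value")}
eos_ids = {151643, 151645, 151683}
step_ids, total_len, generated = input_ids, input_ids.shape[1], []
for _ in range(512):  # max_new_tokens
    feed = {"input_ids": step_ids, "attention_mask": np.ones((1, total_len), dtype=np.int64), **past}
    if "position_ids" in in_names:
        feed["position_ids"] = np.arange(total_len - step_ids.shape[1], total_len, dtype=np.int64)[None, :]
    outs = dict(zip(out_names, session.run(None, feed)))
    next_id = int(outs["logits"][0, -1].argmax())
    if next_id in eos_ids:
        break
    generated.append(next_id)
    past = {k.replace("present", "past_key_values", 1): v for k, v in outs.items() if k.startswith("present")}
    step_ids, total_len = np.array([[next_id]], dtype=np.int64), total_len + 1
print(tok.decode(generated, skip_special_tokens=False))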

The INT4/INT8 ONNX graphs are weight-only quantized (MatMulNBits / GatherBlockQuantized) and carry a KV-cache interface. model_q4f16.onnx expects FP16 KV-cache I/O; the others use FP32. See config.json (num_hidden_layers, num_key_value_heads, head_dim) for the cache tensor shapes [batch, kv_heads, seq, head_dim].
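The cache layout described above can be written down as a small, dependency-free sketch that lists the name, shape, and element type of every empty first-step cache tensor. `empty_cache_spec` is a hypothetical helper, and the numbers in `toy_config` are placeholders for illustration; the real values should be read from config.json.

```python
def empty_cache_spec(config, dtype_label, batch=1):
    """Map each past_key_values input name to the (shape, numpy dtype name)
    of its empty first-step tensor."""
    # Per the note above: model_q4f16.onnx expects FP16 KV-cache I/O, the others FP32.
    kv_dtype = "float16" if dtype_label == "q4f16" else "float32"
    # [batch, kv_heads, seq, head_dim] with seq = 0 before any token is cached.
    shape = (batch, config["num_key_value_heads"], 0, config["head_dim"])
    return {
        f"past_key_values.{layer}.{kind}": (shape, kv_dtype)
        for layer in range(config["num_hidden_layers"])
        for kind in ("key", "value")
    }


# Placeholder numbers for illustration only; read the real values from config.json.
toy_config = {"num_hidden_layers": 2, "num_key_value_heads": 4, "head_dim": 8}
spec = empty_cache_spec(toy_config, "q4f16")
print(spec["past_key_values.0.key"])
```

Each entry can then be materialized with np.zeros(shape, dtype=dtype_name) before the first session.run call.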

Limitations

  • Context-grounded only. Trained to answer from the supplied sources and to ignore parametric knowledge; it is not a general-purpose chat or knowledge model.
  • Reasoning depth. Training/evaluation are capped at three-hop reasoning; longer chains are out of distribution.
  • Quantization. The INT4 variants (q4, q4f16) trade a small amount of quality for size/speed; prefer fp16 / q8 when accuracy matters most.

License

Released under the MIT License, inherited from the base model.

Citation

@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}