OCC-RAG-0.6B-ONNX


GitHub | Technical Report | Cloud | Base model

ONNX export of occ-ai/OCC-RAG-0.6B for cross-platform inference with ONNX Runtime and in-browser inference with 🤗 Transformers.js / ONNX Runtime Web (WebGPU). The full model runs locally: no server is needed and no data leaves the device.

OCC-RAG-0.6B is a 0.6B-parameter small language model specialized for faithful, context-grounded question answering: given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context supports an answer, and either answers from the context or abstains. See the base model card for training details and benchmarks.

ONNX variants

All variants share the same tokenizer and the same Qwen3 architecture with a KV-cache interface; they differ in weight precision, and q4f16 is additionally built on a pre-fused FP16 graph. Block-quantized linear layers use block size 32.

dtype	File	Size	Description
`fp32`	`model.onnx` (+ `model.onnx_data`)	~2.4 GB	Full-precision baseline
`fp16`	`model_fp16.onnx`	~1.2 GB	All weights FP16
`q8`	`model_quantized.onnx`	~599 MB	Dynamic INT8 (Transformers.js `q8` default)
—	`model_q8.onnx`	~1.1 GB	INT8 MatMul (asymmetric, MatMulNBits) + FP32 embedding & lm_head
`q4`	`model_q4.onnx`	~471 MB	INT4 MatMul + INT4 embedding (GatherBlockQuantized) + INT4 lm_head — smallest
`q4f16`	`model_q4f16.onnx`	~560 MB	INT4 MatMul on a pre-fused FP16 graph — recommended for WebGPU
`q4f32`	`model_q4f32.onnx`	~899 MB	INT4 MatMul + FP32 embedding & lm_head

Notes:

q4f16 is the variant used by the in-browser WebGPU demo. Its RMSNorm is pre-fused into (Skip)SimplifiedLayerNormalization so ONNX Runtime Web loads it at the default optimization level. Its INT4 weights are quantized from the FP32 master (identical INT4 blobs to q4f32; only the scales differ — FP16 vs FP32).
q4 quantizes the token embedding and (tied) lm_head as well, giving the smallest footprint at a small quality cost.
model_q8.onnx (weight-only INT8 via MatMulNBits) and model_q4f32.onnx can be selected by dtype only in newer Transformers.js builds; dtype: "q8" resolves to the dynamic-INT8 model_quantized.onnx.

The variant set follows the onnx-community / LiquidAI/LFM2.5-350M-ONNX layout.
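To make "block-quantized with block size 32" concrete, here is a minimal NumPy sketch of asymmetric block-wise quantization. It illustrates the general technique only and is not the code used to export this model: the function names are hypothetical, the per-block offset is kept as a float minimum rather than the packed integer zero point that a real kernel would store, and the storage estimate assumes one scale and one zero point per block.

```python
import numpy as np

def quantize_blockwise(w, bits=4, block_size=32):
    """Asymmetric block-wise quantization of a 1-D float array.

    Simplified illustration: each block of `block_size` weights gets its own
    scale and offset. len(w) must be divisible by block_size.
    """
    qmax = (1 << bits) - 1                      # 15 for 4-bit codes
    blocks = w.reshape(-1, block_size).astype(np.float32)
    lo = blocks.min(axis=1, keepdims=True)      # per-block offset
    hi = blocks.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax                    # per-block step size
    scale[scale == 0] = 1.0                     # constant block: any scale works
    q = np.clip(np.round((blocks - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize_blockwise(q, scale, lo):
    """Reconstruct approximate float weights from codes, scales and offsets."""
    return (q.astype(np.float32) * scale + lo).reshape(-1)

def bits_per_weight(bits=4, block_size=32, scale_bits=16, zero_point_bits=4):
    """Storage per weight, assuming one scale and one zero point per block."""
    return bits + (scale_bits + zero_point_bits) / block_size

rng = np.random.default_rng(0)
w = rng.standard_normal(4 * 32).astype(np.float32)
q, scale, lo = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scale, lo)
print("max abs error:", float(np.abs(w - w_hat).max()))
print("bits per weight (FP16 scales):", bits_per_weight(scale_bits=16))
print("bits per weight (FP32 scales):", bits_per_weight(scale_bits=32))
```

The reconstruction error in each block is bounded by half that block's scale, which is why smaller blocks cost more storage for scales but track the weights more closely.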

Model files

OCC-RAG-0.6B-ONNX/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json     # chat_template inlined (Transformers.js reads it here)
├── special_tokens_map.json
├── added_tokens.json
├── vocab.json
├── merges.txt
├── quantize_config.json
└── onnx/
    ├── model.onnx            # fp32 (+ model.onnx_data)
    ├── model_fp16.onnx
    ├── model_q4.onnx
    ├── model_q4f16.onnx      # ← WebGPU
    ├── model_q4f32.onnx
    ├── model_q8.onnx
    └── model_quantized.onnx  # dynamic int8 (dtype "q8")
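
When downloading a single variant in Python, it can help to keep the table and file tree above in a small lookup. The sketch below is a convenience, not part of the repository: `VARIANT_FILES` and `onnx_files_for` are hypothetical names, the label `q8-weight-only` is invented here because the table lists no dtype for `model_q8.onnx`, and the location of `model.onnx_data` next to `model.onnx` is assumed from the tree.

```python
# Hypothetical lookup: variant label -> repo files needed for that variant.
VARIANT_FILES = {
    "fp32": ["onnx/model.onnx", "onnx/model.onnx_data"],  # external weights file assumed alongside
    "fp16": ["onnx/model_fp16.onnx"],
    "q8": ["onnx/model_quantized.onnx"],        # dynamic INT8
    "q8-weight-only": ["onnx/model_q8.onnx"],   # label invented for this sketch
    "q4": ["onnx/model_q4.onnx"],
    "q4f16": ["onnx/model_q4f16.onnx"],
    "q4f32": ["onnx/model_q4f32.onnx"],
}

def onnx_files_for(variant: str) -> list[str]:
    """Return every repo path that has to be fetched for the given variant."""
    try:
        return list(VARIANT_FILES[variant])
    except KeyError:
        known = ", ".join(sorted(VARIANT_FILES))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}") from None

print(onnx_files_for("q4f16"))
print(onnx_files_for("fp32"))
```

Each returned path can be passed to `hf_hub_download(model_id, path)`; for `fp32`, both files should end up in the same local directory so the graph can find its external data.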

Input / output format

OCC-RAG uses a structured RAG prompt with special tokens. The chat template accepts a documents= kwarg and emits the structural tokens automatically — pass the user message as plain text and the sources as a list of {"text": ...} dicts. The question is wrapped in <|query_start|> … <|query_end|> and each source in <|source_start|><|source_id|>N … <|source_end|>.
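
A quick sanity check on the rendered prompt can catch a chat template that silently ignored `documents`. The sketch below only counts the structural tokens named above; `check_rag_prompt` is a hypothetical helper, and it assumes the template emits exactly one query span and one source span per document. The demo string is hand-written for illustration and is not actual template output.

```python
def check_rag_prompt(prompt: str, num_documents: int) -> list[str]:
    """Return a list of problems found in a rendered prompt (empty means it looks fine)."""
    problems = []
    # Assumption: the question is wrapped exactly once.
    for token in ("<|query_start|>", "<|query_end|>"):
        count = prompt.count(token)
        if count != 1:
            problems.append(f"expected exactly one {token}, found {count}")
    # Assumption: one start / id / end token per supplied document.
    for token in ("<|source_start|>", "<|source_id|>", "<|source_end|>"):
        count = prompt.count(token)
        if count != num_documents:
            problems.append(f"expected {num_documents} x {token}, found {count}")
    return problems

# Hand-written stand-in for a rendered prompt (layout is illustrative only).
demo = "<|query_start|>Where is Bell buried?<|query_end|>" + "".join(
    f"<|source_start|><|source_id|>{i} some text<|source_end|>" for i in (1, 2)
)
print(check_rag_prompt(demo, num_documents=2))
print(check_rag_prompt(demo, num_documents=3))
```

In practice the `prompt` argument would be the string returned by `apply_chat_template(..., tokenize=False)`.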

The response has five sections, each delimited by special tokens: query analysis → source analysis → reasoning → status (ANSWERABLE / UNANSWERABLE) → answer. Parse the final answer from <|answer_start|> … <|answer_end|>. Keep skip_special_tokens=False if you need to read the structural tokens out of the raw output.
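
The parsing step can be sketched as follows. Only the `<|answer_start|>` / `<|answer_end|>` tokens and the two status literals come from the description above; `parse_response` and `ParsedResponse` are hypothetical names, and the code assumes the status appears as a standalone word somewhere before the answer section. If generation was cut off before `<|answer_end|>`, the sketch returns whatever follows `<|answer_start|>`.

```python
import re
from typing import NamedTuple, Optional

class ParsedResponse(NamedTuple):
    status: Optional[str]  # "ANSWERABLE", "UNANSWERABLE", or None if not found
    answer: Optional[str]  # text between the answer tokens, or None if absent

_ANSWER_RE = re.compile(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|$)", re.DOTALL)
_STATUS_RE = re.compile(r"\b(UNANSWERABLE|ANSWERABLE)\b")

def parse_response(raw: str) -> ParsedResponse:
    """Extract status and final answer from raw output decoded with special tokens kept."""
    match = _ANSWER_RE.search(raw)
    answer = match.group(1).strip() if match else None
    # Search for the status only in the text that precedes the answer section,
    # and take the last occurrence (assumption: status directly precedes the answer).
    head = raw[: match.start()] if match else raw
    statuses = _STATUS_RE.findall(head)
    status = statuses[-1] if statuses else None
    return ParsedResponse(status=status, answer=answer)

# Made-up raw output; only the answer tokens follow the documented format.
raw = "reasoning text ... ANSWERABLE <|answer_start|> Canada <|answer_end|>"
result = parse_response(raw)
print(result.status, "|", result.answer)
```

An application would typically treat `status == "UNANSWERABLE"` as an abstention and avoid showing the answer text as a grounded result.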

We recommend greedy decoding (do_sample=False), the training/evaluation default baked into generation_config.json.

Usage — Transformers.js (browser / Node, WebGPU)

import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "occ-ai/OCC-RAG-0.6B-ONNX", {
  dtype: "q4f16",   // WebGPU-friendly; or "q8" / "q4" / "fp16"
  device: "webgpu",
});

const question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?";
const documents = [
  { text: "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone." },
  { text: "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there." },
  { text: "Nova Scotia is a province on the east coast of Canada." },
];

// The chat template injects the <|query_*|> / <|source_*|> structural tokens.
const text = generator.tokenizer.apply_chat_template(
  [{ role: "user", content: question }],
  { documents, add_generation_prompt: true, tokenize: false },
);

const output = await generator(text, {
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: false }),
});
console.log(output[0].generated_text);

A ready-to-run WebGPU chat demo (Vite + Transformers.js) lives in huggingface/transformers.js-examples → occ-rag-webgpu.

Usage — ONNX Runtime (Python)

pip install onnxruntime transformers numpy huggingface_hub

import numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

model_id = "occ-ai/OCC-RAG-0.6B-ONNX"
onnx_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
tok = AutoTokenizer.from_pretrained(model_id)

session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]
prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents, tokenize=False, add_generation_prompt=True,
)
input_ids = np.array([tok.encode(prompt, add_special_tokens=False)], dtype=np.int64)

import json  # stdlib, only needed to read config.json

# config.json holds num_hidden_layers / num_key_value_heads / head_dim, which fix the KV-cache shapes.
with open(hf_hub_download(model_id, "config.json")) as f:
    cfg = json.load(f)
# Greedy decode with KV-cache: feed input_ids + attention_mask + position_ids and the
# past_key_values.{i}.{key,value} inputs (empty on the first step), then loop feeding the
# present.* outputs back in. Stop on eos ids 151643 / 151645 / 151683.

The weight-only quantized graphs (model_q4.onnx, model_q4f16.onnx, model_q4f32.onnx, model_q8.onnx) use MatMulNBits and, in q4, GatherBlockQuantized; model_quantized.onnx uses dynamic INT8 instead. All variants carry the same KV-cache interface. model_q4f16.onnx expects FP16 KV-cache I/O; the others use FP32. See config.json (num_hidden_layers, num_key_value_heads, head_dim) for the cache tensor shapes [batch, kv_heads, seq, head_dim].
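
The decode loop described above could look roughly like the sketch below. It is a minimal illustration for batch size 1, not a reference implementation: `greedy_decode` is a hypothetical helper, the output name `logits` is an assumption (check `session.get_outputs()` for the real names), and the input and output names for the cache follow the `past_key_values.{i}.{key,value}` / `present.*` pattern mentioned above. The function only relies on `session.run(output_names, feeds)`.

```python
import numpy as np

EOS_IDS = frozenset({151643, 151645, 151683})  # stop ids listed in the usage section

def greedy_decode(session, input_ids, num_layers, kv_heads, head_dim,
                  max_new_tokens=512, kv_dtype=np.float32, eos_ids=EOS_IDS):
    """Greedy decoding with a KV-cache; returns the list of generated token ids."""
    batch, _ = input_ids.shape
    assert batch == 1, "this sketch handles a single sequence only"
    kv_names = [f"{i}.{kind}" for i in range(num_layers) for kind in ("key", "value")]
    # First step: empty cache, i.e. a zero-length sequence axis.
    past = {
        f"past_key_values.{name}": np.zeros((1, kv_heads, 0, head_dim), dtype=kv_dtype)
        for name in kv_names
    }
    out_names = ["logits"] + [f"present.{name}" for name in kv_names]  # "logits" is assumed

    step_ids = input_ids.astype(np.int64)
    seen = 0          # number of tokens already stored in the cache
    generated = []
    for _ in range(max_new_tokens):
        cur = step_ids.shape[1]
        feeds = {
            "input_ids": step_ids,
            # The mask covers cached tokens plus the tokens fed in this step.
            "attention_mask": np.ones((1, seen + cur), dtype=np.int64),
            "position_ids": np.arange(seen, seen + cur, dtype=np.int64)[None, :],
            **past,
        }
        outputs = session.run(out_names, feeds)
        logits, presents = outputs[0], outputs[1:]
        seen += cur
        next_id = int(np.argmax(logits[0, -1]))   # greedy: most likely next token
        if next_id in eos_ids:
            break
        generated.append(next_id)
        # present.* outputs become the next step's past_key_values.* inputs.
        past = {f"past_key_values.{name}": value for name, value in zip(kv_names, presents)}
        step_ids = np.array([[next_id]], dtype=np.int64)
    return generated
```

With the session and tokenizer from the usage section, a call would pass `num_layers`, `kv_heads` and `head_dim` from the `num_hidden_layers`, `num_key_value_heads` and `head_dim` fields of config.json, and the result can be decoded with `tok.decode(ids, skip_special_tokens=False)`. For model_q4f16.onnx, `kv_dtype=np.float16` matches the FP16 cache I/O noted above.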

Limitations

Context-grounded only. Trained to answer from the supplied sources and to ignore parametric knowledge — not a general-purpose chat or knowledge model.
Reasoning depth. Training/evaluation are capped at three-hop reasoning; longer chains are out of distribution.
Quantization. The INT4 variants (q4, q4f16) trade a small amount of quality for size/speed; prefer fp16 / q8 when accuracy matters most.

License

Released under the MIT License, inherited from the base model.

Citation

@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}


Model tree for occ-ai/OCC-RAG-0.6B-ONNX

Base model: Qwen/Qwen3-0.6B-Base
Finetuned: occ-ai/OCC-RAG-0.6B
Quantized (3): this model
