OCC-RAG-0.6B-ONNX
GitHub | Technical Report | Cloud | Base model
ONNX export of occ-ai/OCC-RAG-0.6B for
cross-platform inference with ONNX Runtime and in-browser
inference with 🤗 Transformers.js /
ONNX Runtime Web (WebGPU). It runs the full model locally: there is no server, and no data leaves
the device.
OCC-RAG-0.6B is a 0.6B-parameter small language model specialized for faithful, context-grounded question answering: given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context supports an answer, and either answers from the context or abstains. See the base model card for training details and benchmarks.
ONNX variants
All variants share the same tokenizer and graph topology (Qwen3 architecture with KV-cache); they differ only in weight precision. Linear layers are block-quantized with block size 32.
| dtype | File | Size | Description |
|---|---|---|---|
| fp32 | model.onnx (+ model.onnx_data) | ~2.4 GB | Full-precision baseline |
| fp16 | model_fp16.onnx | ~1.2 GB | All weights FP16 |
| q8 | model_quantized.onnx | ~599 MB | Dynamic INT8 (Transformers.js q8 default) |
| (none) | model_q8.onnx | ~1.1 GB | INT8 MatMul (asymmetric, MatMulNBits) + FP32 embedding & lm_head |
| q4 | model_q4.onnx | ~471 MB | INT4 MatMul + INT4 embedding (GatherBlockQuantized) + INT4 lm_head (smallest) |
| q4f16 | model_q4f16.onnx | ~560 MB | INT4 MatMul on a pre-fused FP16 graph (recommended for WebGPU) |
| q4f32 | model_q4f32.onnx | ~899 MB | INT4 MatMul + FP32 embedding & lm_head |
Notes:
- q4f16 is the variant used by the in-browser WebGPU demo. Its RMSNorm is pre-fused into (Skip)SimplifiedLayerNormalization so ONNX Runtime Web loads it at the default optimization level. Its INT4 weights are quantized from the FP32 master (identical INT4 blobs to q4f32; only the scales differ: FP16 vs FP32).
- q4 quantizes the token embedding and (tied) lm_head as well, giving the smallest footprint at a small quality cost.
- model_q8.onnx (weight-only INT8 via MatMulNBits) and model_q4f32.onnx are addressable by dtype only in newer Transformers.js builds; the dynamic-INT8 model_quantized.onnx is what the bundled dtype: "q8" maps to.
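For Python users who pick a file by hand, the table and notes above can be transcribed into a small lookup. This is only a sketch: `VARIANT_FILES`, `UNLABELED_FILES`, and `files_for_dtype` are hypothetical names, not part of any library, and the paths assume every ONNX file (including `model.onnx_data`) sits under `onnx/` as listed in this card.

```python
# dtype label -> repository-relative files, transcribed from the variants table.
VARIANT_FILES = {
    "fp32": ["onnx/model.onnx", "onnx/model.onnx_data"],  # graph + external weights
    "fp16": ["onnx/model_fp16.onnx"],
    "q8": ["onnx/model_quantized.onnx"],  # dynamic INT8
    "q4": ["onnx/model_q4.onnx"],
    "q4f16": ["onnx/model_q4f16.onnx"],
    "q4f32": ["onnx/model_q4f32.onnx"],
}

# Listed in the table without a dtype label, so it is addressed by file name only.
UNLABELED_FILES = ["onnx/model_q8.onnx"]


def files_for_dtype(dtype):
    """Return the files to fetch for a dtype label (hypothetical helper)."""
    try:
        return list(VARIANT_FILES[dtype])
    except KeyError:
        known = ", ".join(sorted(VARIANT_FILES))
        raise ValueError(f"unknown dtype {dtype!r}; expected one of: {known}") from None
```

Each returned path could then be passed to `hf_hub_download(model_id, path)`; for `fp32` both files need to end up in the same directory so that the graph can find its external weights.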
The variant set follows the onnx-community / LiquidAI/LFM2.5-350M-ONNX layout.
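To make "block-quantized with block size 32" concrete, here is a minimal pure-Python sketch of asymmetric 4-bit block quantization. It is an illustration under simplifying assumptions, not the ONNX Runtime implementation: the real MatMulNBits operator packs codes into bytes and stores an integer zero point, whereas this sketch keeps the block minimum as a float offset. All function names are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Map one block of floats to unsigned `bits`-bit codes plus (scale, offset)."""
    qmax = (1 << bits) - 1  # 15 for INT4
    lo, hi = min(weights), max(weights)
    # A constant block has hi == lo; fall back to scale 1.0 to avoid dividing by zero.
    scale = (hi - lo) / qmax or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from the codes of one block."""
    return [lo + c * scale for c in codes]


def quantize_row(row, block_size=32, bits=4):
    """Split a weight row into fixed-size blocks, each with its own scale and offset."""
    return [
        quantize_block(row[i:i + block_size], bits)
        for i in range(0, len(row), block_size)
    ]
```

Because every block carries its own scale, the reconstruction error of a weight is bounded by half of that block's scale. Storing the scales in FP16 instead of FP32 changes only this per-block metadata, which is consistent with the note that q4f16 and q4f32 share the same INT4 blobs.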
Model files
OCC-RAG-0.6B-ONNX/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json # chat_template inlined (Transformers.js reads it here)
├── special_tokens_map.json
├── added_tokens.json
├── vocab.json
├── merges.txt
├── quantize_config.json
└── onnx/
    ├── model.onnx # fp32 (+ model.onnx_data)
    ├── model_fp16.onnx
    ├── model_q4.onnx
    ├── model_q4f16.onnx # recommended for WebGPU
    ├── model_q4f32.onnx
    ├── model_q8.onnx
    └── model_quantized.onnx # dynamic int8 (dtype "q8")
Input / output format
OCC-RAG uses a structured RAG prompt with special tokens. The chat template accepts a
documents= kwarg and emits the structural tokens automatically: pass the user message
as plain text and the sources as a list of {"text": ...} dicts. The question is wrapped
in <|query_start|> … <|query_end|> and each source in
<|source_start|><|source_id|>N … <|source_end|>.
The response has five sections, each delimited by special tokens: query analysis →
source analysis → reasoning → status (ANSWERABLE / UNANSWERABLE) → answer. Parse the
final answer from <|answer_start|> … <|answer_end|>. Keep skip_special_tokens=False if
you need to read the structural tokens out of the raw output.
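Parsing the raw output could look like the sketch below. Only the `<|answer_start|>` / `<|answer_end|>` tokens are named in this card, so the status is recovered heuristically as the last ANSWERABLE / UNANSWERABLE keyword that appears before the answer section; that heuristic and the name `parse_response` are assumptions made for illustration.

```python
import re

# Answer section delimiters as named in this card.
ANSWER_RE = re.compile(r"<\|answer_start\|>(.*?)<\|answer_end\|>", re.DOTALL)
# The word boundaries keep ANSWERABLE from matching inside UNANSWERABLE.
STATUS_RE = re.compile(r"\b(UNANSWERABLE|ANSWERABLE)\b")


def parse_response(raw):
    """Extract status and final answer from raw output decoded with special tokens kept."""
    answer_match = ANSWER_RE.search(raw)
    # Assumption: the status keyword is the last one emitted before the answer section.
    before_answer = raw[:answer_match.start()] if answer_match else raw
    statuses = STATUS_RE.findall(before_answer)
    return {
        "status": statuses[-1] if statuses else None,
        "answer": answer_match.group(1).strip() if answer_match else None,
    }
```

Both fields come back as `None` when the corresponding section is missing, for example when generation was cut off by `max_new_tokens` before the answer section was written.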
We recommend greedy decoding (do_sample=False), the training/evaluation default baked into generation_config.json.
Usage: Transformers.js (browser / Node, WebGPU)
import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "occ-ai/OCC-RAG-0.6B-ONNX", {
  dtype: "q4f16", // WebGPU-friendly; or "q8" / "q4" / "fp16"
  device: "webgpu",
});

const question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?";
const documents = [
  { text: "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone." },
  { text: "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there." },
  { text: "Nova Scotia is a province on the east coast of Canada." },
];

// The chat template injects the <|query_*|> / <|source_*|> structural tokens.
const text = generator.tokenizer.apply_chat_template(
  [{ role: "user", content: question }],
  { documents, add_generation_prompt: true, tokenize: false },
);

const output = await generator(text, {
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: false }),
});
console.log(output[0].generated_text);
A ready-to-run WebGPU chat demo (Vite + Transformers.js) lives in
huggingface/transformers.js-examples → occ-rag-webgpu.
Usage: ONNX Runtime (Python)
pip install onnxruntime transformers numpy huggingface_hub
import json

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

model_id = "occ-ai/OCC-RAG-0.6B-ONNX"
onnx_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
tok = AutoTokenizer.from_pretrained(model_id)
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

# The chat template injects the structural tokens around the question and sources.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents, tokenize=False, add_generation_prompt=True,
)
input_ids = np.array([tok.encode(prompt, add_special_tokens=False)], dtype=np.int64)

# config.json holds num_hidden_layers / num_key_value_heads / head_dim for the cache shapes.
with open(hf_hub_download(model_id, "config.json")) as f:
    cfg = json.load(f)

# Greedy decode with KV-cache: feed input_ids + attention_mask + position_ids and the
# past_key_values.{i}.{key,value} inputs (empty on the first step), then loop feeding the
# present.* outputs back in. Stop on eos ids 151643 / 151645 / 151683.
The INT4/INT8 ONNX graphs are weight-only quantized (MatMulNBits / GatherBlockQuantized) and carry a KV-cache interface.
model_q4f16.onnx expects FP16 KV-cache I/O; the others use FP32. See config.json (num_hidden_layers, num_key_value_heads, head_dim) for the cache tensor shapes [batch, kv_heads, seq, head_dim].
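The decoding loop described in the comments above could be sketched as follows. This is a minimal illustration for batch size 1, not a reference implementation: `greedy_decode` is a hypothetical helper, the output name `logits` and the `present.{i}.{key,value}` naming are assumptions based on the description in this card, and only inputs that the session actually declares are fed.

```python
import numpy as np

EOS_IDS = frozenset({151643, 151645, 151683})  # stop ids listed in this card


def greedy_decode(session, input_ids, num_layers, kv_heads, head_dim,
                  max_new_tokens=512, kv_dtype=np.float32, eos_ids=EOS_IDS):
    """Greedy KV-cache decoding with an onnxruntime.InferenceSession-like object."""
    input_names = {i.name for i in session.get_inputs()}
    output_names = [o.name for o in session.get_outputs()]

    # First step: empty cache, i.e. sequence axis of length 0.
    past = {
        f"past_key_values.{i}.{kind}": np.zeros((1, kv_heads, 0, head_dim), dtype=kv_dtype)
        for i in range(num_layers)
        for kind in ("key", "value")
    }

    generated = []
    step_ids = np.asarray(input_ids, dtype=np.int64)  # full prompt on the first step
    total_len = 0  # number of tokens already held in the cache
    for _ in range(max_new_tokens):
        new_len = step_ids.shape[1]
        feeds = {
            "input_ids": step_ids,
            "attention_mask": np.ones((1, total_len + new_len), dtype=np.int64),
            "position_ids": np.arange(total_len, total_len + new_len, dtype=np.int64)[None, :],
            **past,
        }
        # Skip any input the exported graph does not declare.
        feeds = {name: value for name, value in feeds.items() if name in input_names}
        outputs = dict(zip(output_names, session.run(None, feeds)))
        total_len += new_len

        next_id = int(outputs["logits"][0, -1].argmax())  # greedy: highest logit
        if next_id in eos_ids:
            break
        generated.append(next_id)

        # Feed the updated cache back in under the matching input names.
        past = {
            name.replace("present.", "past_key_values.", 1): value
            for name, value in outputs.items()
            if name.startswith("present.")
        }
        step_ids = np.array([[next_id]], dtype=np.int64)
    return generated
```

With the session and `input_ids` from the snippet above, and `num_hidden_layers`, `num_key_value_heads`, and `head_dim` read from config.json, the returned ids can be turned back into text with `tok.decode(ids, skip_special_tokens=False)`. For model_q4f16.onnx, pass `kv_dtype=np.float16`.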
Limitations
- Context-grounded only. Trained to answer from the supplied sources and to ignore parametric knowledge; it is not a general-purpose chat or knowledge model.
- Reasoning depth. Training/evaluation are capped at three-hop reasoning; longer chains are out of distribution.
- Quantization. The INT4 variants (q4, q4f16) trade a small amount of quality for size/speed; prefer fp16 / q8 when accuracy matters most.
License
Released under the MIT License, inherited from the base model.
Citation
@misc{savkin2026occragoptimalcognitivecore,
title = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
author = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
year = {2026},
eprint = {2606.00683},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2606.00683}
}