Q-RAG-50M-Sovereign — the sovereign retrieval head that punches above its weight

A 50M-parameter relevance scorer that beats BGE-reranker-large (560M, 11× larger) on in-distribution refusal and ties or beats 2 of the 10 tested reranker/embedding baselines on out-of-distribution BEIR, running on CPU and fully sovereign.

What this model does, in one sentence

Given a USER query and a CANDIDATE passage, Q-RAG outputs exactly one character — 1 if the passage is relevant to the query, 0 if it is not — making it a drop-in relevance filter for any RAG (retrieval-augmented generation) pipeline.

Headline: where Q-RAG wins, where it loses, why both matter

In-distribution (10-domain Q-RAG holdout, 30 rows): #1 of 11

Q-RAG was trained on cross-domain refusal as a first-class objective — every query paired with both same-domain near-miss adversaries and cross-domain off-topic passages. On the holdout that tests exactly this, Q-RAG beats every model we evaluated, including BGE-reranker-large (560M) and BGE-reranker-v2-m3 (568M) — 11× our parameter count.

Rank	Model	Params	Acc	Carry-12	Cross-18
1	Q-RAG-50M-Sovereign	50M	100.0%	100.0%	100.0%
2	bge-reranker-large	560M	96.7%	100.0%	94.4%
2	bge-reranker-v2-m3	568M	96.7%	100.0%	94.4%
4	ms-marco-MiniLM-L-6-v2	23M	93.3%	100.0%	88.9%
4	ms-marco-MiniLM-L-12-v2	33M	93.3%	100.0%	88.9%
4	mxbai-rerank-xsmall-v1	70M	93.3%	100.0%	88.9%
4	gte-reranker-modernbert-base	149M	93.3%	100.0%	88.9%
8	e5-small-v2	33M	90.0%	100.0%	83.3%
8	bge-reranker-base	278M	90.0%	100.0%	83.3%
10	bge-small-en-v1.5	33M	86.7%	100.0%	77.8%
10	bge-m3	568M	86.7%	91.7%	83.3%

All baselines are at their oracle threshold (the threshold chosen to maximize their accuracy on the full holdout — a generous upper bound). Q-RAG outputs 1 or 0 directly with no threshold to tune.
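
To make the oracle threshold concrete, here is a minimal sketch of how such a threshold can be picked for a score-based baseline: try every observed score as a cut point and keep the one with the highest accuracy on the labeled rows. The function name `oracle_threshold` and the sample scores are illustrative assumptions, not the benchmark script or its data.

```python
from typing import Sequence, Tuple


def oracle_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, accuracy) that maximizes accuracy of `score >= threshold`."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    # Candidate cut points: every observed score, plus one above the maximum
    # so that "predict 0 for everything" is also considered.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum(int(s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:  # strict ">" keeps the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical scores for illustration only (not taken from the benchmark).
scores = [0.91, 0.80, 0.35, 0.62, 0.10]
labels = [1, 1, 0, 0, 0]
threshold, accuracy = oracle_threshold(scores, labels)
print(threshold, accuracy)
```

Because the threshold is chosen with access to the labels of the very rows being scored, the resulting accuracy is an upper bound on what a threshold fixed in advance would achieve.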

Out-of-distribution (BEIR NFCorpus + SciFact slice, 250 rows): rank 9 of 11, but the gap is small

We also tested on BEIR, a public IR benchmark. The slice combines NFCorpus (medical literature retrieval) and SciFact (scientific claim verification), both domains Q-RAG was not trained on. It uses 25 queries from each dataset with 1 positive + 4 hard negatives per query, for 2 × 25 × 5 = 250 rows.

Rank	Model	Params	BEIR Acc	Lat (ms)
1	bge-small-en-v1.5	33M	93.2%	38
2	ms-marco-MiniLM-L-6-v2	23M	92.4%	19
2	gte-reranker-modernbert-base	149M	92.4%	147
4	e5-small-v2	33M	92.0%	37
5	bge-reranker-v2-m3	568M	90.8%	391
5	bge-m3	568M	90.8%	396
7	ms-marco-MiniLM-L-12-v2	33M	90.4%	38
7	bge-reranker-base	278M	90.4%	119
9	Q-RAG-50M-Sovereign	50M	89.6%	168
9	mxbai-rerank-xsmall-v1	70M	89.6%	919
11	bge-reranker-large	560M	88.4%	392

Honest reading. On medical+scientific OOD, Q-RAG ties for rank 9 of 11 at 89.6%. But the field is tight: only 3.6 points separate the leader (bge-small-en-v1.5 at 93.2%) from Q-RAG, and Q-RAG outright beats BGE-reranker-large (560M, 11× larger) by 1.2 points and ties mxbai-rerank-xsmall-v1. BGE-reranker-v2-m3 and bge-m3 (568M each) finish only 1.2 points ahead of us at over 10× the size.
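
The gaps quoted above can be re-derived from the BEIR table with a few lines of arithmetic. This is only a sanity-check sketch: the numbers are copied from the table, the ratios use the nominal 50M figure from the Params column, and the helper name `compare` is an illustrative assumption.

```python
# Values copied from the BEIR table: model -> (parameters in millions, accuracy %).
beir = {
    "bge-small-en-v1.5": (33, 93.2),
    "bge-reranker-v2-m3": (568, 90.8),
    "Q-RAG-50M-Sovereign": (50, 89.6),
    "bge-reranker-large": (560, 88.4),
}

qrag_params, qrag_acc = beir["Q-RAG-50M-Sovereign"]


def compare(name: str) -> tuple:
    """Return (accuracy gap in points, parameter ratio) relative to Q-RAG."""
    params, acc = beir[name]
    return round(acc - qrag_acc, 1), round(params / qrag_params, 1)


gaps = {name: compare(name) for name in beir if name != "Q-RAG-50M-Sovereign"}
for name, (gap, ratio) in gaps.items():
    print(f"{name}: {gap:+.1f} points vs Q-RAG at {ratio:.1f}x the parameters")
```

A positive gap means the baseline is ahead of Q-RAG; a negative gap means Q-RAG is ahead.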

Models with 11× our parameters are not 11× better at this task — the curve flattens hard. That's what "punching above your weight" looks like: a 50M model trading punches with 560M-parameter rerankers on data it wasn't even trained on, while still being #1 on the data it was trained for.

How Q-RAG punches above its weight

Three technical choices, applied together, produce the result above. None are individually novel; the combination is what works at 50M params.

1. Cross-domain refusal as a first-class training objective, not a side effect

Most retrieval models — embeddings and rerankers alike — are trained on positive ranking signal (MS MARCO click-through, NLI entailment, etc.). They learn what "more relevant" looks like, then hope the threshold separates the relevant from the irrelevant.

Q-RAG was trained explicitly on cross-domain off-topic refusal: every query in the corpus was paired against 5 passages drawn from other domains, labeled 0, and weighted higher than the positives in the loss. The model learned that the default answer for "wrong domain" is to refuse, not to score the passage low and hope the threshold catches it. The result: 100% on the cross-domain refusal subset (Cross-18), where bge-m3 (568M) drops to 83.3%.
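
As an illustration of what "weighted higher than the positives" can mean in a loss, here is a minimal sketch of a weighted binary cross-entropy in plain Python. The weight values `POS_W` and `CROSS_NEG_W` are hypothetical: the card states that negatives were up-weighted but does not publish the actual values or the training code.

```python
import math
from typing import Sequence

# Hypothetical weights, chosen only to show the effect of up-weighting negatives.
POS_W = 1.0
CROSS_NEG_W = 2.0


def weighted_bce(probs: Sequence[float], labels: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted mean binary cross-entropy; probs[i] is the predicted P(label == 1)."""
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += w * -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


# One confident positive, and one cross-domain negative the model wrongly leans "1" on.
probs = [0.9, 0.6]
labels = [1, 0]
uniform = weighted_bce(probs, labels, [1.0, 1.0])
weighted = weighted_bce(probs, labels, [POS_W, CROSS_NEG_W])
print(f"uniform={uniform:.4f} weighted={weighted:.4f}")
```

With the negative up-weighted, the mistaken "relevant" lean on the off-topic passage contributes a larger share of the loss, so the gradient pushes harder toward refusal.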

2. Adversarial same-domain near-miss negatives

The hardest failure for an embedding model is a same-shape-but-wrong-specific-answer passage. "Paris is the capital of France" sits near "Berlin is the capital of Germany" in embedding space — same sentence structure, same topic family, same vocabulary register. The cosine similarity says yes; relevance says no.

For every topic in training, Q-RAG sees 4–6 same-domain wrong-specific-answer passages weighted even higher than the positives. The model learned the shape of "wrong-but-shaped-right" and refuses cleanly. This is the failure mode that drives most production RAG hallucinations.
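
The pairing scheme described in choices 1 and 2 can be sketched as a small data-construction routine. Everything here is an assumption for illustration: the `Example` record, the `build_examples` helper, and the toy passage pool are hypothetical, and random sampling of same-domain passages is only a crude stand-in for curated near-miss adversaries. The crystal corpus generator itself is not public.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    query: str
    passage: str
    label: int  # 1 = relevant, 0 = refuse
    kind: str   # "positive", "near_miss", or "cross_domain"


def build_examples(query: str, positive: str, domain: str,
                   pool: Dict[str, List[str]], rng: random.Random,
                   n_near: int = 4, n_cross: int = 5) -> List[Example]:
    """Pair one query with its positive, same-domain near-misses, and cross-domain negatives."""
    same = [p for p in pool[domain] if p != positive]
    other = [p for d, ps in pool.items() if d != domain for p in ps]
    near = rng.sample(same, min(n_near, len(same)))
    cross = rng.sample(other, min(n_cross, len(other)))
    examples = [Example(query, positive, 1, "positive")]
    examples += [Example(query, p, 0, "near_miss") for p in near]
    examples += [Example(query, p, 0, "cross_domain") for p in cross]
    return examples


# Toy pool for illustration only.
pool = {
    "geography": ["Berlin is the capital of Germany.",
                  "Paris is the capital of France.",
                  "Rome is the capital of Italy."],
    "software": ["git commit records staged changes.",
                 "pip installs Python packages."],
    "rivers": ["The Nile is the longest river."],
}
examples = build_examples("capital of Germany", "Berlin is the capital of Germany.",
                          "geography", pool, random.Random(0))
for e in examples:
    print(e.label, e.kind, e.passage)
```

Keeping the `kind` field on each example makes it straightforward to apply a different loss weight per negative type.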

3. Binary token output, not a score

Embedding models output a vector; you compare via cosine and choose a threshold. Rerankers output a logit; you choose a threshold. Both leave the calibration as the operator's problem — and the right threshold depends on the domain, the retriever upstream, and the size of the candidate set.

Q-RAG outputs a single token: 1 or 0. No threshold to tune. No calibration per pipeline. Drop it in after your dense retriever; pass through every passage that scores 1; refuse if none do. The training objective is binary cross-entropy on that exact token; the inference path is a single argmax on the next-token distribution. No magic.

The result is a small, fast head you put after your dense retriever to filter relevant passages before paying token cost on a 7B+ answer model.

Are we new? Yes — and we trained from a sovereign base

Q-RAG is 53.5M parameters and was full-fine-tuned from tjarvis91/qovaryx-50m-scratch-base — a base we pretrained ourselves from random initialization on 491.5M tokens with our own BPE tokenizer (english_v1, vocab 32000).

Not SmolLM2. Not Qwen. Not Llama. Not Mistral. Not Phi. No borrowed foundation model. No closed-source weights. Every parameter traces back to a Qovaryx training run on Qovaryx hardware.

That matters for two reasons:

No license entanglement — Apache 2.0 all the way down, full audit trail in this repo.
No baked-in priors from someone else's training set — when we say Q-RAG was trained on cross-domain refusal, we mean it didn't see the BEIR test set or anything contaminated with it during base pretraining either.

What problem this actually solves

You're already running RAG. Your dense retriever returns top-k passages. Some are relevant. Some are not. You don't want to pay for an LLM call on the not-relevant ones, and you don't want them in the answer model's context wasting attention. Q-RAG is the relevance filter between retrieve and generate.

Step	What you had	What Q-RAG adds
1. Retrieve top-k passages	dense embedding model	(unchanged)
2. Filter for relevance	— usually skipped	Q-RAG: 1 forward pass per passage, output 1 or 0
3. Generate answer	big LLM with all k passages	big LLM with only the relevant ones

Pipeline impact:

Cheaper — generation cost only on relevant passages.
More accurate — fewer red-herring passages in the answer model's context.
More refusable — if Q-RAG drops every passage, the system knows to say "I don't have evidence to answer that" instead of hallucinating.
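
The retrieve, filter, generate flow above can be sketched as follows. The names `filter_passages`, `answer`, `stub_score`, and `stub_generate` are hypothetical; `score` stands for any callable with the same (query, passage) -> 0 or 1 signature as the `score` function in the loading section, and the stubs exist only so the sketch runs without the model.

```python
from typing import Callable, List

REFUSAL = "I don't have evidence to answer that."


def filter_passages(query: str, passages: List[str],
                    score: Callable[[str, str], int]) -> List[str]:
    """Keep only the passages the relevance filter marks as 1."""
    return [p for p in passages if score(query, p) == 1]


def answer(query: str, passages: List[str],
           score: Callable[[str, str], int],
           generate: Callable[[str, List[str]], str]) -> str:
    """Filter retrieved passages, then generate; refuse if nothing survives."""
    kept = filter_passages(query, passages, score)
    if not kept:
        return REFUSAL  # no generation call is made at all
    return generate(query, kept)


# Stand-ins for the real scorer and answer model (illustration only).
def stub_score(query: str, passage: str) -> int:
    return 1 if "Germany" in passage else 0


def stub_generate(query: str, kept: List[str]) -> str:
    return f"answer from {len(kept)} passage(s)"


docs = ["Berlin is the capital of Germany.", "The Nile is the longest river."]
print(answer("capital of Germany", docs, stub_score, stub_generate))
print(answer("capital of Germany", docs[1:], stub_score, stub_generate))
```

The refusal branch is what makes the pipeline "more refusable": when every passage is dropped, the answer model is never called.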

How to load it (Python)

import torch
from tokenizers import Tokenizer
from bleeding_edge.model.decoder import FinanceDecoder, DecoderConfig

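tok = Tokenizer.from_file("tokenizer.json")
# weights_only=False unpickles arbitrary Python objects, so only load a checkpoint you trust.
ckpt = torch.load("pytorch_model.pt", map_location="cpu", weights_only=False)
# Keep only the config keys that DecoderConfig actually declares.
cfg = DecoderConfig(**{k: v for k, v in ckpt["model_cfg"].items() if k in DecoderConfig.__dataclass_fields__})
cfg.vocab_size = tok.get_vocab_size()
model = FinanceDecoder(cfg).eval()
# Strip the "_orig_mod." prefix that torch.compile adds to state-dict keys.
state = {k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()}
# strict=False tolerates missing or unexpected keys instead of raising.
model.load_state_dict(state, strict=False)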

SYSTEM = (
    "You are Q-Retriever. Given a USER query and a CANDIDATE passage, "
    "decide whether the passage is relevant to the query. "
    "Output exactly one character: 1 if relevant, 0 if not relevant. "
    "Refuse to invent relevance: if the passage does not address the query, output 0."
)

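def score(query: str, passage: str) -> int:
    """Return 1 if the model judges the passage relevant to the query, else 0."""
    prompt = f"{SYSTEM}\n\nUSER: Q: {query}\n\nPASSAGE:\n{passage}\n\nASSISTANT: "
    ids = tok.encode(prompt).ids
    cur = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        # Greedy decode of a single token: argmax over the logits at the last position.
        nxt = int(torch.argmax(model(cur, return_decision=False).logits[:, -1, :], dim=-1))
    # Any token other than "1" is treated as a refusal (0).
    return 1 if tok.decode([nxt]).strip() == "1" else 0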

print(score("capital of Germany", "Berlin is the capital of Germany."))  # 1
print(score("capital of Germany", "Paris is the capital of France."))     # 0
print(score("how to git commit", "The Nile is the longest river."))       # 0

Architecture (Qovaryx proprietary FinanceDecoder)

53.5M parameters
12 decoder blocks, d_model = 512, n_head = 8, GQA n_kv_head = 2
SwiGLU FFN, RoPE positional, RMSNorm
Multi-token prediction (MTP) auxiliary heads
Decision head for routed-decision tasks
Tokenizer: Qovaryx english_v1 BPE, vocab 32000 (in-house)
Initialized from qovaryx-50m-scratch-base at step 60000 (491.5M pretraining tokens)
Full fine-tune (no LoRA, no QLoRA, no adapter): every parameter was updated on the Qovaryx Q-RAG crystal corpus
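
As a rough worked example, the listed dimensions are enough to estimate the attention and embedding parameter counts. This sketch assumes bias-free projections and the usual Q/K/V/O layout, which the card does not confirm. The FFN width, MTP heads, and decision head sizes are not published, so the remainder of the 53.5M total is not broken down.

```python
# Dimensions taken from the architecture list.
d_model, n_layer, n_head, n_kv_head, vocab = 512, 12, 8, 2, 32000

head_dim = d_model // n_head   # width of one attention head
kv_dim = n_kv_head * head_dim  # total K (or V) width under GQA

# Q and O projections are d_model x d_model; K and V are d_model x kv_dim.
attn_per_block = 2 * d_model * d_model + 2 * d_model * kv_dim
attn_total = n_layer * attn_per_block

# For comparison: standard multi-head attention with full-width K and V.
mha_total = n_layer * 4 * d_model * d_model

embedding = vocab * d_model  # token embedding table

print(f"head_dim={head_dim}, kv_dim={kv_dim}")
print(f"attention params (GQA): {attn_total:,}")
print(f"attention params (MHA, for comparison): {mha_total:,}")
print(f"embedding params: {embedding:,}")
```

Under these assumptions, GQA with 2 KV heads saves about 4.7M attention parameters relative to full multi-head attention, and the embedding table alone accounts for roughly 16.4M parameters.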

What this model is NOT

Not a sentence embedding model. No vector output. Use it after your dense retriever, not instead.
Not a general-purpose chatbot. Free-text generation outside the relevance-scoring task surface will degrade.
Not the top BEIR scorer — bge-small-en-v1.5 is 3.6 points ahead on BEIR. If your retrieval is exclusively medical/scientific OOD, run that baseline.
Not reproducible from this card. Weights, holdouts, and benchmark numbers are public; the crystal corpus generator and training hyperparameters are not.

License & posture

Apache 2.0 for the published weights, model card, holdouts, and benchmark JSONs.

The Qovaryx scratch base build pipeline, the Q-RAG crystal corpus generator, the eval gate constants, the cluster routing policy, and the protected runtime entrypoint are Qovaryx proprietary technology and are not included.

Reproduction & artifacts in this repo

pytorch_model.pt — Q-RAG weights (v10, 205 MB)
tokenizer.json — Qovaryx english_v1 BPE
config.json — model config
holdout_eval.json — full per-row in-house holdout result (30/30 = 100%)
benchmark_vs_embeddings.json — in-house holdout vs 10 baselines (Q-RAG #1)
benchmark_beir.json — BEIR NFCorpus+SciFact slice vs same baselines
Reproduction scripts: scripts/benchmark_q_rag_vs_embeddings.py and scripts/benchmark_q_rag_vs_rerankers_beir.py in the upstream research repo

Sibling specialists in the Qovaryx Compact Specialist Suite

All ten specialists share the qovaryx-50m-scratch-base and the same audit discipline. Use one directly; use all ten through the cluster shell.

Q-Triage — ticket routing
Q-DocCite — document citation
Q-Invoice — invoice extraction
Q-ToolCall — agent tool-calls
Q-Meeting — meeting structuring
Q-FinCite — 10-K/10-Q citation
Q-CmdSafe — command safety triage
Q-SheetExtract — spreadsheet extraction
Q-Coder — Python code skeletons
Q-RAG (this model) — relevance filter for RAG

Reproduction invitation

If you run Q-RAG against a model not in our table (Cohere Rerank, Voyage Rerank, jina-reranker-v2, ColBERT, or anything else), please open a discussion on this repo with the numbers. We'll add it to the card, honestly, whichever direction the result falls. The holdouts are in this repo; the benchmark scripts are in the upstream research repo.

Official site & community

The full Qovaryx runtime that orchestrates this specialist alongside the other nine ships from:

Site: https://qovaryx.jehorizon.com
Download (desktop beta): https://qovaryx.jehorizon.com/download.html
Research devlog: https://qovaryx.jehorizon.com/research
Community Discord: https://discord.gg/PtuHZDv5ju
Ko-fi (we cover GPU bills): https://ko-fi.com/tjarvis91
Open research repo: https://github.com/thron-j/qovaryx-ai-research

If you find a failure mode this card doesn't cover, open a discussion or come to the Discord — that's how the next crystal corpus gets written.
