- Q-RAG-50M-Sovereign: the sovereign retrieval head that punches above its weight
- What this model does, in one sentence
- Headline: where Q-RAG wins, where it loses, why both matter
- How Q-RAG punches above its weight
- Are we new? Yes, and we trained from a sovereign base
- What problem this actually solves
- How to load it (Python)
- Architecture (Qovaryx proprietary FinanceDecoder)
- What this model is NOT
- License & posture
- Reproduction & artifacts in this repo
- Sibling specialists in the Qovaryx Compact Specialist Suite
- Reproduction invitation
- Official site & community
Q-RAG-50M-Sovereign: the sovereign retrieval head that punches above its weight
A 50M-parameter relevance scorer that beats BGE-reranker-large (560M, 11× larger) on in-distribution refusal and ties or beats 2 of the 10 baseline rerankers/embeddings on out-of-distribution BEIR, all at 50M params, on CPU, fully sovereign.
What this model does, in one sentence
Given a USER query and a CANDIDATE passage, Q-RAG outputs exactly one character (1 if the passage is relevant to the query, 0 if it is not), making it a drop-in relevance filter for any RAG (retrieval-augmented generation) pipeline.
Headline: where Q-RAG wins, where it loses, why both matter
In-distribution (10-domain Q-RAG holdout, 30 rows): #1 of 11
Q-RAG was trained on cross-domain refusal as a first-class objective: every query was paired with both same-domain near-miss adversaries and cross-domain off-topic passages. On the holdout that tests exactly this, Q-RAG beats every model we evaluated, including BGE-reranker-large (560M) and BGE-reranker-v2-m3 (568M), both roughly 11× our parameter count.
| Rank | Model | Params | Acc | Carry-12 | Cross-18 |
|---|---|---|---|---|---|
| 1 | Q-RAG-50M-Sovereign | 50M | 100.0% | 100.0% | 100.0% |
| 2 | bge-reranker-large | 560M | 96.7% | 100.0% | 94.4% |
| 2 | bge-reranker-v2-m3 | 568M | 96.7% | 100.0% | 94.4% |
| 4 | ms-marco-MiniLM-L-6-v2 | 23M | 93.3% | 100.0% | 88.9% |
| 4 | ms-marco-MiniLM-L-12-v2 | 33M | 93.3% | 100.0% | 88.9% |
| 4 | mxbai-rerank-xsmall-v1 | 70M | 93.3% | 100.0% | 88.9% |
| 4 | gte-reranker-modernbert-base | 149M | 93.3% | 100.0% | 88.9% |
| 8 | e5-small-v2 | 33M | 90.0% | 100.0% | 83.3% |
| 8 | bge-reranker-base | 278M | 90.0% | 100.0% | 83.3% |
| 10 | bge-small-en-v1.5 | 33M | 86.7% | 100.0% | 77.8% |
| 10 | bge-m3 | 568M | 86.7% | 91.7% | 83.3% |
All baselines are at their oracle threshold (the threshold chosen to maximize their accuracy on the full holdout, which is a generous upper bound). Q-RAG outputs 1 or 0 directly with no threshold to tune.
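To make "oracle threshold" concrete, here is a minimal sketch of how such a threshold could be picked for a score-producing baseline. The function name `oracle_threshold`, the sweep over observed scores, and the sample scores are assumptions for illustration; the benchmark scripts may select the threshold differently.

```python
from typing import Sequence, Tuple

def oracle_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, accuracy) maximizing accuracy for 'predict 1 iff score >= threshold'."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    # Candidate cut points: every observed score, plus one above the maximum
    # so that "predict 0 for everything" is also considered.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum(int(s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical reranker scores for five query-passage pairs (not benchmark data).
scores = [0.91, 0.15, 0.62, 0.58, 0.07]
labels = [1, 0, 1, 0, 0]
threshold, accuracy = oracle_threshold(scores, labels)
print(threshold, accuracy)  # 0.62 1.0
```

Because the threshold is fitted on the same rows it is evaluated on, the resulting accuracy is an upper bound on what a fixed, pre-chosen threshold would reach.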
Out-of-distribution (BEIR NFCorpus + SciFact slice, 250 rows): rank 9 of 11, but the gap is tiny
We also tested on BEIR, a public IR benchmark. The slice combines NFCorpus (medical literature retrieval) and SciFact (scientific claim verification), two domains Q-RAG was not trained on. Each dataset contributes 25 queries, with 1 positive + 4 hard negatives per query (2 × 25 × 5 = 250 rows).
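The slice layout can be sketched as follows. The row keys (`query`, `passage`, `label`) and the helper names are hypothetical; the published benchmark JSON may use a different schema.

```python
from typing import Callable, Dict, List

def build_rows(query: str, positive: str, hard_negatives: List[str]) -> List[Dict]:
    """One labeled row per (query, passage) pair: 1 positive, then the negatives."""
    rows = [{"query": query, "passage": positive, "label": 1}]
    rows += [{"query": query, "passage": p, "label": 0} for p in hard_negatives]
    return rows

def accuracy(score_fn: Callable[[str, str], int], rows: List[Dict]) -> float:
    """Fraction of rows where the binary prediction matches the label."""
    correct = sum(score_fn(r["query"], r["passage"]) == r["label"] for r in rows)
    return correct / len(rows)

# Slice size implied by the description: 2 datasets x 25 queries x (1 + 4) passages.
rows_total = 2 * 25 * (1 + 4)
print(rows_total)  # 250

# Toy query with placeholder passages (not BEIR data).
rows = build_rows("q", "pos", ["n1", "n2", "n3", "n4"])
always_refuse = lambda query, passage: 0
print(accuracy(always_refuse, rows))  # 0.8
```

With a 1:4 positive-to-negative ratio, a scorer that always outputs 0 already reaches 80% accuracy, which is useful context when reading the table below.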
| Rank | Model | Params | BEIR Acc | Lat (ms) |
|---|---|---|---|---|
| 1 | bge-small-en-v1.5 | 33M | 93.2% | 38 |
| 2 | ms-marco-MiniLM-L-6-v2 | 23M | 92.4% | 19 |
| 2 | gte-reranker-modernbert-base | 149M | 92.4% | 147 |
| 4 | e5-small-v2 | 33M | 92.0% | 37 |
| 5 | bge-reranker-v2-m3 | 568M | 90.8% | 391 |
| 5 | bge-m3 | 568M | 90.8% | 396 |
| 7 | ms-marco-MiniLM-L-12-v2 | 33M | 90.4% | 38 |
| 7 | bge-reranker-base | 278M | 90.4% | 119 |
| 9 | Q-RAG-50M-Sovereign | 50M | 89.6% | 168 |
| 9 | mxbai-rerank-xsmall-v1 | 70M | 89.6% | 919 |
| 11 | bge-reranker-large | 560M | 88.4% | 392 |
Honest reading. On medical+scientific OOD, Q-RAG lands rank 9 of 11 at 89.6%. But the field is tight: only 3.6 points separate the leader (bge-small-en-v1.5 at 93.2%) from Q-RAG, and Q-RAG outright beats BGE-reranker-large (560M, 11× larger) by 1.2 points and ties mxbai-rerank-xsmall-v1. Models like BGE-reranker-v2-m3 and bge-m3 (568M) finish only 1.2 points ahead of us at over 10× the size.
Models with 11× our parameters are not 11× better at this task; the curve flattens hard. That's what "punching above your weight" looks like: a 50M model trading punches with 560M-parameter rerankers on data it wasn't even trained on, while still being #1 on the data it was trained for.
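The size-versus-accuracy comparison can be checked directly from the BEIR table. This is a small sketch that copies four rows from that table and uses the nominal 50M figure for Q-RAG; the `compare` helper is an illustrative name.

```python
# (params in millions, BEIR accuracy in percent), copied from the table above.
beir = {
    "Q-RAG-50M-Sovereign": (50, 89.6),
    "bge-small-en-v1.5": (33, 93.2),
    "bge-reranker-v2-m3": (568, 90.8),
    "bge-reranker-large": (560, 88.4),
}

def compare(model: str, reference: str = "Q-RAG-50M-Sovereign"):
    """Return (parameter ratio, accuracy gap in points) of `model` relative to `reference`."""
    params, acc = beir[model]
    ref_params, ref_acc = beir[reference]
    return round(params / ref_params, 1), round(acc - ref_acc, 1)

for name in beir:
    if name != "Q-RAG-50M-Sovereign":
        ratio, gap = compare(name)
        print(f"{name}: {ratio}x params, {gap:+.1f} points vs Q-RAG")
```

A positive gap means the other model scores higher than Q-RAG on this slice; a negative gap means Q-RAG scores higher.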
How Q-RAG punches above its weight
Three technical choices, applied together, produce the result above. None are individually novel; the combination is what works at 50M params.
1. Cross-domain refusal as a first-class training objective, not a side effect
Most retrieval models, embeddings and rerankers alike, are trained on positive ranking signal (MS MARCO click-through, NLI entailment, etc.). They learn what "more relevant" looks like, then hope the threshold separates the relevant from the irrelevant.
Q-RAG was trained explicitly on cross-domain off-topic refusal: every query in the corpus was paired against 5 passages drawn from other domains, labeled 0, and weighted higher than the positives during the loss computation. The model learned that the default answer for "wrong domain" is refuse, not score it low and hope the threshold catches it. The result: 100% on the cross-domain refusal subset, where bge-m3 (568M) drops to 83.3%.
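As a rough illustration of "weighted higher than the positives", the sketch below computes a class-weighted binary cross-entropy in plain Python. The weight values (2.0 for negatives, 1.0 for positives) are hypothetical, since the training hyperparameters are not published, and the real objective is applied to the model's output token rather than to a standalone probability.

```python
import math
from typing import Sequence

def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 neg_weight: float = 2.0, pos_weight: float = 1.0) -> float:
    """Weighted mean binary cross-entropy; probs are predicted P(label == 1)."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)           # clamp to avoid log(0)
        w = pos_weight if y == 1 else neg_weight  # refusals (label 0) count more
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum

# One confident positive and one under-refused off-topic passage.
probs, labels = [0.9, 0.4], [1, 0]
unweighted = weighted_bce(probs, labels, neg_weight=1.0)
weighted = weighted_bce(probs, labels, neg_weight=2.0)
print(unweighted < weighted)  # True
```

Raising `neg_weight` makes a hesitant refusal (here, P(relevant) = 0.4 on an off-topic passage) cost more than an equally hesitant positive, which pushes the model toward refusing by default.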
2. Adversarial same-domain near-miss negatives
The hardest failure for an embedding model is a same-shape-but-wrong-specific-answer passage. "Paris is the capital of France" sits near "Berlin is the capital of Germany" in embedding space: same sentence structure, same topic family, same vocabulary register. The cosine similarity says yes; relevance says no.
For every topic in training, Q-RAG sees 4 to 6 same-domain wrong-specific-answer passages weighted even higher than the positives. The model learned the shape of "wrong-but-shaped-right" and refuses cleanly. This is the failure mode that drives most production RAG hallucinations.
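One way to picture the training examples described in choices 1 and 2 is the sketch below. The `Example` structure, the `kind` names, and the weight values are all hypothetical; the actual crystal corpus generator is not public.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    query: str
    passage: str
    label: int     # 1 = relevant, 0 = refuse
    kind: str      # "positive", "near_miss", or "cross_domain"
    weight: float  # per-example loss weight

# Hypothetical weights: the card only says negatives outweigh positives.
WEIGHTS = {"positive": 1.0, "cross_domain": 2.0, "near_miss": 3.0}

def build_examples(query: str, positive: str,
                   near_misses: List[str], cross_domain: List[str]) -> List[Example]:
    """Pair one query with its positive, same-domain near-misses, and off-topic passages."""
    examples = [Example(query, positive, 1, "positive", WEIGHTS["positive"])]
    examples += [Example(query, p, 0, "near_miss", WEIGHTS["near_miss"]) for p in near_misses]
    examples += [Example(query, p, 0, "cross_domain", WEIGHTS["cross_domain"]) for p in cross_domain]
    return examples

examples = build_examples(
    "capital of Germany",
    "Berlin is the capital of Germany.",
    near_misses=["Paris is the capital of France."],
    cross_domain=["The Nile is the longest river."],
)
for ex in examples:
    print(ex.kind, ex.label, ex.weight)
```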
3. Binary token output, not a score
Embedding models output a vector; you compare via cosine and choose a threshold. Rerankers output a logit; you choose a threshold. Both leave the calibration as the operator's problem, and the right threshold depends on the domain, the retriever upstream, and the size of the candidate set.
Q-RAG outputs a single token: 1 or 0. No threshold to tune. No calibration per pipeline. Drop it in after your dense retriever; pass through every passage that scores 1; refuse if none do. The training objective is binary cross-entropy on that exact token; the inference path is a single argmax on the next-token distribution. No magic.
The result is a small, fast head you put after your dense retriever to filter relevant passages before paying token cost on a 7B+ answer model.
Are we new? Yes, and we trained from a sovereign base
Q-RAG is 53.5M parameters and was full-fine-tuned from tjarvis91/qovaryx-50m-scratch-base, a base we pretrained ourselves from random initialization on 491.5M tokens with our own BPE tokenizer (english_v1, vocab 32000).
Not SmolLM2. Not Qwen. Not Llama. Not Mistral. Not Phi. No borrowed foundation model. No closed-source weights. Every parameter traces back to a Qovaryx training run on Qovaryx hardware.
That matters for two reasons:
- No license entanglement: Apache 2.0 all the way down, full audit trail in this repo.
- No baked-in priors from someone else's training set: when we say Q-RAG was trained on cross-domain refusal, we mean it didn't see the BEIR test set or anything contaminated with it during base pretraining either.
What problem this actually solves
You're already running RAG. Your dense retriever returns top-k passages. Some are relevant. Some are not. You don't want to pay for an LLM call on the not-relevant ones, and you don't want them in the answer model's context wasting attention. Q-RAG is the relevance filter between retrieve and generate.
| Step | What you had | What Q-RAG adds |
|---|---|---|
| 1. Retrieve top-k passages | dense embedding model | (unchanged) |
| 2. Filter for relevance | (usually skipped) | Q-RAG: 1 forward pass per passage, output 1 or 0 |
| 3. Generate answer | big LLM with all k passages | big LLM with only the relevant ones |
Pipeline impact:
- Cheaper: generation cost only on relevant passages.
- More accurate: fewer red-herring passages in the answer model's context.
- More refusable: if Q-RAG drops every passage, the system knows to say "I don't have evidence to answer that" instead of hallucinating.
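The retrieve, filter, generate flow in the table can be sketched as below. Here `score_fn` stands for any callable with the signature `(query, passage) -> int`, such as the `score` function in the loading section; `toy_score` and `toy_generate` are hypothetical stand-ins so the sketch runs without model weights.

```python
from typing import Callable, List

REFUSAL = "I don't have evidence to answer that."

def filter_passages(query: str, passages: List[str],
                    score_fn: Callable[[str, str], int]) -> List[str]:
    """Keep only the passages the relevance filter marks with 1."""
    return [p for p in passages if score_fn(query, p) == 1]

def answer(query: str, passages: List[str],
           score_fn: Callable[[str, str], int],
           generate_fn: Callable[[str, List[str]], str]) -> str:
    """Filter retrieved passages, then generate; refuse when nothing survives."""
    kept = filter_passages(query, passages, score_fn)
    if not kept:
        return REFUSAL
    return generate_fn(query, kept)

# Stand-ins for the real scorer and answer model (both hypothetical).
def toy_score(query: str, passage: str) -> int:
    return int("Germany" in passage)

def toy_generate(query: str, passages: List[str]) -> str:
    return f"Answer based on {len(passages)} passage(s)."

retrieved = ["Berlin is the capital of Germany.", "Paris is the capital of France."]
print(answer("capital of Germany", retrieved, toy_score, toy_generate))
print(answer("how to git commit", ["The Nile is the longest river."], toy_score, toy_generate))
```

The first call generates from the single surviving passage; the second returns the refusal string because the filter dropped everything.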
How to load it (Python)
import torch
from tokenizers import Tokenizer
from bleeding_edge.model.decoder import FinanceDecoder, DecoderConfig

# Load the tokenizer and checkpoint shipped in this repo.
tok = Tokenizer.from_file("tokenizer.json")
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ckpt = torch.load("pytorch_model.pt", map_location="cpu", weights_only=False)

# Keep only the config keys that DecoderConfig defines.
cfg = DecoderConfig(**{k: v for k, v in ckpt["model_cfg"].items() if k in DecoderConfig.__dataclass_fields__})
cfg.vocab_size = tok.get_vocab_size()
model = FinanceDecoder(cfg).eval()

# Strip the "_orig_mod." prefix that torch.compile adds to parameter names.
state = {k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()}
model.load_state_dict(state, strict=False)

SYSTEM = (
    "You are Q-Retriever. Given a USER query and a CANDIDATE passage, "
    "decide whether the passage is relevant to the query. "
    "Output exactly one character: 1 if relevant, 0 if not relevant. "
    "Refuse to invent relevance: if the passage does not address the query, output 0."
)

def score(query: str, passage: str) -> int:
    """Return 1 if the passage is relevant to the query, else 0."""
    prompt = f"{SYSTEM}\n\nUSER: Q: {query}\n\nPASSAGE:\n{passage}\n\nASSISTANT: "
    ids = tok.encode(prompt).ids
    cur = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        # Greedy pick of the single next token after the prompt.
        nxt = int(torch.argmax(model(cur, return_decision=False).logits[:, -1, :], dim=-1))
    # Any token other than "1" is treated as a refusal.
    return 1 if tok.decode([nxt]).strip() == "1" else 0

print(score("capital of Germany", "Berlin is the capital of Germany."))  # 1
print(score("capital of Germany", "Paris is the capital of France."))    # 0
print(score("how to git commit", "The Nile is the longest river."))      # 0
Architecture (Qovaryx proprietary FinanceDecoder)
- 53.5M parameters
- 12 decoder blocks, d_model = 512, n_head = 8, GQA n_kv_head = 2
- SwiGLU FFN, RoPE positional, RMSNorm
- Multi-token prediction (MTP) auxiliary heads
- Decision head for routed-decision tasks
- Tokenizer: Qovaryx english_v1 BPE, vocab 32000 (in-house)
- Pretrained from qovaryx-50m-scratch-base step 60000 (491.5M tokens)
- Full fine-tune (no LoRA, no QLoRA, no adapter): every parameter was updated on the Qovaryx Q-RAG crystal corpus
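To show what GQA with n_kv_head = 2 buys at this size, here is a back-of-the-envelope sketch using the figures listed above. It assumes head_dim = d_model / n_head and bias-free projections, neither of which the card states, so treat the numbers as illustrative rather than as the actual FinanceDecoder layout.

```python
d_model, n_head, n_kv_head, n_layers = 512, 8, 2, 12

# Assumption: the model dimension is split evenly across query heads.
head_dim = d_model // n_head   # 64
kv_dim = n_kv_head * head_dim  # 128

def attn_proj_params(kv_heads: int) -> int:
    """Parameters in the Q, K, V, and output projections of one block (no biases)."""
    q = d_model * d_model
    k = v = d_model * kv_heads * head_dim
    o = d_model * d_model
    return q + k + v + o

gqa_params = attn_proj_params(n_kv_head)  # grouped-query attention
mha_params = attn_proj_params(n_head)     # full multi-head attention, for comparison

# Values cached per token across all layers (keys + values).
gqa_cache = 2 * n_layers * kv_dim
mha_cache = 2 * n_layers * d_model

print(gqa_params, mha_params)  # 655360 1048576
print(gqa_cache, mha_cache)    # 3072 12288
```

Under these assumptions, sharing 2 key/value heads across 8 query heads cuts the per-token KV cache by 4× relative to full multi-head attention.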
What this model is NOT
- Not a sentence embedding model. No vector output. Use it after your dense retriever, not instead.
- Not a general-purpose chatbot. Free-text generation outside the relevance-scoring task surface will degrade.
- Not the top BEIR scorer: bge-small-en-v1.5 is 3.6 points ahead on BEIR. If your retrieval is exclusively medical/scientific OOD, run that baseline.
- Not reproducible from this card. Weights, holdouts, and benchmark numbers are public; the crystal corpus generator and training hyperparameters are not.
License & posture
Apache 2.0 for the published weights, model card, holdouts, and benchmark JSONs.
The Qovaryx scratch base build pipeline, the Q-RAG crystal corpus generator, the eval gate constants, the cluster routing policy, and the protected runtime entrypoint are Qovaryx proprietary technology and are not included.
Reproduction & artifacts in this repo
- pytorch_model.pt: Q-RAG weights (v10, 205 MB)
- tokenizer.json: Qovaryx english_v1 BPE
- config.json: model config
- holdout_eval.json: full per-row in-house holdout result (30/30 = 100%)
- benchmark_vs_embeddings.json: in-house holdout vs 10 baselines (Q-RAG #1)
- benchmark_beir.json: BEIR NFCorpus+SciFact slice vs same baselines
- Reproduction scripts: scripts/benchmark_q_rag_vs_embeddings.py and scripts/benchmark_q_rag_vs_rerankers_beir.py in the upstream research repo
Sibling specialists in the Qovaryx Compact Specialist Suite
All ten specialists share the qovaryx-50m-scratch-base and the same audit discipline. Use one directly; use all ten through the cluster shell.
- Q-Triage: ticket routing
- Q-DocCite: document citation
- Q-Invoice: invoice extraction
- Q-ToolCall: agent tool-calls
- Q-Meeting: meeting structuring
- Q-FinCite: 10-K/10-Q citation
- Q-CmdSafe: command safety triage
- Q-SheetExtract: spreadsheet extraction
- Q-Coder: Python code skeletons
- Q-RAG (this model): relevance filter for RAG
Reproduction invitation
If you run Q-RAG against a model not in our table (Cohere Rerank, Voyage Rerank, jina-reranker-v2, ColBERT, or anything else), please open a discussion on this repo with the numbers. We'll add it to the card, honestly, whichever direction the result falls. The benchmark script + holdouts are in this repo.
Official site & community
The full Qovaryx runtime that orchestrates this specialist alongside the other nine ships from:
- Site: https://qovaryx.jehorizon.com
- Download (desktop beta): https://qovaryx.jehorizon.com/download.html
- Research devlog: https://qovaryx.jehorizon.com/research
- Community Discord: https://discord.gg/PtuHZDv5ju
- Ko-fi (we cover GPU bills): https://ko-fi.com/tjarvis91
- Open research repo: https://github.com/thron-j/qovaryx-ai-research
If you find a failure mode this card doesn't cover, open a discussion or come to the Discord; that's how the next crystal corpus gets written.