Cascade Verifier v1 (Nadir)
A fine-tuned DeBERTa-v3-small that scores whether a cheap LLM's answer to a prompt is good enough to ship. Used as the gating signal in Nadir's verifier-gated 2-tier cascade router.
This is the frozen snapshot used to produce Nadir's RouterArena submission:
| Metric | Value |
|---|---|
| arena_F | 0.7358 |
| Accuracy | 0.7518 |
| Cost / 1k queries | $0.2986 |
| Rank | #1 (main split) |
Variants
| File | Size | Precision | Used for |
|---|---|---|---|
| model.safetensors | ~541 MB | FP32 | Reference, fine-tuning |
| verifier_int8.pt | ~418 MB | INT8 quantized | Production inference (CPU) |
The INT8 variant runs on CPU at ~60 ms per (prompt, answer) pair on a modern laptop once warm. That's what our production cascade at api.getnadir.com uses.
Usage
Easiest path: use the open-source router
pip install nadirclaw[trained]
export NADIRCLAW_TIERS_PROFILE=n2_trained
NadirClaw wires the verifier into the full 2-tier cascade automatically. The default threshold is tau = 0.80 (escalate if the verifier scores the cheap answer below 0.80).
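The gating rule can be sketched as a small function. This is a minimal illustration under stated assumptions, not NadirClaw's actual implementation: `cheap_model`, `strong_model`, and `verifier_score` are hypothetical callables standing in for the two tiers and the verifier. Only the threshold tau = 0.80 and the "escalate below tau" rule come from the text.

```python
from typing import Callable, Tuple

TAU = 0.80  # default threshold stated in the text


def cascade(
    prompt: str,
    cheap_model: Callable[[str], str],            # hypothetical cheap-tier call
    strong_model: Callable[[str], str],           # hypothetical strong-tier call
    verifier_score: Callable[[str, str], float],  # hypothetical: (prompt, answer) -> score in [0, 1]
    tau: float = TAU,
) -> Tuple[str, str]:
    """Return (answer, tier), escalating when the cheap answer scores below tau."""
    cheap_answer = cheap_model(prompt)
    if verifier_score(prompt, cheap_answer) < tau:
        # Cheap answer did not clear the gate: pay for the strong tier.
        return strong_model(prompt), "strong"
    return cheap_answer, "cheap"
```

A score exactly equal to tau is kept on the cheap tier here, since the stated rule escalates only when the score is below the threshold.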
Or load directly
This is a sequence-pair classifier, not a single-text scorer. Pass the prompt as text and a structured cheap/expensive block as text_pair. The output is a softmax over 2 classes; the score is P(class=1) = "cheap answer is acceptable."
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model = AutoModelForSequenceClassification.from_pretrained("nadirclaw/cascade-verifier-v1")
tokenizer = AutoTokenizer.from_pretrained("nadirclaw/cascade-verifier-v1")
model.eval()
prompt = "What is the capital of France?"
cheap_answer = "Paris is the capital of France."
reference_answer = "" # optional; empty string when running cheap-only
inputs = tokenizer(
    text=prompt,
    text_pair=f"CHEAP:\n{cheap_answer}\n\nEXPENSIVE:\n{reference_answer}",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits[0]
probs = torch.softmax(logits, dim=-1)
score = float(probs[1]) # in [0, 1]; class 1 = "cheap is acceptable"
print(f"verifier_score = {score:.3f}")
Higher score means "this cheap answer is good enough." In our production cascade, scores below 0.80 trigger escalation to a stronger model.
For an INT8-quantized CPU build that matches production latency (~60 ms per call), apply dynamic quantization after loading:
torch.backends.quantized.engine = "qnnpack" # or "fbgemm" on x86
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
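To compare a local build against the quoted warm latency, one option is to time repeated calls after discarding a few warm-up runs. The helper below is a generic sketch using only the standard library. `median_latency_ms` and its parameters are illustrative names, not part of NadirClaw or PyTorch.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # discard warm-up calls so first-call overhead is not measured
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

With the quantized model, `fn` could be something like `lambda: model(**inputs)`, called under `torch.no_grad()`. Results will vary with hardware and quantization backend.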
INT8 variant (production)
import torch
state = torch.load("verifier_int8.pt", map_location="cpu")
# state is a quantized DebertaV2ForSequenceClassification; load into the
# matching architecture per torch.ao.quantization conventions.
For the full INT8 loading flow including the dynamic quantization wrapper, see nadirclaw/trained_verifier.py in the NadirClaw repository.
Input format
This is a sequence-pair classifier. Two segments:
- Segment A (text): the original prompt.
- Segment B (text_pair): the cheap/expensive answer block.
Segment B template:
CHEAP:
{cheap_answer}

EXPENSIVE:
{reference_answer}
When no reference answer is available (the default in our production cascade because the strong-tier model hasn't been called yet), pass an empty string for {reference_answer}. The verifier still produces a calibrated score; it's just more conservative without a reference.
Truncate the joined input to 512 tokens.
Output: softmax over 2 classes. Use P(class=1) as the acceptance score in [0, 1]. Threshold at tau = 0.80 to match the production deployment.
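The output step can be illustrated without any framework. The sketch below takes two raw logits in the order [reject, accept], applies a numerically stable softmax, and compares P(class=1) against tau. The function names are illustrative only.

```python
import math
from typing import Sequence


def acceptance_score(logits: Sequence[float]) -> float:
    """P(class=1) from two raw logits, via a numerically stable softmax."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [reject, accept]")
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def is_acceptable(logits: Sequence[float], tau: float = 0.80) -> bool:
    """True when the cheap answer clears the threshold (no escalation)."""
    return acceptance_score(logits) >= tau
```

For two classes the softmax reduces to a sigmoid of the logit difference, so a score of at least 0.80 corresponds to the accept logit exceeding the reject logit by at least ln 4 (about 1.386).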
Production wrapper signature (in NadirClaw and the Pro backend):
verifier.score(prompt: str, cheap_answer: str, reference_answer: str | None = None) -> float
When reference_answer is None, the wrapper substitutes an empty string before tokenization.
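A wrapper with this signature might look roughly like the following. This is a hypothetical stand-in, not the code in NadirClaw: `VerifierSketch`, `build_text_pair`, and the injected `pair_scorer` callable are names introduced here for illustration. `pair_scorer` is assumed to wrap the tokenizer and model call and return P(class=1).

```python
from typing import Callable, Optional


def build_text_pair(cheap_answer: str, reference_answer: Optional[str] = None) -> str:
    """Format segment B with the CHEAP/EXPENSIVE template; None becomes ""."""
    reference = "" if reference_answer is None else reference_answer
    return f"CHEAP:\n{cheap_answer}\n\nEXPENSIVE:\n{reference}"


class VerifierSketch:
    """Hypothetical stand-in for the production wrapper, for illustration only."""

    def __init__(self, pair_scorer: Callable[[str, str], float]):
        # pair_scorer(text, text_pair) -> P(class=1); assumed to wrap the
        # tokenizer + model call from the direct-loading example.
        self._pair_scorer = pair_scorer

    def score(self, prompt: str, cheap_answer: str,
              reference_answer: Optional[str] = None) -> float:
        return self._pair_scorer(prompt, build_text_pair(cheap_answer, reference_answer))
```

Injecting the scorer keeps the formatting logic testable without loading model weights.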
Architecture
- Base model: microsoft/deberta-v3-small (6 layers, 768 hidden, 128k SentencePiece vocab)
- Head: 2-class sequence-pair classifier (softmax over {reject, accept})
- Score: P(accept) = softmax probability of class 1
- Training data: RouterBench-derived pairs labeled by a separate quality judge
- Quantization: Dynamic INT8 via torch.ao.quantization.quantize_dynamic on Linear layers only
- Production threshold: tau = 0.80 (escalate if score < 0.80)
- Measured latency (M-series Mac, qnnpack, warm): ~60 ms per (prompt, cheap, reference) call after first-call compile
What's NOT released
The reproducible inference path is open. The build pipeline behind it is not:
- Training pipeline. RouterBench-derived labeling + fine-tuning loop. Proprietary.
- Adaptive retraining. Live verifier updates from user traffic. Proprietary.
- Semantic cache + production routing infrastructure. Proprietary.
These make up the value of Nadir Pro, which runs the same architecture but with a verifier that improves every week.
Reproducing the RouterArena number
# 1. Install the open-source router
pip install nadirclaw[trained]
# 2. Use the n2_trained profile (matches the submission)
export NADIRCLAW_TIERS_PROFILE=n2_trained
# 3. The router's wide_deep_asym_v3 classifier (bundled) + this verifier
# reproduces the routing decisions in the RouterArena prediction file.
Expect a reproducibility error bar of roughly ±0.005 vs the published 0.7358. The drift comes from local cache state (which cached model wins the cheapest-pick) and from verifier scoring against actual cheap responses vs the snapshot's proxy. Rank does not change inside that band.
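A local run can be checked against that band with a one-line comparison. The constants below come from the text. The function name is illustrative, and the tolerance is the approximate figure quoted above rather than a formal confidence interval.

```python
PUBLISHED_ARENA_F = 0.7358  # published score
TOLERANCE = 0.005           # approximate error bar quoted in the text


def within_repro_band(measured: float,
                      published: float = PUBLISHED_ARENA_F,
                      tol: float = TOLERANCE) -> bool:
    """True when a locally measured arena_F falls inside the expected band."""
    return abs(measured - published) <= tol
```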
Limitations
- English-only. RouterBench is English. Untested on multilingual prompts.
- Question-answer format. The verifier was trained on prompt-answer pairs from RouterBench. Performance on free-form chat with multiple turns is less reliable.
- Frozen. This snapshot does not update. Quality will likely degrade over months as new LLMs are released and their answers drift away from the training distribution. For a continuously retrained verifier, see Nadir Pro.
Release notes
v1.0.0 (2026-05-29)
- Initial public release.
- FP32 (model.safetensors) and INT8 (verifier_int8.pt) variants.
- Frozen snapshot used in RouterArena PR #112 (arena_F 0.7358, rank #1 main split).
- training_args.bin deliberately excluded to avoid leaking training-environment paths and hyperparameters.
Future updates will be released as cascade-verifier-v2, v3, ... rather than overwriting these weights. The v1 snapshot is permanent for reproducibility.
Citation
If you use this verifier in research, please cite:
@misc{nadir2026cascade,
title = {Verifier-Gated Cascade Routing for LLMs},
author = {Nadir Team},
year = {2026},
url = {https://github.com/NadirRouter/NadirClaw},
}
License
MIT. See the LICENSE file in the NadirClaw repository.
Contact
- Website: getnadir.com
- GitHub: NadirRouter
- Email: info@getnadir.com