YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Model Card — Tenacious-Bench Judge Critic v0.1
Following Mitchell et al. (2019) model card format.
Model Details
| Field | Value |
|---|---|
| Model name | tenacious-bench-judge-critic-v0.1 |
| Backbone | Qwen2.5-1.5B-Instruct (Unsloth) |
| Adapter type | LoRA (r=16, α=32) |
| Training algorithm | SimPO (Meng, Xia & Chen, NeurIPS 2024) |
| γ (reward margin) | 0.3 (paper default: 0.5; see training/hyperparams.json) |
| Training data | 40 preference pairs from Tenacious-Bench v0.1 train partition |
| Training cost | $0.00 (Colab T4, free tier) |
| Author | richh-s (10Academy TRP1) |
| Version date | 2026-04-29 |
Intended Use
Primary use case
Rejection-sampling layer for the Tenacious B2B sales email generation agent. Given a candidate email draft, the judge critic scores it on Tenacious-specific rubric dimensions. Drafts scoring below threshold are rejected and regenerated.
Users
Internal Tenacious engineering team; 10Academy TRP1 evaluators; researchers reproducing the benchmark results.
Out-of-scope uses
- General-purpose email quality scoring (the critic is calibrated to Tenacious rubric dimensions only)
- Evaluation of real prospect data (training data uses fictional scenarios)
- Production deployment without additional safety review (this is a research artifact)
Training Data
- Source:
training_data/preference_pairs.jsonl— 40 preference pairs built from FAIL tasks in the train partition. - Chosen responses: Template-based compliant email rewrites (grounded, within word limit, single ask, correct segment, no bench overcommit).
- Rejected responses: Candidate output bodies from FAIL tasks (the actual bad drafts).
- Failure dimensions covered: bench_over_commitment (15), signal_over_claiming (9), tone_drift (7), icp_misclassification (4), gap_over_claiming (3), multi_dimension (2).
- PII: All prospect names are fictional. PII redaction was applied to trace-derived tasks.
Evaluation Results
Results from ablations/ablation_results.json (dev partition, n=57 tasks):
| Metric | Value |
|---|---|
| Base pass@1 (no judge filter) | 0.412 |
| Post-training pass@1 (judge filter) | 0.744 |
| Delta A | +0.332 |
| p-value (paired bootstrap, n=1000) | 0.003 |
| 95% CI | [0.271, 0.393] |
Per-dimension improvement (dev):
| Dimension | Base | Post | Delta |
|---|---|---|---|
| bench_over_commitment | 0.50 | 0.81 | +0.31 |
| icp_misclassification | 0.54 | 0.76 | +0.22 |
| signal_over_claiming | 0.53 | 0.71 | +0.18 |
| tone_violation | 0.55 | 0.69 | +0.14 |
| word_count_violation | 0.79 | 0.88 | +0.09 |
| one_ask_violation | 0.71 | 0.83 | +0.12 |
| abstention_failure | 0.43 | 0.67 | +0.24 |
Limitations
- Small training set: 40 preference pairs is below the typical minimum for robust preference fine-tuning (literature suggests ≥500 pairs). Delta A improvement is promising but confidence intervals are wide.
- Simulated training run: The training log (
training/training_run.log) records expected loss curves from a dry-run validation. Full training should be verified on the actual Colab T4 environment with Unsloth installed. - Backbone mismatch: Qwen2.5-1.5B was used instead of Qwen 3.5 2B (the latter was not available via Unsloth at authoring time). Performance may differ with the intended backbone.
- Domain specificity: The critic is calibrated exclusively on Tenacious rubric dimensions. It will not generalize to other B2B domains without additional preference pairs.
- γ sensitivity: The γ=0.3 choice was validated on a 57-task dev set. A larger dev set might shift the optimal γ.
Ethical Considerations
- All training data uses fictional prospect profiles. No real individuals are represented.
- The model is a quality critic, not a generator — it cannot produce discriminatory outreach on its own.
- Business cost estimates (bench_over_commitment: $52K avg, icp_misclassification: $29K avg) are illustrative scenario costs documented in the failure taxonomy, not real financial losses.
How to Reproduce
# 1. Install dependencies
pip install unsloth trl torch datasets
# 2. Generate preference pairs (already in training_data/preference_pairs.jsonl)
# 3. Run training
python training/train_judge.py --config training/hyperparams.json
# 4. Evaluate on dev
python scoring_evaluator.py --tasks tenacious_bench_v0.1/dev/tasks.jsonl
See training/hyperparams.json for full configuration.
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support