Tenacious-Bench v0.1
Tenacious-Bench v0.1 is a custom benchmark for evaluating a Tenacious-style B2B sales agent. It is built around actual workflow failures observed during development, focusing on high-stakes outreach decisions where grounding and tone are critical.
Dataset Description
- Repository: [link-to-your-repo]
- Leaderboard: N/A
- Point of Contact: [your-name/org]
Dataset Summary
A 200-task benchmark for evaluating grounded, low-hype Tenacious-style B2B outreach decisions and messages. Unlike generic email datasets, Tenacious-Bench prioritizes:
- Signal Grounding: Does the message actually use the provided evidence?
- Action Correctness: Choosing correctly between
send,exploratory_send,abstain, andreview. - Tone Adherence: Avoiding "hypy" or generic template language.
- Segment Fit: Ensuring the CTA and value proposition match the lead's specific context.
Supported Tasks and Leaderboards
The dataset is intended for evaluating LLMs acting as sales agents or critics. Evaluation is performed using the rule-based scoring_evaluator.py included in the repository.
Dataset Structure
Data Instances
Each instance consists of a lead context, hiring signals, and a company profile, paired with the expected action and a detailed rubric for scoring.
Data Fields
task_id: Unique identifier for the task.input: The context provided to the model.expected_behavior: The ground-truth action and message.rubric: Specific rules used by the evaluator.metadata: Additional context includingfamily_idfor contamination-safe splitting.
Data Splits
The dataset is partitioned into three splits using a family-based strategy to prevent pattern leakage:
- Train: 100 tasks
- Dev: 60 tasks
- Held-out: 40 tasks
Dataset Creation
Curation Rationale
Generic LLM evaluations often miss the nuance of B2B sales (e.g., when not to send an email). This benchmark was created to bridge that gap.
Source Data
- Trace-derived: 60 tasks converted from real Week 10 workflow traces.
- Programmatic: 60 tasks generated to provide balanced coverage across actions.
- Multi-LLM Synthesis: 50 tasks generated using an ensemble of models.
- Manual: 30 high-adversarial tasks hand-authored by domain experts.
Considerations for Using the Data
Discussion of Biases
The trace-derived portion of the dataset reflects the specific distribution of failures seen in the Tenacious project's early phases.
Other Known Limitations
The benchmark is relatively small (200 tasks) and focused on specific B2B outreach patterns.
Additional Information
Licensing Information
MIT License
Citation Information
Please cite the Tenacious-Bench repository if you use this dataset in your research.