Did we just drop personalized AI evaluation?! This tool auto-generates custom benchmarks from your docs to test which models work best on them.
Most benchmarks test general capabilities, but what matters is how models handle your data and tasks. YourBench helps answer critical questions like:
- Do you really need a model with hundreds of billions of parameters, or is that a sledgehammer to crack a nut?
- Could a smaller, fine-tuned model work better?
- How well do different models understand your domain?
Some cool features:
- Generates custom benchmarks from your own documents (PDFs, Word, HTML)
- Tests models on real tasks, not just general capabilities
- Supports multiple models for different pipeline stages
- Generates both single-hop and multi-hop questions
- Evaluates top models and deploys leaderboards instantly
- Full cost analysis to optimize for your budget
- Fully configurable via a single YAML file
26 SOTA models were tested for question generation. Interesting finding: Qwen2.5 32B leads in question diversity, while smaller Qwen models and Gemini 2.0 Flash offer great value for the cost.
You can also run it locally on any models you want.
I'm impressed. Try it out: yourbench/demo