fdaudens
posted an update 6 days ago
Did we just drop personalized AI evaluation?! This tool auto-generates custom benchmarks from your docs so you can test which models work best on them.

Most benchmarks test general capabilities, but what matters is how models handle your data and tasks. YourBench helps answer critical questions like:
- Do you really need a hundreds-of-billions-parameter model sledgehammer to crack a nut?
- Could a smaller, fine-tuned model work better?
- How well do different models understand your domain?

Some cool features:
📚 Generates custom benchmarks from your own documents (PDFs, Word, HTML)
🎯 Tests models on real tasks, not just general capabilities
🔄 Supports multiple models for different pipeline stages
🧠 Generates both single-hop and multi-hop questions
🔍 Evaluates top models and deploys leaderboards instantly
💰 Full cost analysis to optimize for your budget
🛠️ Fully configurable via a single YAML file

26 SOTA models tested for question generation. Interesting finding: Qwen2.5 32B leads in question diversity, while smaller Qwen models and Gemini 2.0 Flash offer great value for cost.
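
To make the "value for cost" idea concrete, here is a minimal Python sketch of one way to act on a cost analysis: pick the cheapest model whose score is close to the best. This is not YourBench code or its API. The `ModelResult` class, the `cheapest_within` helper, the model names, and all the numbers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    score: float        # benchmark accuracy in [0, 1] (placeholder values)
    cost_per_1k: float  # USD per 1,000 questions (placeholder values)


def cheapest_within(results, tolerance=0.02):
    """Return the cheapest model whose score is within `tolerance` of the best."""
    best = max(r.score for r in results)
    good_enough = [r for r in results if r.score >= best - tolerance]
    return min(good_enough, key=lambda r: r.cost_per_1k)


# Made-up numbers for illustration only; not YourBench output.
results = [
    ModelResult("large-model", 0.91, 12.00),
    ModelResult("mid-model", 0.90, 2.50),
    ModelResult("small-model", 0.82, 0.40),
]

pick = cheapest_within(results)
print(pick.name)
```

Widening `tolerance` trades accuracy for savings. With these placeholder numbers, a tolerance of 0.1 would select the smallest model instead.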

You can also run it locally on any models you want.

I'm impressed. Try it out: yourbench/demo