salma-remyx 
posted an update 12 days ago
In that benchmark comparison, do you even have the sample size to distinguish two models, or are you making decisions based on statistical noise?

"Resolution Diagnostics for Paired LLM Evaluation" offers a simple check: a per-pair resolution ratio q = N/N* that flags when a displayed ranking sits below the resolution floor regardless of p-value.
arXiv: https://arxiv.org/abs/2605.30315v1
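
To make the ratio concrete, here is a minimal sketch of how q = N/N* could be computed for two models scored on the same items. The post does not define N*, so this sketch assumes it is the number of paired items a two-sided paired z-test would need to detect a chosen minimum score gap at a given significance level and power. The function name `resolution_ratio` and the parameters `min_effect`, `alpha`, and `power` are hypothetical and are not taken from the paper or from the PR.

```python
from statistics import NormalDist, stdev


def resolution_ratio(scores_a, scores_b, min_effect, alpha=0.05, power=0.8):
    """Return q = N / N* for two models scored on the same N items.

    ASSUMPTION: N* is the number of paired items a two-sided paired z-test
    needs to detect a mean score gap of `min_effect` at the given alpha and
    power. The paper may define N* differently.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired evaluation needs one score per item per model")
    if len(scores_a) < 2:
        raise ValueError("need at least two paired items")
    if min_effect <= 0:
        raise ValueError("min_effect must be positive")

    # Per-item score differences; pairing removes shared item difficulty.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sigma_d = stdev(diffs)
    if sigma_d == 0:
        # Every item shows the same gap, so no extra samples are needed.
        return float("inf")

    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n_star = (z_total * sigma_d / min_effect) ** 2
    return len(diffs) / n_star


# Hypothetical per-item correctness (1 = correct, 0 = wrong) on 100 items.
model_a = [1] * 60 + [0] * 40
model_b = [1] * 50 + [0] * 50
q = resolution_ratio(model_a, model_b, min_effect=0.05)
print(f"q = {q:.2f}", "(below the floor)" if q < 1 else "(resolvable)")
```

Under this assumed definition, q < 1 means the benchmark has fewer paired items than the chosen gap requires, so a displayed ranking at that gap should be read with caution.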

Outrider automatically matched this paper to our fork of lm-evaluation-harness and opened a PR implementing the diagnostic.

Configure the action to find new methods tailored to your repo: https://github.com/remyxai/outrider

OK, so you have code for a brilliant financial app, but Claude put Trojan horses in all of it. Debug it, put the Trojans in the box, and we will give it back to him.

Please debug the code, and can I call you Cassie?
