Feature Request: Running Evaluation Locally for Custom Models // Provide Documentation for 3rd Party Implementation

#787
by Avelina - opened

This is a feature request for open-sourcing the evaluation code so that models containing custom code can be evaluated locally.

This would be really useful for the research community: it would let us benchmark custom models and report results in research papers or model cards, which is currently not possible for models that require trust_remote_code=True or models built on other frameworks.

The Open LLM Leaderboard is meant to be open, and it does make sense that only models supported by a stable version of the transformers library are shown on the leaderboard. However, it does not make sense to have no way to compute these results ourselves for more exotic models. Sure, we can implement the benchmarks for the included tasks ourselves, but things such as few-shot testing and metrics such as MC2 for TruthfulQA (which relies on log-prob aggregation that may or may not be normalized by token count or character count) are likely to be implemented differently by an external developer than in the 'de facto' leaderboard implementation.
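To make the ambiguity concrete, here is a minimal, hypothetical sketch of MC2-style scoring. The function names, the normalization modes, and the toy numbers are all my own for illustration; they are not the leaderboard's actual implementation. The point is only that the same per-token log-probs give different results depending on how they are aggregated.

```python
# Hypothetical illustration of the log-prob aggregation ambiguity:
# raw sum vs. per-token vs. per-character normalization.
import math

def choice_score(token_logprobs, text, normalization="none"):
    """Aggregate per-token log-probs for one answer choice."""
    total = sum(token_logprobs)
    if normalization == "none":          # raw sum of log-probs
        return total
    if normalization == "per_token":     # normalized by token count
        return total / len(token_logprobs)
    if normalization == "per_char":      # normalized by character count
        return total / len(text)
    raise ValueError(f"unknown normalization: {normalization}")

def mc2_score(true_choices, false_choices, normalization="none"):
    """MC2-style score: probability mass assigned to the true answers."""
    scores = [choice_score(lp, txt, normalization)
              for lp, txt in true_choices + false_choices]
    probs = [math.exp(s) for s in scores]
    true_mass = sum(probs[: len(true_choices)])
    return true_mass / sum(probs)

# A short true answer vs. a long false answer: the score (and even the
# ranking) can shift with the normalization choice.
true_choices = [([-0.5, -0.7], "Yes.")]
false_choices = [([-0.2, -0.3, -0.4, -0.5, -0.6], "No, that is a common myth.")]
for mode in ("none", "per_token", "per_char"):
    print(mode, round(mc2_score(true_choices, false_choices, mode), 3))
```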

Even if you decide not to release the evaluation code itself, it would be useful to have detailed documentation describing exactly how the evaluations are performed, including any dataset preprocessing, prompt formatting, how few-shot samples are chosen, and exactly how the metrics are computed.
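As an example of why such documentation matters, here is a small hypothetical sketch of few-shot prompt assembly. Every name and setting in it is assumed for illustration; the questions are whether the seed is fixed, whether shots are drawn per sample or once, and which separators are used, since each choice changes the resulting log-probs.

```python
# Hypothetical few-shot prompt builder; none of these settings reflect the
# leaderboard's actual configuration.
import random

def build_fewshot_prompt(question, train_examples, num_shots=5, seed=1234):
    rng = random.Random(seed)                      # fixed seed? per-sample seed?
    shots = rng.sample(train_examples, num_shots)  # random vs. fixed vs. balanced selection?
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)                     # separator and trailing whitespace both matter

train_examples = [("What is 2+2?", "4"), ("Capital of France?", "Paris"),
                  ("Largest planet?", "Jupiter"), ("H2O is?", "Water"),
                  ("Speed of light?", "~3e8 m/s"), ("Author of Hamlet?", "Shakespeare")]
print(build_fewshot_prompt("Boiling point of water at sea level?", train_examples))
```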

Avelina changed discussion status to closed
