Do we have a plan on posting the evaluation results to `open_llm_leaderboard`

#26
by mpsk - opened

Since most LLMs share their results on open_llm_leaderboard, I think it would be nice to have your results there as well.

Is there a plan for this?

This is on our radar. Are you interested in a specific metric? We performed a comprehensive analysis of the model and reported it in our paper: https://arxiv.org/abs/2309.11568. We ran our evaluations with the EleutherAI evaluation harness, which is what the Open LLM Leaderboard uses on the backend, so the results should be directly comparable.
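
For anyone who wants to reproduce or compare the numbers themselves, both our paper and the leaderboard run EleutherAI's lm-evaluation-harness, so you can call it directly. Below is a minimal sketch assuming a recent `lm-eval` install (`pip install lm-eval`); the `gpt2` model id and the 10-shot HellaSwag setting are illustrative placeholders, not our exact configuration:

```python
# Minimal sketch: run one leaderboard-style task through the EleutherAI harness.
# Assumes lm-eval >= 0.4 (pip install lm-eval); "gpt2" is a stand-in model id.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                        # Hugging Face transformers backend
    model_args="pretrained=gpt2",      # swap in the model you want to compare
    tasks=["hellaswag"],               # the leaderboard runs HellaSwag 10-shot
    num_fewshot=10,
    batch_size=8,
)
print(results["results"]["hellaswag"])  # per-metric scores, e.g. acc / acc_norm
```

Matching the leaderboard's pinned few-shot settings per task (e.g. 25-shot ARC, 10-shot HellaSwag) is what makes the resulting numbers directly comparable.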

We are actively choosing the right models for our applications, and we find your work really attractive. However, we noticed that the leaderboard results for some models, for example stablelm-3b, diverge significantly from the numbers in their original technical reports. Do you have any idea why?

On the other hand, we need a fair benchmark platform that evaluates all LLMs under one unified framework, and open_llm_leaderboard seems like a good option. Again, thanks for the great work.

@mpsk I cannot speak for the stablelm folks, but if you do find a misalignment with our paper, please let me know. I totally agree about a unified framework, but the leaderboard does not yet seem to cover a lot of the useful tasks that we (and other papers) evaluate. Hopefully, we can all contribute to it to make it more useful in the future :)
