Is inference run improperly for CohereForAI/c4ai-command-r-plus?

#666
by jonathanli - opened

I noticed that CohereForAI/c4ai-command-r-plus gets horrible performance (essentially random guessing) on all the benchmarks; it's a 104-billion-parameter model that shouldn't be performing this poorly.

Is the inference done correctly here?

EDIT: According to the model card at https://huggingface.co/CohereForAI/c4ai-command-r-plus, the results should be much better (a reported MMLU of 75.7).
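
For reference, this is roughly how I'd try to reproduce the MMLU number locally with lm-evaluation-harness. A minimal sketch, assuming v0.4+ of the harness, a transformers build that actually supports the Cohere architecture, and enough GPU memory for a 104B checkpoint; the exact `model_args` below are illustrative, not prescriptive:

```python
# Sketch: run 5-shot MMLU with lm-evaluation-harness
# (https://github.com/EleutherAI/lm-evaluation-harness, v0.4+ assumed).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=CohereForAI/c4ai-command-r-plus,"
        "dtype=bfloat16,parallelize=True"  # illustrative multi-GPU settings
    ),
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=1,
)

# A healthy run should land near the model card's reported 75.7,
# not at the ~25% floor of random 4-way guessing.
print(results["results"]["mmlu"])
```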

Open LLM Leaderboard org

Hi!
Since this model is very important to the community, we ran the evaluations manually, using a version of transformers built from main, which includes the modeling code for this architecture.
However, our backend has not yet been updated to that transformers version (we are waiting for the next stable release, which will include this support). When users submit this specific model, it therefore runs on a transformers version that does not include the CohereForAI/c4ai-command-r-plus modeling code: the weights are initialized randomly, and the evaluation results are random too.
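
If you want to verify whether a given transformers install can handle this architecture before running a full evaluation, a quick sketch (the exact exception type can vary across transformers versions, hence the broad catch):

```python
# Quick check: can the installed transformers resolve the "cohere"
# model_type declared in this checkpoint's config.json? On a build
# that predates Cohere support, AutoConfig fails instead of loading
# the architecture.
import transformers
from transformers import AutoConfig

print("transformers version:", transformers.__version__)

try:
    config = AutoConfig.from_pretrained("CohereForAI/c4ai-command-r-plus")
    print("Architecture resolved:", type(config).__name__)  # expect CohereConfig
except (KeyError, ValueError) as err:
    print("This transformers build does not support the model:", err)
```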

I just removed the faulty results from our front end; the correct results should appear once it's rebuilt. I also added a check to prevent people from submitting this model again before we update the backend.

clefourrier changed discussion status to closed
