Questionable results for the current top models

#642
by ammarali32 - opened

Hello!
I would like to confirm the method you use to evaluate adapters when they are submitted, because:
1- I cloned the current top 2 models on the LB and merged them with the original base model (a sketch of this step follows below).
2- I ran the local evaluation as suggested on the /About/ tab, using this version of the EleutherAI Harness.
3- The results are much lower than those presented on the LB; for example, TruthfulQA mc2 is 62 locally while it is 79 on the LB.
Therefore, either you are using a different script or there is a bug somewhere when testing adapters.
Thanks for this great LB ))
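
For reference, a minimal sketch of the merge step, assuming the submission is a LoRA adapter on the Hub; the repo ids below are hypothetical placeholders, not the actual leaderboard entries:

```python
# Minimal sketch: merge a LoRA adapter into its base model so the
# EleutherAI harness can evaluate it like any full model.
# BASE and ADAPTER are hypothetical placeholder repo ids.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "org/base-model"       # hypothetical: the base model named in the adapter's card
ADAPTER = "org/lora-adapter"  # hypothetical: the submitted adapter

# Load the base model, apply the adapter, and fold the LoRA weights in.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

# Save the merged checkpoint so it can be passed to the harness as a plain model.
merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained(BASE).save_pretrained("./merged-model")
```

The merged folder can then be pointed at by the harness's `main.py` (in that era of the harness the task name was `truthfulqa_mc`, run 0-shot on the LB); the exact command given on the /About/ tab takes precedence over this sketch.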

ammarali32 changed discussion status to closed

@ammarali32 I'm just an LLM user and have never made one myself, but I'm curious whether you figured out what was happening.

After testing about a dozen LLMs with high TruthfulQA scores, I discovered that they weren't any more performant. In fact, they performed worse at retrieving esoteric knowledge and solving simple logic problems. So my theory is that high TruthfulQA scores like 79 are just illusions caused by either contamination or stubbornness. That is, they all stuck to egregious logical and factual errors, and then started fabricating reasons why they were right (e.g. said actor was cast in the role, but dropped out due to prior obligations). So my theory is that by merging the LLMs together, and with their base model, you removed some of the stubbornness, causing the artificially high TruthfulQA score to collapse back down a bit.

Note: The staff is periodically away for days at a time, including recently, which is why they haven't responded yet.

@Phil337 Hi! Yeah, I was able to confirm that they reported a misleading base model in the model card (https://huggingface.co/moreh/MoMo-72B-LoRA-V1.4).
In reality, they were using another model, so I closed the issue ))
