[FLAG?] Tigerbot-70b-chat-v2 scores are too high.

#414
by TNTOutburst - opened

Tigerbot-70b-chat has a suspiciously high ARC, and Tigerbot-70b-chat-v2 has a ludicrously high ARC and TruthfulQA. This is causing Tigerbot-70b-chat-v2 to be #1 on the leaderboard when it probably shouldn't be.

Any reason for it to be removed other than suspicion though?

I mean, I don't have evidence, but a finetune shouldn't be able to have that much of an advantage over other finetunes of the same model. Also, since it seems they don't like taking models off the leaderboard, just having the "has been flagged" text on it like some of the other models do is good enough. I just think it's misleading to users looking for the best model to have ones obviously trained on the test data.

@TNTOutburst I didn't want to chime in because I haven't tested the model personally due to its large size, but that was certainly the largest gain I've ever seen between a foundational model and a fine-tuned chat version, especially on the ARC test.

But since this is the official chat version from Tigerbot, it's likely just an inadvertent contamination issue, unless they found a way to improve its problem-solving abilities. The ARC test questions are easy and only require basic knowledge, so the problem-solving improvement wouldn't have to be all that pronounced to drastically increase the score.

Edit: I noticed that they already released chat v4 and it's in the evaluation queue, so they may have fixed the issue.

Open LLM Leaderboard org
edited Dec 4, 2023

Hi, thank you for this discussion!

We have a flagging mechanism, but we usually need more concrete evidence. Could you open a discussion on the model repo and ask the authors if they have an idea about this discrepancy?

clefourrier changed discussion title from Tigerbot-70b-chat-v2 scores are too high. to [FLAG?] Tigerbot-70b-chat-v2 scores are too high.

@TNTOutburst I'm glad you looked into it. LLM ARC scores have remained frustratingly low considering how easy the test is, and how well GPT3.5 & GPT4 do on it. I was hoping Tigerbot made a breakthrough, but since the ARC score is coupled with a high TruthfulQA score, an error like contamination seems more probable.

@TNTOutburst Look at the new chat v4 of Tigerbot 70b. 98 on ARC & 90 on TruthfulQA. I'm all for letting people who release foundational models do whatever they want with their chat versions, but this is embarrassing.

We can presume that it is contaminated as well.

Open LLM Leaderboard org

Hi! Model has been removed per author request, as they'll investigate contamination issues more in depth before resubmitting!
Closing this issue :)

clefourrier changed discussion status to closed
