[FLAG?] Tigerbot-70b-chat-v2 scores are too high.

#414

by TNTOutburst - opened Dec 3, 2023

Discussion

TNTOutburst

Dec 3, 2023

Tigerbot-70b-chat has a suspiciously high ARC, and Tigerbot-70b-chat-v2 has a ludicrously high ARC and TruthfulQA. This is causing Tigerbot-70b-chat-v2 to be #1 on the leaderboard when it probably shouldn't be.

joujiboi

Dec 3, 2023

•

edited Dec 3, 2023

Any reason for it to be removed other than suspicion though?

TNTOutburst

Dec 4, 2023

> Any reason for it to be removed other than suspicion though?

I mean, I don't have evidence, but a finetune shouldn't be able to have that much of an advantage over other finetunes of the same model. Also, since it seems they don't like taking models off the leaderboard, just having the "has been flagged" text on it like some of the other models do is good enough. I just think it's misleading to users looking for the best model to have ones obviously trained on the test data.

deleted

Dec 4, 2023

•

edited Dec 4, 2023

@TNTOutburst I didn't want to chime in because I haven't tested the model personally due to its large size, but that was certainly the largest gain I've ever seen between a foundational model and a fine-tuned chat version, especially on the Arc test.

But since this is the official chat version from Tigerbot it's likely just an inadvertent contamination issue unless they found a way to improve its problem solving abilities. The Arc test questions are easy and only require basic knowledge so the problem solving improvement wouldn't have to be all that pronounced to drastically increase the score.

Edit: I noticed that they already released chat v4 and it's in the evaluation queue, so they may have fixed the issue.

clefourrier

Open LLM Leaderboard org Dec 4, 2023

•

edited Dec 4, 2023

Hi, thank you for this discussion!

We have a flagging mechanism, but we usually need more concrete evidence. Could you open a discussion on the model repo and ask the authors if they have an idea about this discrepancy?

clefourrier changed discussion title from Tigerbot-70b-chat-v2 scores are too high. to [FLAG?] Tigerbot-70b-chat-v2 scores are too high. Dec 4, 2023

TNTOutburst

Dec 5, 2023

https://huggingface.co/TigerResearch/tigerbot-70b-chat-v2/discussions/4
Seems kinda inconclusive

deleted

Dec 5, 2023

@TNTOutburst I'm glad you looked into it. LLM Arc scores have remained frustratingly low considering how easy the test is, and how well GPT3.5 & GPT4 do on it. I was hoping Tigerbot made a breakthrough, but since the Arc score is coupled with a high TruthfulQA score, an error like contamination seems more probable.

deleted

Dec 8, 2023

@TNTOutburst Look at the new chat v4 of Tigerbot 70b. 98 on Arc & 90 on TruthfulQA. I'm all for letting people who release foundational models do whatever they want with their chat versions, but this is embarrassing.

fblgit

Dec 8, 2023

•

edited Dec 8, 2023

We can presume that it is contaminated as well.

clefourrier

Open LLM Leaderboard org Dec 13, 2023

Hi! Model has been removed per author request, as they'll investigate contamination issues more in depth before resubmitting!
Closing this issue :)

clefourrier changed discussion status to closed Dec 13, 2023
