open-llm-leaderboard/open_llm_leaderboard · [FLAG] TigerResearch/tigerbot-70b-chat-v4-4k

Dec 8, 2023

98% ARC
98% HellaSwag
68% MMLU
89% TruthfulQA
74% Winogrande
83% GSM8K

This cannot be possible.

I guess this further confirms https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/414

Weyaxi

Dec 8, 2023

This seems to be impossible, lol. Tagging @clefourrier

yechen

Dec 8, 2023

Thanks for the notes. We noticed that irregularity as well. Although we performed an initial check, we need to investigate possible contamination further. Until we are sure and can upload again, we will take those models down. We suggest taking tigerbot-70b-v2/v4-4k off the leaderboard for now as well, while we double-check. Thanks.

JosephusCheung

Dec 8, 2023

I suggest that you not remove it, so that we can check whether something is wrong.

mirek190

Dec 8, 2023

edited Dec 8, 2023

Did the tigerbot repository disappear? The link from the leaderboard is broken.

TNTOutburst

Dec 8, 2023

@mirek190 Yeah, they're investigating the models. All four versions (v1, v1, v2, v4-4k) of tigerbot-70b-chat on the leaderboard have implausibly high ARC scores, as well as high scores on other benchmarks.

perlthoughts

Dec 10, 2023

Nice work everyone!

deleted

Dec 10, 2023

In my opinion all fine-tuning data and methods should be made publicly available so that data issues like contamination can be identified and proven (LLMs can be independently re-created and evaluated).

The only exceptions should be registered corporate and academic institutions like Stanford, Intel and Microsoft (e.g. Orca), because they have reputations to protect, which keeps them in line.
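
If fine-tuning data were public, one simple way to look for contamination would be an n-gram overlap check between a benchmark item and the training text. The sketch below is only an illustration under that assumption: the function names, whitespace tokenization, and the default `n=8` are hypothetical choices, not something any team in this thread has said they use.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_text, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    train_grams = ngrams(training_text, n)
    return len(item_grams & train_grams) / len(item_grams)
```

A ratio close to 1.0 for many test items would be a strong hint that the benchmark leaked into the training set. A low ratio does not prove the data is clean, since paraphrased or translated test items would not be caught by exact matching.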

JosephusCheung

Dec 10, 2023

edited Dec 10, 2023

Re-uploaded for further investigation: https://huggingface.co/Community-LM/tigerbot-70b-chat-v4-4k

We can still come up with a method to trace its merges with other Llama 2 70B models.

fblgit

Dec 10, 2023

Can someone from HF please remove them both from the leaderboard, as well as any models derived from them?

JosephusCheung

Dec 10, 2023

It was flagged, and all other cheat models are flagged, not removed.

deleted

Dec 10, 2023

@JosephusCheung Did you mean to say "cheat" or "chat"?

JosephusCheung

Dec 10, 2023

@JosephusCheung Did you mean to say "cheat" or "chat"?

XXXGGGNEt

Dec 10, 2023

Can someone from HF please remove them both from the leaderboard, as well as any models derived from them?

You want them removed because your model sits right between those flagged ones. However, it is also suspicious: you never explained any details of your models or the "UNA" method.
We shall discuss this in a new thread.

deleted

Dec 10, 2023

@XXXGGGNEt That's not a reasonable statement. Tigerbot scored 98 on ARC and HellaSwag, plus 89 on TruthfulQA. The models with UNA are only getting subtle boosts. These are separate issues that shouldn't be conflated.

XXXGGGNEt

Dec 10, 2023

edited Dec 10, 2023

@XXXGGGNEt That's not a reasonable statement. Tigerbot scored 98 on ARC and HellaSwag, plus 89 on TruthfulQA. The models with UNA are only getting subtle boosts. These are separate issues that shouldn't be conflated.

Then he should explain what "UNA" is.
We can discuss it in a new thread: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/444

And the boost is not subtle, it is huge.

mirek190

Dec 10, 2023

edited Dec 10, 2023

A new benchmark system should be developed.
Maybe questions could be generated randomly by GPT-4, or something like randomly generating questions from the whole wiki database on each test, etc.
With math it is even easier to generate random questions, or with coding problems.
Then contamination won't be possible.
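
As a rough illustration of the math case, here is a minimal sketch of a test set that is regenerated from a seed on each run, so there is no fixed question list to leak into training data. Everything in it (the function names, the three operators, the number ranges) is a hypothetical choice for the example, not an existing benchmark.

```python
import random


def make_question(rng):
    """Build one arithmetic question and its reference answer."""
    a = rng.randint(100, 999)
    b = rng.randint(100, 999)
    op = rng.choice(["+", "-", "*"])
    if op == "+":
        answer = a + b
    elif op == "-":
        answer = a - b
    else:
        answer = a * b
    return f"What is {a} {op} {b}?", answer


def make_test_set(seed, n=5):
    """Generate n questions; the same seed reproduces the same set for auditing."""
    rng = random.Random(seed)
    return [make_question(rng) for _ in range(n)]


def score(model_answer_fn, test_set):
    """Fraction of questions for which the model's answer equals the reference."""
    correct = sum(1 for question, answer in test_set if model_answer_fn(question) == answer)
    return correct / len(test_set)
```

Using a new seed per evaluation keeps the questions fresh, while publishing the seed afterwards lets anyone reproduce the exact set. This only covers tasks where the answer can be computed, which is why math and some coding problems are the easy cases.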

perlthoughts

Dec 10, 2023

edited Dec 10, 2023

Another great idea is a community-moderated question-and-answer leaderboard, so the community can vote on "tests" created by the community.

Edit: maybe not, as those could just be scraped by cheaters as well.

JosephusCheung

Dec 11, 2023

edited Dec 11, 2023

I was asked by tigerbot to remove the re-uploaded weights to avoid adverse effects from their investors.

I also asked their team to investigate downstream contamination (continued training and model merging), match homologous models layer by layer based on low-rank features, and clear the contaminated entries from the leaderboard, for example: https://huggingface.co/DopeorNope/COKAL-v1-70B

Please stay tuned for these changes.
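
To give a rough idea of what layer-by-layer matching could look like, here is a toy sketch. It is a simplified stand-in and not the low-rank feature method mentioned above: it assumes all three models share a common base (for example the same Llama 2 checkpoint), treats each layer as a flat list of floats, and compares the fine-tuning deltas by cosine similarity. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def delta(weights, base):
    """Element-wise difference between a fine-tuned layer and the base layer."""
    return [w - b for w, b in zip(weights, base)]


def delta_similarities(base, reference, candidate):
    """Per-layer cosine similarity between the two models' deltas from the base.

    Each argument maps a layer name to a flat list of floats (an assumption made
    for this sketch; real checkpoints would need to be loaded and flattened first).
    """
    result = {}
    for name in base:
        if name in reference and name in candidate:
            result[name] = cosine(
                delta(reference[name], base[name]),
                delta(candidate[name], base[name]),
            )
    return result
```

Layers where the candidate's delta points in nearly the same direction as the reference model's delta would be worth a closer look as possible merges or continued training. Unrelated fine-tunes of the same base would be expected to give values near zero.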

ff670

Dec 11, 2023

PS: The model could be downloaded at: https://www.modelscope.cn/models/TigerResearch/tigerbot-13b-chat-v4

JosephusCheung

Dec 11, 2023

edited Dec 11, 2023

PS: The model could be downloaded at: https://www.modelscope.cn/models/TigerResearch/tigerbot-13b-chat-v4

No, this is a 13B ver., not 70B
The 70B v4 was removed, and I suggest that there be no re-uploads, since they promised to fix the downstream issues.

ff670

Dec 11, 2023

PS: The model could be downloaded at: https://www.modelscope.cn/models/TigerResearch/tigerbot-13b-chat-v4

No, this is a 13B ver., not 70B

My bad, sorry for the misunderstanding.

clefourrier

Open LLM Leaderboard org Dec 11, 2023

Hi all!
Thank you for your attention and vigilance :)

The TigerBot team asked me (by email) to remove the two models from the leaderboard, as they are going to take some time to investigate their scores more in depth.

Therefore, I'm closing this issue :)

clefourrier changed discussion status to closed Dec 11, 2023

TNTOutburst

Dec 11, 2023

@clefourrier What about the two versions of TigerResearch/tigerbot-70b-chat-v3 still on the leaderboard? There is a decent jump in ARC between those models and the next best.