[FLAG] TigerResearch/tigerbot-70b-chat-v4-4k
98% ARC
98% Hellaswag
68% MMLU
89% Truthful
74% Wino
83% GSM
This cannot be possible.
And I guess this also confirms further https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/414
This seems to be impossible, lol. Tagging @clefourrier
Thanks for the notes. We noticed that irregularity as well. Although we performed an initial check, we need to investigate possible contamination further. Until we are sure and can upload again, we will take those models down for now. We suggest taking tigerbot-70b-v2/v4-4k off the leaderboard for now as well, until we double-check. Thanks.
I suggest that you should not remove it, and we can check if there is something wrong.
The tigerbot repository disappeared? The link from the leaderboard is broken.
@mirek190 Yeah, they're investigating the models. All four versions (v1, v1, v2, v4-4k) of tigerbot-70b-chat on the leaderboard have implausibly high ARC scores, as well as high scores on other benchmarks.
Nice work everyone!
In my opinion all fine-tuning data and methods should be made publicly available so that data issues like contamination can be identified and proven (LLMs can be independently re-created and evaluated).
The only exceptions should be registered corporate and academic institutions like Stanford, Intel and Microsoft (e.g. Orca) because they have reputations to protect, keeping them in line.
reuploaded for further investigation: https://huggingface.co/Community-LM/tigerbot-70b-chat-v4-4k
We can still come up with a method to trace its merges with other llama2 70Bs.
Can someone from HF remove them both from the leaderboard, as well as any model derived from them, please?
It was flagged, and all other cheat models are flagged, not removed.
@JosephusCheung Did you mean to say "cheat" or "chat"?
You want them removed because your model sits right between those flagged ones. However, it is also suspicious: you never explained any details of your models or the "UNA" method.
We shall discuss this in a new thread.
@XXXGGGNEt That's not a reasonable statement. Tigerbot scored 98 on Arc and HellaSwag, plus 89 on TruthfulQA. The models with UNA are only getting subtle boosts. These are separate issues that shouldn't be conflated.
Then he should explain what "UNA" is.
We can discuss it in a new thread: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/444
and the boost is not subtle, it is huge.
A new benchmark system should be developed.
Maybe questions should be generated randomly by GPT-4, or something like testing against the whole wiki database, with randomly generated questions on each test, etc.
With math it is even easier to generate random questions, and likewise with coding problems.
Then contamination won't be possible.
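To make the random-math idea above concrete, here is a minimal sketch, not an existing leaderboard feature. The names `Question`, `generate_questions`, and `accuracy` are hypothetical, and the model is stood in for by any function that maps a prompt string to an integer answer. Questions are drawn from a seed picked at evaluation time, so the exact test items cannot be known in advance.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    prompt: str
    answer: int


def generate_questions(n: int, seed: int) -> list[Question]:
    """Build n arithmetic questions from a seed chosen at evaluation time."""
    rng = random.Random(seed)
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }
    questions = []
    for _ in range(n):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        symbol = rng.choice(sorted(ops))  # sorted for a stable choice order
        questions.append(Question(f"What is {a} {symbol} {b}?", ops[symbol](a, b)))
    return questions


def accuracy(questions, answer_fn) -> float:
    """Fraction of questions where answer_fn(prompt) equals the true answer."""
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if answer_fn(q.prompt) == q.answer)
    return correct / len(questions)
```

Using the same seed for every model in one evaluation round keeps scores comparable, while a new seed per round gives a fresh question set. This only covers simple arithmetic; the wiki-based and coding variants would need a different generator.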
Another great idea is a community-moderated question-and-answer leaderboard, so the community can vote on "tests" created by the community.
Edit: maybe not, as those could just be scraped by cheaters as well.
I was asked by tigerbot to remove the re-uploaded weights to avoid adverse effects from their investors.
Then I also expressed a request for their team to investigate the contamination downstream (continue training and model merging), match homologous models by layer based on low-rank features and clear the contaminated leaderboard, for example: https://huggingface.co/DopeorNope/COKAL-v1-70B
Please stay tuned for these changes.
PS: The model could be downloaded at: https://www.modelscope.cn/models/TigerResearch/tigerbot-13b-chat-v4
No, this is a 13B ver., not 70B
70B v4 was removed, and I suggest that there should not be any reuploads as they promised to fix downstream issues.
My bad, sorry for the misunderstanding.
Hi all!
Thank you for your attention and vigilance :)
The TigerBot team asked me (by email) to remove the two models from the leaderboard, as they are going to take some time to investigate their scores more in depth.
Therefore, I'm closing this issue :)
@clefourrier What about the two versions of TigerResearch/tigerbot-70b-chat-v3 still on the leaderboard? There is a decent jump in ARC between those models and the next best.