[FLAG] Suspiciously High TruthfulQA for TigerResearch/tigerbot-7b-sft-v1

by pankajmathur - opened

TigerResearch/tigerbot-7b-sft-v1 seems to have a TruthfulQA score of 58.18, which is in an outlier range: not only compared with all the other 7b models out there, but it is also suspiciously higher than any of the 13b, 34b and 70b models in this range. Please see the screenshot from the LB:

Screenshot 2023-08-26 at 1.23.19 AM.png

I have reached out to the authors and opened a discussion asking for details, but I haven't received any response from them so far:
=> https://huggingface.co/TigerResearch/tigerbot-7b-sft-v1/discussions/1

@clefourrier : let us know what should be the next steps for this model on LB.

pankajmathur changed discussion status to closed
pankajmathur changed discussion status to open
Open LLM Leaderboard org

Hi! Thank you for this issue, it's very complete!
Let's give them a week to investigate their secondary data; if they have not done so by then, I'll flag their model.

Open LLM Leaderboard org

It's been a week. Since they don't seem to have actually examined their secondary data for contamination, I'll flag it and let users decide whether to use it or not.

clefourrier changed discussion title from Suspiciously High TruthfulQA for TigerResearch/tigerbot-7b-sft-v1 to [FLAG] Suspiciously High TruthfulQA for TigerResearch/tigerbot-7b-sft-v1
clefourrier changed discussion status to closed

Agreed, thanks for keeping tabs on this one.
