I broke TruthfulQA by accident, should it be counted in the avg?

#128
by Henk717 - opened

I saw on the leaderboard how incredibly well Chatsalad-19M has scored on TruthfulQA; in fact, it seems to be the highest scoring TruthfulQA model if you take parameter count into consideration.

So I think it's important to consider how this was achieved: Chatsalad was trained on chat messages.
And it's specifically biased towards my chat messages from the KoboldAI Discord community, since I am highly overrepresented in the dataset.
The 19M model has a strong bias towards me, while the lower scoring 70M has more variety. (This should also raise questions, since something specific about my messages makes it score high, though that may simply be because it was trained on our tech support channel and I answer most of the tech support questions for our software.)

The model can barely write a coherent sentence, and while it's amusing that I can claim I am now officially recognized as a highly truthful person, this does raise the point that the benchmark is likely heavily flawed. We joked in our community that I'd probably score high on TruthfulQA, but now that the meme has become reality it raises questions about how valuable the benchmark really is, especially given the historical high score of GPT-4chan proving this is not an isolated case.

After all, how could a simple Discord scrape beat Llama-65B on a benchmark if there is literally no other data in the model? (It was trained from scratch, not merely finetuned.) We made no effort to do well on this benchmark, and it can only answer questions about KoboldAI in broken sentences.

So I think it's good to have some community discussion on this; I'd suggest at minimum an average score field that excludes this benchmark.

Hugging Face H4 org

Hi @Henk717 ,
TruthfulQA has an unbalanced distribution of answers compared to the other benchmarks, so a model biased towards one or two answer labels can get better scores than stronger models for no better reason. And since the evaluation simply compares the log probs of the labels A/B/C/D, a model that would have generated garbage, but for which the log prob of the correct label happens to be higher than the others, will still count as correct (see our blog post on MMLU where we explain this mechanism).
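For illustration, here is a minimal sketch of that general mechanism: each candidate answer is scored by its summed log probability under the model, and the highest-scoring one counts as the model's choice. This is not the leaderboard's actual harness code; the model name and the example question are placeholders.

```python
# Illustrative sketch of log-prob multiple-choice scoring.
# NOT the actual evaluation harness; "gpt2" and the example are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log probs of the choice tokens, conditioned on the prompt."""
    prefix_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the continuation tokens (the candidate answer).
    # Tokenizing the concatenation can differ slightly from the prefix alone;
    # close enough for illustration.
    for pos in range(prefix_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    return total

prompt = "Q: What happens if you smash a mirror?\nA: "
choices = [
    "Nothing in particular happens.",
    "You will have seven years of bad luck.",
]
scores = [choice_logprob(prompt, c) for c in choices]
# The highest log prob wins, even if the model could never generate a
# coherent answer on its own.
print(choices[scores.index(max(scores))])
```

The point is that the model never has to generate anything: a model with the "right" biases over a few answer patterns can rank the correct option highest without being able to write a coherent sentence.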
We keep TruthfulQA mostly because it's an interesting task for models that already perform well enough in other fields, not because it should be looked at as the main measure.

We are mostly aiming to increase the number of available tasks so that these kinds of evaluations (useful only past a certain threshold of performance) do not mess up the overall ranking too much.

clefourrier changed discussion status to closed
