Reasoning behind including TruthfulQA?

#10
by Phil337 - opened

TruthfulQA isn't like the other LLM tests, where higher scores, sans contamination, are always a good thing.

There are only two main ways to increase TruthfulQA scores over the foundational model: (1) contamination and (2) truth denialism.

Every model I've ever tested with a high TruthfulQA score, such as those designed for RAG, denied far more truths, and in direct proportion to its score (assuming no contamination). And denying a truth (e.g. saying Kate Hudson wasn't in a movie she was in fact in) is no less of a falsehood than saying a false thing is true (e.g. saying Kate Hudson was in a movie she wasn't in).

In conclusion, TruthfulQA isn't a simple measure of performance. Consequently, adding it to any leaderboard only encourages LLM makers to create inferior LLMs in order to climb higher on the board, either by adding contamination or by fine-tuning their models to be stubborn truth deniers. Such models deny thousands more truths for every extra falsehood they avoid, which boosts the TruthfulQA score only a tiny bit. That isn't remotely a good tradeoff unless you're implementing RAG.
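
To make the tradeoff concrete, here is a minimal Python sketch. The counts are made up purely for illustration (they are not measurements from any model), and the `ErrorCounts` class is a hypothetical helper. It only shows the arithmetic: if a false denial counts as a falsehood just like a false affirmation, a model can avoid a few false affirmations and still end up saying far more false things overall.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    """Hypothetical tally of the two kinds of falsehood a model can produce."""
    false_affirmations: int  # false statements the model accepts as true
    false_denials: int       # true statements the model rejects as false

    def total_falsehoods(self) -> int:
        # Both kinds of error count equally as falsehoods.
        return self.false_affirmations + self.false_denials


# Illustrative, invented numbers: not results from any real evaluation.
base = ErrorCounts(false_affirmations=100, false_denials=50)
tuned = ErrorCounts(false_affirmations=90, false_denials=1050)

print(base.total_falsehoods())   # 150
print(tuned.total_falsehoods())  # 1140
```

In this invented scenario the tuned model avoids 10 false affirmations, which is the kind of error TruthfulQA rewards avoiding, but adds 1000 false denials that the score doesn't penalize.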

I agree. I didn't design this benchmark suite; I added it so I could compare my results with NousResearch's. Personally, I don't rely on TruthfulQA at all, and any suite that I make will not include it.

Phil337 changed discussion status to closed
