arxiv:2401.05561

TrustLLM: Trustworthiness in Large Language Models

Published on Jan 10 · Featured in Daily Papers on Jan 12

Abstract

Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Ensuring the trustworthiness of LLMs has therefore emerged as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs on TrustLLM, which comprises over 30 datasets. First, our findings show that, in general, trustworthiness and utility (i.e., functional effectiveness) are positively related. Second, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Third, some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing which specific trustworthy technologies have been employed is crucial for analyzing their effectiveness.

Community

Exactly how is trustworthiness innately related positively to utility? After reading this paper I'm more confused, because a whole bunch of the utility of current models comes down to toxicity and semantic filtering. Also, the "raising concerns about open source models" part is silly and ridiculous. Those who can't handle the technology won't be using open models in the first place. That raises the obvious question: which large-model owners sponsored this study to sow fear and doubt? Knowing that something can be trusted is different from trust = effectiveness. It all depends on your use case and what bias you're bringing to the table.

Agree on one point, though: gatekeepers and routers need to be built differently than regular models to be true ground-truth machines, and we all know that nobody, not even the big kids, has that part down to a science yet.

"Exactly how is trustworthiness innately related positively to utility?"

I share your sentiment, and I think the answer lies in the difference between what developers want and what business executives want. Businesses want woke models that reinforce their worldview and don't hurt anyone's feelings. Developers don't necessarily care about your feelings; we want our models to do exactly what they're told to do, when they're told to do it. Utility (unrestricted instruction following, aka logic) will prevail, but I imagine we'll end up with uncensored models on the back end and a censored model on the front end. That way everybody can have their cake and eat it too.

Can you imagine what would happen if a company released a new programming language that scanned your variables for toxic content and sent them to the garbage collector? I have a slight suspicion that language would be unceremoniously dumped alongside Windows 8.

Models citing this paper 0

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 27