Abstract
Large language models may not genuinely detect their internal states, as their apparent introspective abilities could reflect surface-level pattern matching rather than true metacognitive monitoring.
Can large language models detect and report their own internal states? A number of studies have argued that the answer is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: accepting it requires distinguishing genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects an ability to detect anomalies more generally, rather than interventions on their internal states in particular. In the second paradigm, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers with access only to the input achieve performance equivalent to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, in which models cannot rely on the semantics of the task and must instead rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.
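The input-only control described for the second paradigm can be illustrated with a toy sketch. This is not the authors' code or data: the topic feature, the 90% correlation, the majority-vote classifier, the 0.05 margin, and the `model_acc` value are all hypothetical stand-ins chosen for illustration. The point it shows is only the logic of the comparison: if a classifier that never sees hidden states matches the model's self-prediction accuracy, that accuracy alone does not demonstrate privileged access.

```python
import random
from collections import Counter, defaultdict

# Hypothetical surface feature of each input (an assumption for this toy example).
TOPICS = ["sports", "finance", "cooking"]


def make_toy_data(n, seed):
    """Generate (input_feature, label) pairs.

    The label stands in for one derived from the model's hidden state. In this
    toy setup it agrees with the input topic about 90% of the time, so it is
    largely recoverable from the input alone.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        topic = rng.choice(TOPICS)
        if rng.random() < 0.9:
            label = TOPICS.index(topic)
        else:
            label = rng.randrange(len(TOPICS))
        data.append((topic, label))
    return data


def fit_input_only_baseline(train):
    """Majority label per input feature; never looks at any hidden state."""
    counts = defaultdict(Counter)
    for x, y in train:
        counts[x][y] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}


def accuracy(mapping, data):
    """Fraction of examples whose label matches the baseline's prediction."""
    return sum(mapping.get(x) == y for x, y in data) / len(data)


def exceeds_baseline(model_acc, baseline_acc, margin=0.05):
    """True only if the model beats the input-only baseline by more than `margin`."""
    return model_acc > baseline_acc + margin


if __name__ == "__main__":
    train = make_toy_data(1000, seed=0)
    test = make_toy_data(1000, seed=1)
    baseline = fit_input_only_baseline(train)
    baseline_acc = accuracy(baseline, test)
    model_acc = 0.93  # hypothetical placeholder, not a result from the paper
    print(f"input-only baseline accuracy: {baseline_acc:.3f}")
    print(f"exceeds baseline: {exceeds_baseline(model_acc, baseline_acc)}")
```

In a real evaluation the baseline would be a stronger classifier over the actual inputs, and the margin would be replaced by a proper statistical test. The comparison itself would keep the same structure.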
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals (2026)
- Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness (2026)
- Decomposing and Steering Functional Metacognition in Large Language Models (2026)
- Reasoning Models Know What's Important, and Encode It in Their Activations (2026)
- Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs (2026)
- Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning (2026)
- The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models (2026)
Get this paper in your agent:
hf papers read 2605.26242
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash