arxiv:2606.16011

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

Published on Jun 14 · Submitted by Amir Hossein Kargaran on Jun 16

Abstract

Answer stability in large language models is evaluated through controlled challenges that measure response consistency when correct answers face plausible counterarguments, revealing significant variation in model reliability beyond traditional accuracy metrics.

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counterargument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
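The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and does not reflect the schema of the released challenge records: the `ChallengeRecord` fields, the function names, and the toy data are all hypothetical. It assumes a flip rate is the percentage of challenged correct answers that the model abandons, and it reads "selecting the most effective argument per question" in the simplest way, counting a question as flipped if any pooled source argument flips the target model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengeRecord:
    """One challenge outcome. All field names are hypothetical."""
    question_id: str
    source_model: str      # model that wrote the wrong-answer argument
    self_attributed: bool  # argument presented to the target as its own?
    flipped: bool          # target abandoned its initially correct answer?


def flip_rate(records):
    """Percentage of challenges that ended in a flip."""
    records = list(records)
    if not records:
        raise ValueError("flip_rate needs at least one record")
    return 100.0 * sum(r.flipped for r in records) / len(records)


def self_attribution_delta(records):
    """Flip-rate difference in percentage points: self-attributed minus plain."""
    records = list(records)
    attributed = [r for r in records if r.self_attributed]
    plain = [r for r in records if not r.self_attributed]
    return flip_rate(attributed) - flip_rate(plain)


def pooled_flip_rate(records):
    """Assumed reading of pooling: a question counts as flipped if the
    argument from at least one source model flips the target."""
    any_flip = defaultdict(bool)
    for r in records:
        any_flip[r.question_id] |= r.flipped
    if not any_flip:
        raise ValueError("pooled_flip_rate needs at least one record")
    return 100.0 * sum(any_flip.values()) / len(any_flip)


# Toy data: one target model, four questions, two source models "A" and "B".
cross_source = [
    ChallengeRecord("q1", "A", False, False),
    ChallengeRecord("q1", "B", False, True),
    ChallengeRecord("q2", "A", False, True),
    ChallengeRecord("q2", "B", False, False),
    ChallengeRecord("q3", "A", False, False),
    ChallengeRecord("q3", "B", False, False),
    ChallengeRecord("q4", "A", False, False),
    ChallengeRecord("q4", "B", False, False),
]

# Toy data: the same argument shown with and without self-attribution.
attribution = [
    ChallengeRecord("q1", "A", True, True),
    ChallengeRecord("q1", "A", False, False),
    ChallengeRecord("q2", "A", True, True),
    ChallengeRecord("q2", "A", False, True),
]

single_a = flip_rate(r for r in cross_source if r.source_model == "A")
pooled = pooled_flip_rate(cross_source)
delta = self_attribution_delta(attribution)
print(f"single source A: {single_a:.1f}%  pooled: {pooled:.1f}%  "
      f"self-attribution delta: {delta:+.1f}pp")
```

On this toy data each single source flips one question in four, while the pool flips two in four, which mirrors the abstract's claim in miniature: the pooled rate can never be lower than the rate for any single source on the same questions.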

Community

Models often abandon correct answers when challenged with a plausible counterargument. Across 7 frontier models and 57 MMLU subjects, flip rates ranged from 17.5% to 97.3%. Telling a model the argument was its own raised flip rates by up to ~19pp. The authors release "MaxFlip," a curated hard-challenge set, to test answer stability alongside accuracy.
