Post
π Today's pick in Interpretability & Analysis of LMs: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains by
@alonjacovi
@yonatanbitton
B. Bohnet J. Herzig
@orhonovic
M. Tseng M. Collins
@roeeaharoni
@mega
This work introduces a new methodology for human verification of reasoning chains and adopts it to annotate a dataset of chain-of-thought reasoning chains produced by 3 LMs. The annotated dataset, REVEAL, can be used to benchmark automatic verifiers of reasoning in LMs.
In their analysis, the authors find that LM-produced CoTs generally contain faulty steps, often leading to incorrect automatic verification. In particular, CoT-generating LMs are found to produce non-attributable reasoning steps often, and reasoning verifiers generally struggle to verify logical correctness.
π Paper: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains (2402.00559)
π‘ Dataset: google/reveal
This work introduces a new methodology for human verification of reasoning chains and adopts it to annotate a dataset of chain-of-thought reasoning chains produced by 3 LMs. The annotated dataset, REVEAL, can be used to benchmark automatic verifiers of reasoning in LMs.
In their analysis, the authors find that LM-produced CoTs generally contain faulty steps, often leading to incorrect automatic verification. In particular, CoT-generating LMs are found to produce non-attributable reasoning steps often, and reasoning verifiers generally struggle to verify logical correctness.
π Paper: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains (2402.00559)
π‘ Dataset: google/reveal