arxiv:2606.29985

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Published on Jun 29

Submitted by Lee SangMook on Jul 1

Seoul National University

Authors:

Sangmook Lee, Minbeom Kim

Abstract

Approach-level diversity in LLM mathematical reasoning captures strategic variation in problem-solving methods, revealing limitations of surface-level diversity metrics and highlighting challenges in directly optimizing diverse reasoning approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.
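To make the distinction in the abstract concrete, the sketch below contrasts a surface-level score with an approach-level count on a toy candidate set. It is an illustration only, not the paper's method: `surface_diversity` (mean pairwise Jaccard distance over word sets) is a stand-in for a generic surface metric, and `approach_diversity` assumes that each solution already carries an approach label, for example one assigned by an LLM judge, plus a correctness flag. All function names and labels here are hypothetical.

```python
from itertools import combinations


def surface_diversity(solutions):
    """Mean pairwise Jaccard distance over word sets.

    Assumption: a simple stand-in for surface-level diversity metrics,
    not the specific metrics evaluated in the paper.
    """
    token_sets = [set(s.lower().split()) for s in solutions]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    distances = []
    for a, b in pairs:
        union = a | b
        distances.append(1.0 - len(a & b) / len(union) if union else 0.0)
    return sum(distances) / len(distances)


def approach_diversity(approach_labels, is_correct):
    """Count distinct approach labels among correct solutions only.

    Assumption: labels come from an external judge; this function does
    not infer them from the solution text.
    """
    return len({label for label, ok in zip(approach_labels, is_correct) if ok})


# Toy example: three differently worded solutions, two underlying strategies.
solutions = [
    "Factor the quadratic into (x-2)(x-3) and read off the roots.",
    "We can rewrite it as a product, (x-2)(x-3), so the roots follow directly.",
    "Apply the quadratic formula with a=1, b=-5, c=6.",
]
labels = ["factoring", "factoring", "quadratic_formula"]  # hypothetical judge output
correct = [True, True, True]

print(surface_diversity(solutions))          # wording differs across all three
print(approach_diversity(labels, correct))   # only two distinct strategies
```

In this toy set the first two solutions are worded differently but share a strategy, so a wording-based score can be high while the approach count stays low, which is the kind of mismatch the abstract describes.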

Community

sangmook12 (paper author, paper submitter), about 11 hours ago

When LLMs appear to give diverse math solutions, are they truly exploring different strategies—or merely rephrasing the same one? We address this question through approach-level diversity, which captures whether solutions differ in how they solve the problem, not just in how they are written.