arxiv:2602.09379

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Published on Jun 11

Submitted by Xu Shihao on Jun 24

Lyncia

Abstract

A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

Community

XuShihao6715 (paper submitter)

LingxiDiagBench: Benchmarking LLMs for Chinese Psychiatric Consultation and Diagnosis [Accepted by KDD 2026]

TL;DR: A large-scale multi-agent benchmark revealing that while LLMs can distinguish depression from anxiety with 92.3% accuracy, they struggle badly at 12-way differential diagnosis (28.5%) — and better conversational quality doesn't guarantee better diagnosis.

Dataset Link: https://huggingface.co/datasets/XuShihao6715/LingxiDiag-16K

The Problem

Mental health care faces a global workforce crisis. Psychiatric diagnosis depends on nuanced, multi-turn clinical interviews, yet existing AI benchmarks fall short in three key ways: they use template-based synthetic dialogues with little variability, omit the information needed for differential diagnosis, and rarely support dynamic multi-turn consultation evaluation.

What's New

This paper introduces LingxiDiagBench, the first large-scale, real-data-driven, multi-disease diagnostic benchmark for Chinese psychiatric AI. At its core is LingxiDiag-16K — 16,000 synthetic consultation dialogues generated from 1,709 real outpatient EMRs collected at Shanghai Mental Health Center, carefully preserving real clinical demographic and diagnostic distributions across 12 ICD-10 categories.

The benchmark covers two evaluation paradigms:

Static: Fixed dialogue transcripts for reproducible diagnosis and next-question prediction tasks
Dynamic: Real-time multi-turn consultation where LLMs act as Doctor Agents interviewing LLM-powered Patient Agents
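
To make the dynamic paradigm concrete, here is a minimal sketch of a doctor-patient turn loop. It is not the benchmark's implementation: the function names, the `DIAGNOSIS:` prefix convention, the turn budget, and the stub agents are all assumptions made for illustration, and the label `F32` is only an example ICD-10 code. In a real run the two callables would wrap LLM calls.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types and names, introduced only for this illustration.
Turn = Tuple[str, str]  # (speaker, utterance)
AgentFn = Callable[[List[Turn]], str]


def run_consultation(
    doctor: AgentFn, patient: AgentFn, max_turns: int = 10
) -> Tuple[List[Turn], Optional[str]]:
    """Alternate doctor and patient turns until the doctor emits a diagnosis
    or the turn budget runs out. Returns (transcript, diagnosis or None)."""
    transcript: List[Turn] = []
    for _ in range(max_turns):
        doctor_msg = doctor(transcript)
        transcript.append(("doctor", doctor_msg))
        # Assumed convention: the doctor ends the interview with "DIAGNOSIS: <label>".
        if doctor_msg.startswith("DIAGNOSIS:"):
            return transcript, doctor_msg[len("DIAGNOSIS:"):].strip()
        transcript.append(("patient", patient(transcript)))
    return transcript, None


# Stub agents standing in for LLM-backed Doctor and Patient Agents.
def stub_doctor(transcript: List[Turn]) -> str:
    questions = ["How has your mood been?", "How are you sleeping?"]
    asked = sum(1 for speaker, _ in transcript if speaker == "doctor")
    return questions[asked] if asked < len(questions) else "DIAGNOSIS: F32"


def stub_patient(transcript: List[Turn]) -> str:
    return "I have felt low and slept badly for weeks."


transcript, diagnosis = run_consultation(stub_doctor, stub_patient)
```

In this framing, the static paradigm corresponds to skipping the loop and handing a fixed transcript straight to the diagnosis step, which is why the two settings can give different results for the same model.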

Four doctor consultation strategies are compared: Free-form, Symptom-Tree, APA-Guided, and APA-Guided + MRD-RAG.

Key Findings

🟢 Binary classification (depression vs. anxiety) is largely solved — top models hit 92.3% accuracy
🟡 4-way classification (including comorbidity) drops to 43.0% — comorbidity recognition remains hard
🔴 12-way differential diagnosis hits only 28.5% — a substantial open challenge
⚠️ Dynamic < Static: Interactive consultation often underperforms static evaluation, suggesting that poor information-gathering strategies hurt downstream reasoning
🔍 Consultation quality ≠ Diagnostic accuracy: LLM-as-a-Judge scores correlate with diagnostic accuracy at only r = 0.43, showing that asking good questions and reaching correct diagnoses are only loosely coupled skills
✅ RAG helps: APA-Guided + MRD-RAG improves overall classification by ~5% over APA-Guided alone
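
For readers who want to reproduce a figure like r = 0.43, the sketch below computes a Pearson correlation between judge scores and accuracies using only the standard library. The numbers are made up, and the unit of analysis (per model, per strategy, or per dialogue) is an assumption; the post does not say how the pairs were formed.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one (judge score, diagnostic accuracy) pair per system.
judge_scores = [3.0, 4.0, 4.5, 2.5, 5.0]
accuracies = [0.30, 0.25, 0.40, 0.28, 0.33]
r = pearson_r(judge_scores, accuracies)
```

A value near 0.43 means the judge score explains under a fifth of the variance in accuracy (r squared is about 0.18), which is why a high consultation-quality score is a weak guarantee of a correct diagnosis.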

Why It Matters

LingxiDiagBench provides a standardized, reproducible platform to systematically evaluate and improve AI psychiatric diagnosis — something the field has been missing. The benchmark design is language-agnostic and grounded in international clinical standards (DSM-5/ICD-10), making it extensible beyond Chinese.

Benchmark Results Takeaways

📊 Static Evaluation — Best Model per Task

Performance on fixed consultation transcripts across both the synthetic (LingxiDiag-16K) and real clinical (LingxiDiag-Clinical) test sets:

Task	Best Model (Synthetic)	Acc (Synthetic)	Best Model (Real)	Acc (Real)
2-class (Depression vs. Anxiety)	Gemini-3-Flash	0.854	Qwen3-4B	0.887
4-class (+ Comorbidity + Others)	Grok-4.1-Fast	0.470	Qwen3-32B	0.524
12-class (Full ICD-10 Differential)	GPT-5-Mini	0.409	TF-IDF + SVM	0.320
12-class Top-3 Accuracy	TF-IDF + LR	0.645	Qwen3-4B	0.698
Overall Score	TF-IDF + LR	0.533	Qwen3-32B	0.548
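
The table reports both plain accuracy and Top-3 accuracy for the 12-class task. A minimal sketch of how the two relate, assuming each model returns a ranked list of candidate labels per case (the function name, the ICD-10 codes, and the data below are illustrative, not taken from the benchmark):

```python
from typing import List


def top_k_accuracy(ranked_predictions: List[List[str]], gold_labels: List[str], k: int) -> float:
    """Fraction of cases whose gold label appears among the first k ranked predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        raise ValueError("need at least one case")
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_labels) if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Hypothetical ranked predictions for four cases (most likely label first).
ranked = [
    ["F32", "F41", "F51"],
    ["F41", "F32", "F43"],
    ["F20", "F31", "F32"],
    ["F43", "F51", "F41"],
]
gold = ["F32", "F32", "F42", "F41"]

top1 = top_k_accuracy(ranked, gold, k=1)  # 0.25: only the first case is right at rank 1
top3 = top_k_accuracy(ranked, gold, k=3)  # 0.75: three of four gold labels are in the top 3
```

Top-3 accuracy can never be lower than Top-1 accuracy, so the gap between the 12-class rows in the table shows how often the correct diagnosis is on the shortlist without being ranked first.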

🤖 Dynamic Evaluation — Best Strategy per Dataset

Performance of the end-to-end consultation pipeline (Doctor Agent → Patient Agent → Diagnosis), across both data settings:

Strategy	Best Model	2-class Acc	4-class Acc	12-class Acc	Clf-Ovl
Synthetic (LingxiDiag-16K)
Free-form	Grok-4.1-Fast	88.6%	34.0%	25.5%	40.1%
Symptom-Tree	DeepSeek-V3.2	86.5%	31.0%	21.5%	38.0%
APA-Guided	DeepSeek-V3.2	88.5%	31.5%	23.0%	41.2%
APA-Guided + MRD-RAG	Grok-4.1-Fast	88.5%	43.0%	28.5%	45.4%
Real (LingxiDiag-Clinical)
Free-form	Qwen3-8B	88.8%	40.0%	43.0%	49.0%
Symptom-Tree	GPT-OSS-20B	91.2%	43.0%	44.5%	50.0%
APA-Guided	Qwen3-32B	80.0%	36.0%	46.5%	48.3%
APA-Guided + MRD-RAG	GPT-OSS-20B	78.8%	37.5%	45.5%	47.2%

🔁 Cross-Dataset Transfer — Does Synthetic Training Generalize to Real Data?

To validate that LingxiDiag-16K encodes clinically meaningful knowledge (not just surface statistics), models fine-tuned on synthetic data were evaluated on real clinical cases:

Model	12-class Acc (Real, Zero-shot)	12-class Acc (Real, +LoRA SFT)	Gain (percentage points)
Qwen3-8B	4.1%	41.4%	+37.3
Qwen3-32B	20.4%	39.7%	+19.3
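
The gains in this table are absolute differences in accuracy (percentage points), not relative improvements. The short sketch below recomputes them from the two accuracy columns and also shows the ratio for comparison; the helper names are hypothetical.

```python
def gain_pp(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


def improvement_ratio(before: float, after: float) -> float:
    """How many times higher the fine-tuned accuracy is than the zero-shot one."""
    return after / before


# (zero-shot accuracy %, accuracy % after LoRA SFT) copied from the table above.
results = {"Qwen3-8B": (4.1, 41.4), "Qwen3-32B": (20.4, 39.7)}

for model, (zero_shot, fine_tuned) in results.items():
    print(
        f"{model}: +{gain_pp(zero_shot, fine_tuned)} pp, "
        f"{improvement_ratio(zero_shot, fine_tuned):.1f}x zero-shot"
    )
```

Read this way, Qwen3-8B ends up at roughly ten times its zero-shot accuracy, while Qwen3-32B roughly doubles; the percentage-point gains alone understate how different the two starting points are.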

The authors emphasize that this benchmark is for research purposes only and must not be deployed in clinical settings without rigorous validation and human oversight.