LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
Abstract
A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy.
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
Community
LingxiDiagBench: Benchmarking LLMs for Chinese Psychiatric Consultation and Diagnosis [Accepted by KDD 2026]
TL;DR: A large-scale multi-agent benchmark revealing that while LLMs can distinguish depression from anxiety with up to 92.3% accuracy, they struggle badly at 12-way differential diagnosis (28.5%), and better conversational quality doesn't guarantee better diagnosis.
Dataset Link: https://huggingface.co/datasets/XuShihao6715/LingxiDiag-16K
The Problem
Mental health care faces a global workforce crisis. Psychiatric diagnosis depends on nuanced, multi-turn clinical interviews, yet existing AI benchmarks fall short in three key ways: they use template-based synthetic dialogues with little variability, omit the information needed for differential diagnosis, and rarely support dynamic multi-turn consultation evaluation.
What's New
This paper introduces LingxiDiagBench, the first large-scale, real-data-driven, multi-disease diagnostic benchmark for Chinese psychiatric AI. At its core is LingxiDiag-16K: 16,000 synthetic consultation dialogues generated from 1,709 real outpatient EMRs collected at Shanghai Mental Health Center, carefully preserving real clinical demographic and diagnostic distributions across 12 ICD-10 categories.
The benchmark covers two evaluation paradigms:
- Static: Fixed dialogue transcripts for reproducible diagnosis and next-question prediction tasks
- Dynamic: Real-time multi-turn consultation where LLMs act as Doctor Agents interviewing LLM-powered Patient Agents
Four doctor consultation strategies are compared: Free-form, Symptom-Tree, APA-Guided, and APA-Guided + MRD-RAG.
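The dynamic paradigm can be pictured as a simple turn-taking loop. The sketch below is a minimal illustration, not the benchmark's actual implementation: the `Consultation` class, the `<END>` stop token, and the scripted doctor, patient, and keyword-based diagnosis functions are all hypothetical stand-ins for what would be LLM calls in the real framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


@dataclass
class Consultation:
    """Hypothetical doctor/patient loop; each callable would wrap an LLM in practice."""
    ask: Callable[[List[Turn]], str]       # doctor: history -> next question
    answer: Callable[[List[Turn]], str]    # patient: history -> reply
    diagnose: Callable[[List[Turn]], str]  # doctor: history -> final label
    max_turns: int = 10
    history: List[Turn] = field(default_factory=list)

    def run(self) -> str:
        for _ in range(self.max_turns):
            question = self.ask(self.history)
            if question == "<END>":  # assumed stop signal from the doctor
                break
            self.history.append(("doctor", question))
            self.history.append(("patient", self.answer(self.history)))
        return self.diagnose(self.history)


def scripted_doctor(questions: List[str]) -> Callable[[List[Turn]], str]:
    """Stand-in for a consultation strategy: asks a fixed list of questions."""
    def ask(history: List[Turn]) -> str:
        asked = sum(1 for speaker, _ in history if speaker == "doctor")
        return questions[asked] if asked < len(questions) else "<END>"
    return ask


def scripted_patient(replies: Dict[str, str]) -> Callable[[List[Turn]], str]:
    """Stand-in for a Patient Agent: looks up a reply to the last question."""
    def answer(history: List[Turn]) -> str:
        return replies.get(history[-1][1], "I'm not sure.")
    return answer


def keyword_diagnosis(history: List[Turn]) -> str:
    """Toy diagnosis rule for illustration only."""
    text = " ".join(u for s, u in history if s == "patient").lower()
    return "depression" if "low mood" in text else "other"


session = Consultation(
    ask=scripted_doctor(["How has your mood been?", "How is your sleep?"]),
    answer=scripted_patient({"How has your mood been?": "Low mood for two months."}),
    diagnose=keyword_diagnosis,
    max_turns=5,
)
label = session.run()
```

Because the final label depends entirely on what the doctor chose to ask, a weak questioning policy can cap diagnostic accuracy no matter how strong the reasoning step is, which is the failure mode the dynamic setting is designed to expose.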
Key Findings
- 🟢 Binary classification (depression vs. anxiety) is largely solved: top models reach 92.3% accuracy
- 🟡 4-way classification (including comorbidity) drops to 43.0%: comorbidity recognition remains hard
- 🔴 12-way differential diagnosis reaches only 28.5%, a substantial open challenge
- ⚠️ Dynamic < Static: Interactive consultation often underperforms static evaluation, suggesting that poor information-gathering strategies hurt downstream reasoning
- Consultation quality ≠ Diagnostic accuracy: LLM-as-a-Judge scores correlate with diagnostic accuracy at only r = 0.43, showing that asking good questions and reaching correct diagnoses are only loosely coupled skills
- RAG helps: APA-Guided + MRD-RAG improves overall classification by ~5% over APA-Guided alone
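For readers who want to check a quality-versus-accuracy relationship like the r = 0.43 figure on their own runs, a Pearson correlation can be computed as sketched below. This assumes the reported r is a Pearson coefficient, which the summary does not state explicitly; the function name and the sample numbers are illustrative placeholders, not values from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers for illustration only (NOT results from the paper):
# one judge score and one diagnostic accuracy per evaluated configuration.
judge_scores = [3.1, 3.8, 4.0, 4.4, 4.6]
diagnostic_acc = [0.22, 0.30, 0.21, 0.33, 0.27]
r = pearson_r(judge_scores, diagnostic_acc)
```

A coefficient well below 1 means that ranking systems by judged consultation quality would not reliably reproduce their ranking by diagnostic accuracy.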
Why It Matters
LingxiDiagBench provides a standardized, reproducible platform to systematically evaluate and improve AI psychiatric diagnosis, something the field has been missing. The benchmark design is language-agnostic and grounded in international clinical standards (DSM-5/ICD-10), making it extensible beyond Chinese.
Benchmark Results Takeaways
Static Evaluation: Best Model per Task
Performance on fixed consultation transcripts across both the synthetic (LingxiDiag-16K) and real clinical (LingxiDiag-Clinical) test sets:
| Task | Best Model (Synthetic) | Acc (Synthetic) | Best Model (Real) | Acc (Real) |
|---|---|---|---|---|
| 2-class (Depression vs. Anxiety) | Gemini-3-Flash | 0.854 | Qwen3-4B | 0.887 |
| 4-class (+ Comorbidity + Others) | Grok-4.1-Fast | 0.470 | Qwen3-32B | 0.524 |
| 12-class (Full ICD-10 Differential) | GPT-5-Mini | 0.409 | TF-IDF + SVM | 0.320 |
| 12-class Top-3 Accuracy | TF-IDF + LR | 0.645 | Qwen3-4B | 0.698 |
| Overall Score | TF-IDF + LR | 0.533 | Qwen3-32B | 0.548 |
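The table distinguishes plain 12-class accuracy from Top-3 accuracy. As a rough sketch of the difference, assuming each model returns a ranked list of candidate diagnoses per case, the two metrics can be computed as follows. The function name is hypothetical and the ICD-10 codes are illustrative examples, not the benchmark's actual label set.

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   gold_labels: Sequence[str],
                   k: int) -> float:
    """Fraction of cases whose gold label appears among the top-k predictions."""
    if len(ranked_predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("need non-empty inputs of equal length")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for ranked, gold in zip(ranked_predictions, gold_labels)
               if gold in ranked[:k])
    return hits / len(gold_labels)


# Illustrative ranked outputs for four cases (codes chosen as examples only).
ranked: List[List[str]] = [
    ["F32", "F41", "F20"],
    ["F41", "F32", "F43"],
    ["F20", "F31", "F32"],
    ["F41", "F43", "F32"],
]
gold = ["F32", "F32", "F41", "F32"]

top1 = top_k_accuracy(ranked, gold, k=1)  # only the first case is correct
top3 = top_k_accuracy(ranked, gold, k=3)  # cases 1, 2 and 4 contain the gold label
```

Top-3 accuracy is always at least as high as top-1 accuracy on the same predictions, which is why the Top-3 row sits well above the 12-class row.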
🤖 Dynamic Evaluation: Best Strategy per Dataset
Performance of the end-to-end consultation pipeline (Doctor Agent → Patient Agent → Diagnosis), across both data settings:
| Strategy | Best Model | 2-class Acc | 4-class Acc | 12-class Acc | Overall Classification (Clf-Ovl) |
|---|---|---|---|---|---|
| Synthetic (LingxiDiag-16K) | | | | | |
| Free-form | Grok-4.1-Fast | 88.6% | 34.0% | 25.5% | 40.1% |
| Symptom-Tree | DeepSeek-V3.2 | 86.5% | 31.0% | 21.5% | 38.0% |
| APA-Guided | DeepSeek-V3.2 | 88.5% | 31.5% | 23.0% | 41.2% |
| APA-Guided + MRD-RAG | Grok-4.1-Fast | 88.5% | 43.0% | 28.5% | 45.4% |
| Real (LingxiDiag-Clinical) | | | | | |
| Free-form | Qwen3-8B | 88.8% | 40.0% | 43.0% | 49.0% |
| Symptom-Tree | GPT-OSS-20B | 91.2% | 43.0% | 44.5% | 50.0% |
| APA-Guided | Qwen3-32B | 80.0% | 36.0% | 46.5% | 48.3% |
| APA-Guided + MRD-RAG | GPT-OSS-20B | 78.8% | 37.5% | 45.5% | 47.2% |
Cross-Dataset Transfer: Does Synthetic Training Generalize to Real Data?
To validate that LingxiDiag-16K encodes clinically meaningful knowledge (not just surface statistics), models fine-tuned on synthetic data were evaluated on real clinical cases:
| Model | 12-class Acc (Real, Zero-shot) | 12-class Acc (Real, +LoRA SFT) | Gain (percentage points) |
|---|---|---|---|
| Qwen3-8B | 4.1% | 41.4% | +37.3 |
| Qwen3-32B | 20.4% | 39.7% | +19.3 |
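The Gain column is the absolute difference between the two accuracy columns, that is, a difference in percentage points rather than a relative improvement. The short sketch below recomputes it from the table values. The helper names are hypothetical, and the relative factor is an extra derived quantity that the table does not report.

```python
# Accuracy values (in percent) copied from the table above:
# model -> (zero-shot accuracy, accuracy after LoRA SFT on synthetic data)
results = {
    "Qwen3-8B": (4.1, 41.4),
    "Qwen3-32B": (20.4, 39.7),
}


def absolute_gain(before: float, after: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


def relative_factor(before: float, after: float) -> float:
    """How many times higher the fine-tuned accuracy is (not reported in the table)."""
    return after / before


gains = {model: absolute_gain(*pair) for model, pair in results.items()}
factors = {model: relative_factor(*pair) for model, pair in results.items()}
```

Under this reading, the smaller model's accuracy rises roughly tenfold while the larger model's roughly doubles, so the two models end up at similar accuracy after fine-tuning despite very different starting points.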
The authors emphasize that this benchmark is for research purposes only and must not be deployed in clinical settings without rigorous validation and human oversight.
