Medical LLM evaluation, preference ranking, harm-aware evaluation, annotator disagreement, trustworthy AI, responsible AI, and evaluation datasets.