Comparing Logits-Based and Text-Based Evaluation Methods

#8
by iseesaw

Thank you for this wonderful project! I have some concerns about the evaluation method currently in use. It appears that the project relies on lm-evaluation-harness, which scores multiple-choice tasks by computing the likelihood of each answer option (logits-based). While this approach is suitable for pre-trained base models, it may not be the best fit for evaluating chat/instruction-tuned models.
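
For concreteness, here is a minimal sketch of what logits-based scoring looks like. The model name and prompt format are illustrative assumptions, not the harness's exact implementation:

```python
# Minimal sketch of logits-based multiple-choice scoring, similar in spirit
# to lm-evaluation-harness. Model and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Q: What is the capital of France?\nA:"
options = [" Paris", " London", " Berlin", " Madrid"]

def option_loglikelihood(context: str, continuation: str) -> float:
    """Sum the log-probabilities the model assigns to the continuation tokens.

    Assumes tokenizing `context` yields a prefix of tokenizing
    `context + continuation`, which holds for typical BPE tokenizers here.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_ids = full_ids[0, ctx_ids.shape[1]:]
    cont_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(cont_positions, cont_ids))

scores = [option_loglikelihood(question, opt) for opt in options]
prediction = options[scores.index(max(scores))]
print(prediction)  # the highest-likelihood option is taken as the answer
```

Note that the model never has to produce a well-formed answer here; it only has to assign relatively more probability mass to the correct option string.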

For chat/instruction-tuned models, a method that extracts the answer from freely generated text (text-based) could offer a more accurate reflection of the model's true performance [1]. I am unsure whether such a method is currently used in your evaluations.
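
A text-based evaluation, by contrast, might look roughly like the following sketch; the prompt wording, the extraction regex, and the model are again placeholder assumptions rather than any particular benchmark's implementation:

```python
# Minimal sketch of text-based evaluation: let the model generate freely,
# then extract the chosen option letter from the generated text.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice this would be a chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Question: What is the capital of France?\n"
    "A. London\nB. Paris\nC. Berlin\nD. Madrid\n"
    "Answer with the letter of the correct option.\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
completion = tokenizer.decode(
    output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
)

# Extract the first standalone option letter from the free-form answer.
match = re.search(r"\b([ABCD])\b", completion)
prediction = match.group(1) if match else None
print(prediction)
```

The extraction step is where text-based evaluation gets fragile: if the model answers in a format the regex does not anticipate, the response is scored as wrong even when the content is correct, which is part of what [1] examines.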

Additionally, the choice of chat template, and how closely it matches the format the model was trained on, can significantly influence the results. How are these factors accounted for in your evaluation process? For example, with the transformers API, whether or not a chat template is applied changes the exact string the model is scored on, as the sketch below shows.
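
The model name here is just an example of a chat model with a template; the point is only to show how the evaluated prompt differs:

```python
# Illustration of how a chat template changes the prompt a model is
# evaluated on. Whether this template matches the training format is
# exactly the concern raised above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example chat model
messages = [
    {"role": "user", "content": "What is the capital of France?\nA. London\nB. Paris"}
]

# With the template, the prompt includes the role markers the model was
# fine-tuned on; without it, the raw question text is scored directly.
templated = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(templated)
```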

I look forward to your insights on this topic. Thank you!

[1] Wang X., Hu C., Ma B., et al. "Look at the Text: Instruction-Tuned Language Models Are More Robust Multiple Choice Selectors Than You Think." arXiv preprint arXiv:2404.08382, 2024.
