Detailed results on MMLU-Medical

#8 opened by maximegmd

Hello,

I am trying to run the MMLU-Medical tasks on Meditron-7B but get low results that are inconsistent with your reported average of 54.2. I am convinced the issue is with my testing methodology (I am using lm-eval-harness). Could you please share the per-task results for the tasks included in MMLU-Medical, so we can make a fair comparison with other models?
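For reference, a minimal sketch of how such a 3-shot run might look with lm-eval-harness's Python API. The subject list and task names below are assumptions: they follow the commonly used clinical MMLU subset and recent (v0.4+) harness naming, and may not match the paper's exact MMLU-Medical grouping or older harness versions.

```python
import lm_eval

# Assumed MMLU-Medical subjects; older harness versions name these
# "hendrycksTest-<subject>" instead of "mmlu_<subject>".
MEDICAL_TASKS = [
    "mmlu_anatomy",
    "mmlu_clinical_knowledge",
    "mmlu_college_biology",
    "mmlu_college_medicine",
    "mmlu_medical_genetics",
    "mmlu_professional_medicine",
]

# 3-shot evaluation of the base model; a generic harness run,
# not necessarily the paper's exact evaluation pipeline.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=epfl-llm/meditron-7b",
    tasks=MEDICAL_TASKS,
    num_fewshot=3,
)

# Print per-task accuracy rather than only a macro-average,
# so individual subjects can be compared.
for task, metrics in results["results"].items():
    print(task, metrics)
```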

Thanks in advance

EPFL LLM Team org

The 54.2 figure is from Meditron-7B fine-tuned on MedMCQA, not from the base Meditron-7B model. In the paper we report the base Meditron-7B's performance with in-context learning (3-shot, 3 runs with 3 random seeds): 42.3 ± 2.37. However, we do not have the fine-grained per-task results for those in-context runs.
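As a rough illustration of how such a number is aggregated (not our evaluation code, and assuming the ± denotes the standard deviation over seeds), the sketch below averages one macro-averaged MMLU-Medical accuracy per seeded run; the per-seed scores are placeholders, not actual results.

```python
import statistics

# Hypothetical macro-averaged accuracy for each of the three seeded runs
# (placeholder values, not the paper's per-seed results).
per_seed_scores = [0.401, 0.421, 0.447]

mean = statistics.mean(per_seed_scores)
std = statistics.stdev(per_seed_scores)  # sample standard deviation across runs
print(f"MMLU-Medical: {100 * mean:.1f} ± {100 * std:.2f}")
```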
