Is Meditron-7b better than LLaMA2-7b?

by sean0042 - opened

When I run Meditron-7b and LLaMA2-7b on LM-Harness's multimedqa task (including MedMCQA, MedQA, PubMedQA, and MMLU-medical),
the results show that LLaMA2 is better.
What are your thoughts on this result?

EPFL LLM Team org


In general, it's very difficult for us to analyze these results given how little we know about the evaluation settings:

  1. Did you finetune the models or are you using zero-shot prompting with the base models?
  2. What kind of inference mode are you using?
  3. How are you parsing the answers?
  4. Are you using in-context learning?
  5. If you answered yes to 4, are you averaging over multiple runs, with different in-context examples sampled under different random seeds? For example, PubMedQA has very large variance (15 - 50) across different in-context examples.
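To make the last point concrete, here is a minimal sketch of how one might measure the spread across in-context example draws. It is not the evaluation code used for the paper or by LM-Harness; `sample_few_shot`, `accuracy_over_seeds`, and `score_fn` are hypothetical names, and `score_fn` stands in for whatever routine builds the prompt from the sampled examples, runs the model, and returns an accuracy.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence


def sample_few_shot(pool: Sequence[str], k: int, seed: int) -> List[str]:
    """Draw k in-context examples from the pool, reproducibly for a given seed."""
    rng = random.Random(seed)  # private RNG so global random state is untouched
    return rng.sample(list(pool), k)


def accuracy_over_seeds(
    pool: Sequence[str],
    k: int,
    seeds: Sequence[int],
    score_fn: Callable[[List[str]], float],
) -> Dict[str, object]:
    """Score once per seed and summarise the spread of the results.

    score_fn is an assumption: it should take the sampled in-context
    examples, evaluate the model with them, and return one accuracy value.
    """
    scores = [score_fn(sample_few_shot(pool, k, seed)) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # stdev needs at least two data points
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

Reporting the mean together with the standard deviation (or the min and max) over several seeds makes it easier to tell whether a gap between two models is larger than the noise introduced by the choice of in-context examples.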

We refer you to the in-context learning results reported in our paper:

As the paper's results show, Meditron-7B underperforms Llama-2-7b on MedQA-5 and MedMCQA, while the two models perform similarly on MMLU-Medical and MedQA-4. It is only after fine-tuning on these datasets that we observe a large performance gain.


