epfl-llm/meditron-7b · Is Meditron-7b better than LLaMA2-7b?

Mar 8

When I run Meditron-7b and LLaMA2-7b on the multimedqa task in LM-Harness (which includes MedMCQA, MedQA, PubMedQA, and MMLU-medical),
the results show that LLaMA2 performs better.
What are your thoughts on this result?

zechen-nlp

EPFL LLM Team org Mar 8

Hello,

In general, it's very difficult for us to analyze the results given the limited information about the evaluation settings:

1. Did you fine-tune the models, or are you using zero-shot prompting with the base models?
2. What kind of inference mode are you using?
3. How are you parsing the answers?
4. Are you using in-context learning?
5. If 4 is true, are you running multiple runs with different in-context examples sampled with different random seeds? For example, PubMedQA has very large variance (15 - 50) under different in-context examples.
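To make the last point concrete, here is a minimal sketch of repeating an in-context evaluation over several random seeds and reporting the spread rather than a single number. It is not the harness's API or the setup used in the paper: `sample_few_shot`, `score_over_seeds`, and the `score_fn` callback are hypothetical names, and `score_fn` stands in for whatever code builds the prompt from the sampled examples, runs the model, and returns an accuracy.

```python
import random
import statistics
from typing import Callable, Sequence


def sample_few_shot(pool: Sequence[dict], k: int, seed: int) -> list:
    """Draw k in-context examples from the pool, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(list(pool), k)


def score_over_seeds(
    pool: Sequence[dict],
    k: int,
    seeds: Sequence[int],
    score_fn: Callable[[list], float],
) -> dict:
    """Evaluate once per seed and summarize the variation across runs.

    score_fn is a hypothetical callback: it receives the sampled few-shot
    examples and must return one accuracy value for the whole test set.
    """
    scores = [score_fn(sample_few_shot(pool, k, seed)) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    # Toy pool of training examples; a real pool would hold question/answer pairs.
    train_pool = [{"id": i} for i in range(100)]

    def placeholder_score(shots: list) -> float:
        # Placeholder only: replace with a real model evaluation.
        return 0.0

    summary = score_over_seeds(train_pool, k=3, seeds=[0, 1, 2, 3, 4],
                               score_fn=placeholder_score)
    print(summary)
```

Comparing two models on the mean and spread from the same set of seeds gives a fairer picture than a single run, since a gap smaller than the seed-to-seed variation may not be meaningful.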

We refer to our reported in-context learning results from the paper:

As you can see, on MedQA-5 and MedMCQA, Meditron-7B underperforms Llama-2-7b. The performance of the two models on MMLU-Medical and MedQA-4 is close. It is only after fine-tuning on the datasets that we observe a large performance gain.

djibe

May 30

Yes