JordiBayarri commited on
Commit
785f462
verified
1 Parent(s): 2612bf1

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +4 -3
README.md CHANGED
@@ -334,11 +334,12 @@ We used [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) library. We aligned the
334
 
335
  #### Summary
336
 
337
- To compare Aloe with the most competitive open models (both general purpose and healthcare-specific) we use popular healthcare datasets (PubMedQA, MedMCQA, MedQA and MMLU for six medical tasks only), together with the new and highly reliable CareQA. We produce the standard MultiMedQA score for reference, by computing the weighted average accuracy on all scores except CareQA. Additionally, we calculate the arithmetic mean across all datasets. The Medical MMLU is calculated by averaging the six medical subtasks: Anatomy, Clinical knowledge, College Biology, College medicine, Medical genetics, and Professional medicine.
338
 
339
- Benchmark results indicate the training conducted on Aloe has boosted its performance above Llama3-8B-Instruct. Llama3-Aloe-8B-Alpha outperforms larger models like Meditron 70B, and is close to larger base models, like Yi-34. For the former, this gain is consistent even when using SC-CoT, using their best-reported variant. All these results make Llama3-Aloe-8B-Alpha the best healthcare LLM of its size.
340
 
341
- With the help of prompting techniques the performance of Llama3-Aloe-8B-Alpha is significantly improved. Medprompting in particular provides a 7% increase in reported accuracy, after which Llama3-Aloe-8B-Alpha only lags behind the ten times bigger Llama-3-70B-Instruct. This improvement is mostly consistent across medical fields. Llama3-Aloe-8B-Alpha with medprompting beats the performance of Meditron 70B with their self reported 20 shot SC-CoT in MMLU med and is slightly worse in the other benchmarks.
 
 
342
 
343
  ## Environmental Impact
344
 
 
334
 
335
  #### Summary
336
 
337
+ To compare Aloe with the most competitive open models (both general purpose and healthcare-specific) we use popular healthcare datasets (PubMedQA, MedMCQA, MedQA and MMLU for six medical tasks only), together with the new and highly reliable CareQA. However, while MCQA benchmarks provide valuable insights into a model's ability to handle structured queries, they fall short in representing the full range of challenges faced in medical practice. Building upon this idea, Aloe-Beta represents the next step in the evolution of the Aloe Family, designed to broaden the scope beyond the multiple-choice question-answering tasks that defined Aloe-Alpha.
338
 
 
339
 
340
+ Benchmark results indicate the training conducted on Aloe has boosted its performance above Llama31-8B-Instruct. Llama31-Aloe-Beta-8B also outperforms other medical models like Llama3-OpenBioLLM and Llama3-Med42. All these results make Llama31-Aloe-8B-Beta the best healthcare LLM of its size.
341
+
342
+ With the help of prompting techniques the performance of Llama3-Aloe-8B-Beta is significantly improved. Medprompting in particular provides a 7% increase in reported accuracy, after which Llama31-Aloe-8B-Beta only lags behind much bigger models like Llama-3.1-70B-Instruct or MedPalm-2. This improvement is mostly consistent across the OpenLLM Leaderboard and the other medical tasks.
343
 
344
  ## Environmental Impact
345