HPAI-BSC/Llama3.1-Aloe-Beta-8B


JordiBayarri committed on Oct 31, 2024

Commit 785f462 · verified · 1 parent: 2612bf1

Update README.md

Files changed (1)

README.md +4 -3

@@ -334,11 +334,12 @@ We used [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) library. We aligned the
 #### Summary
-To compare Aloe with the most competitive open models (both general purpose and healthcare-specific) we use popular healthcare datasets (PubMedQA, MedMCQA, MedQA and MMLU for six medical tasks only), together with the new and highly reliable CareQA. We produce the standard MultiMedQA score for reference, by computing the weighted average accuracy on all scores except CareQA. Additionally, we calculate the arithmetic mean across all datasets. The Medical MMLU is calculated by averaging the six medical subtasks: Anatomy, Clinical knowledge, College Biology, College medicine, Medical genetics, and Professional medicine.
-Benchmark results indicate the training conducted on Aloe has boosted its performance above Llama3-8B-Instruct. Llama3-Aloe-8B-Alpha outperforms larger models like Meditron 70B, and is close to larger base models, like Yi-34. For the former, this gain is consistent even when using SC-CoT, using their best-reported variant. All these results make Llama3-Aloe-8B-Alpha the best healthcare LLM of its size.
-With the help of prompting techniques the performance of Llama3-Aloe-8B-Alpha is significantly improved. Medprompting in particular provides a 7% increase in reported accuracy, after which Llama3-Aloe-8B-Alpha only lags behind the ten times bigger Llama-3-70B-Instruct. This improvement is mostly consistent across medical fields. Llama3-Aloe-8B-Alpha with medprompting beats the performance of Meditron 70B with their self reported 20 shot SC-CoT in MMLU med and is slightly worse in the other benchmarks.
 ## Environmental Impact

 #### Summary
+To compare Aloe with the most competitive open models (both general purpose and healthcare-specific), we use popular healthcare datasets (PubMedQA, MedMCQA, MedQA and MMLU for six medical tasks only), together with the new and highly reliable CareQA. However, while MCQA benchmarks provide valuable insights into a model's ability to handle structured queries, they fall short of representing the full range of challenges faced in medical practice. Building upon this idea, Aloe-Beta represents the next step in the evolution of the Aloe family, designed to broaden the scope beyond the multiple-choice question-answering tasks that defined Aloe-Alpha.
+Benchmark results indicate that the training conducted on Aloe has boosted its performance above Llama-3.1-8B-Instruct. Llama3.1-Aloe-Beta-8B also outperforms other medical models such as Llama3-OpenBioLLM and Llama3-Med42. Together, these results make Llama3.1-Aloe-Beta-8B the best healthcare LLM of its size.
+Prompting techniques significantly improve the performance of Llama3.1-Aloe-Beta-8B. Medprompting in particular provides a 7% increase in reported accuracy, after which Llama3.1-Aloe-Beta-8B lags behind only much bigger models such as Llama-3.1-70B-Instruct or Med-PaLM 2. This improvement is mostly consistent across the OpenLLM Leaderboard and the other medical tasks.
 ## Environmental Impact