---
license: gemma
datasets:
- FreedomIntelligence/ApolloMoEDataset
language:
- ar
- en
- zh
- ko
- ja
- mn
- th
- vi
- lo
- mg
- de
- pt
- es
- fr
- ru
- it
- hr
- gl
- cs
- co
- la
- uk
- bs
- bg
- eo
- sq
- da
- sa
- 'no'
- gn
- sr
- sk
- gd
- lb
- hi
- ku
- mt
- he
- ln
- bm
- sw
- ig
- rw
- ha
metrics:
- accuracy
base_model:
- google/gemma-2-9b
pipeline_tag: question-answering
tags:
- biology
- medical
---
# Democratizing Medical LLMs For Much More Languages

Covering 12 Major Languages including English, Chinese, French, Hindi, Spanish, Arabic, Russian, Japanese, Korean, German, Italian, Portuguese and 38 Minor Languages so far.

📃 Paper • 🌐 Demo • 🤗 ApolloMoEDataset • 🤗 ApolloMoEBench • 🤗 Models •🌐 Apollo • 🌐 ApolloMoE

![Apollo](assets/apollo_medium_final.png)

## 🌈 Update

* **[2024.10.15]** ApolloMoE repo is published! 🎉

## Languages Coverage

12 Major Languages and 38 Minor Languages


![ApolloMoE](assets/languages.png)

## Architecture


![ApolloMoE](assets/hybrid_routing.png)

## Results

#### Dense

🤗 Apollo2-0.5B • 🤗 Apollo2-1.5B • 🤗 Apollo2-2B • 🤗 Apollo2-3.8B • 🤗 Apollo2-7B • 🤗 Apollo2-9B


![ApolloMoE](assets/dense_results.png)

#### Post-MoE

🤗 Apollo-MoE-0.5B • 🤗 Apollo-MoE-1.5B • 🤗 Apollo-MoE-7B


![ApolloMoE](assets/post_moe_results.png)

## Usage Format

##### Apollo2

- 0.5B, 1.5B, 7B: User:{query}\nAssistant:{response}<|endoftext|>
- 2B, 9B: User:{query}\nAssistant:{response}\
- 3.8B: <|user|>\n{query}<|end|><|assistant|>\n{response}<|end|>

##### Apollo-MoE

- 0.5B, 1.5B, 7B: User:{query}\nAssistant:{response}<|endoftext|>

## Dataset & Evaluation

- Dataset 🤗 ApolloMoEDataset


![ApolloMoE](assets/Dataset.png)

- [Data category](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/tree/main/train)

- Evaluation 🤗 ApolloMoEBench


- EN:
  - [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options)
  - [MedMCQA](https://huggingface.co/datasets/medmcqa/viewer/default/test)
  - [PubMedQA](https://huggingface.co/datasets/pubmed_qa): Because the results fluctuated too much, they were not used in the paper.
  - [MMLU-Medical](https://huggingface.co/datasets/cais/mmlu)
    - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
- ZH:
  - [MedQA-MCMLE](https://huggingface.co/datasets/bigbio/med_qa/viewer/med_qa_zh_4options_bigbio_qa/test)
  - [CMB-single](https://huggingface.co/datasets/FreedomIntelligence/CMB): Not used in the paper
    - Randomly sample 2,000 multiple-choice questions with single answer.
  - [CMMLU-Medical](https://huggingface.co/datasets/haonan-li/cmmlu)
    - Anatomy, Clinical_knowledge, College_medicine, Genetics, Nutrition, Traditional_chinese_medicine, Virology
  - [CExam](https://github.com/williamliujl/CMExam): Not used in the paper
    - Randomly sample 2,000 multiple-choice questions
- ES: [Head_qa](https://huggingface.co/datasets/head_qa)
- FR:
  - [Frenchmedmcqa](https://github.com/qanastek/FrenchMedMCQA)
  - [MMLU_FR]
    - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
- HI: [MMLU_HI](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi)
  - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
- AR: [MMLU_AR](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic)
  - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
- JA: [IgakuQA](https://github.com/jungokasai/IgakuQA)
- KO: [KorMedMCQA](https://huggingface.co/datasets/sean0042/KorMedMCQA)
- IT:
  - [MedExpQA](https://huggingface.co/datasets/HiTZ/MedExpQA)
  - [MMLU_IT]
    - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
- DE: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): German part
- PT: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): Portuguese part
- RU: [RuMedBench](https://github.com/sb-ai-lab/MedBench)
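To run a model on the multiple-choice benchmarks listed here, each question first has to be wrapped in the prompt template from the Usage Format section. The following is a minimal sketch of that step, not the project's own preprocessing code. The names `APOLLO2_PROMPTS` and `build_prompt` are hypothetical, the option layout (`A. ...` on separate lines) is an assumption, and the 3.8B template is assumed to use the `<|assistant|>` spelling. Only the part of each template up to the assistant turn is used, since the model generates the response itself.

```python
# Prompt prefixes keyed by Apollo2 model size, copied from the Usage Format
# section up to the assistant turn. The dictionary and function names are
# illustrative only and are not part of the ApolloMoE repository.
APOLLO2_PROMPTS = {
    "0.5B": "User:{query}\nAssistant:",
    "1.5B": "User:{query}\nAssistant:",
    "2B": "User:{query}\nAssistant:",
    "7B": "User:{query}\nAssistant:",
    "9B": "User:{query}\nAssistant:",
    # Assumption: the assistant tag is spelled <|assistant|>.
    "3.8B": "<|user|>\n{query}<|end|><|assistant|>\n",
}


def build_prompt(size: str, question: str, options: dict) -> str:
    """Wrap a multiple-choice question in the template for one model size."""
    # Assumed layout: the question, then one "LETTER. text" line per option.
    option_lines = "\n".join(
        f"{key}. {text}" for key, text in sorted(options.items())
    )
    query = f"{question}\n{option_lines}"
    return APOLLO2_PROMPTS[size].format(query=query)


if __name__ == "__main__":
    print(build_prompt(
        "9B",
        "Which vitamin deficiency causes scurvy?",
        {"A": "Vitamin A", "B": "Vitamin C"},
    ))
```

The resulting string can be passed to the tokenizer in the same way as the raw prompt in the inference example.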

## Model Download and Inference

We take Apollo-MoE-0.5B as an example.

1. Log in to Hugging Face

    ```
    huggingface-cli login --token $HUGGINGFACE_TOKEN
    ```

2. Download the model to a local directory

    ```python
    from huggingface_hub import snapshot_download
    import os

    # Replace '/path/to/models/dir' with your own models directory.
    local_model_dir = os.path.join('/path/to/models/dir', 'Apollo-MoE-0.5B')
    snapshot_download(repo_id="FreedomIntelligence/Apollo-MoE-0.5B", local_dir=local_model_dir)
    ```

3. Inference example

    ```python
    from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
    import os

    local_model_dir = os.path.join('/path/to/models/dir', 'Apollo-MoE-0.5B')

    # trust_remote_code=True is needed because the MoE model ships custom code.
    model = AutoModelForCausalLM.from_pretrained(local_model_dir, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)

    # Greedy decoding (do_sample=False) of a short answer.
    generation_config = GenerationConfig.from_pretrained(
        local_model_dir,
        pad_token_id=tokenizer.pad_token_id,
        num_return_sequences=1,
        max_new_tokens=7,
        min_new_tokens=2,
        do_sample=False,
        temperature=1.0,
        top_k=50,
        top_p=1.0,
    )

    inputs = tokenizer(
        'Answer directly.\nThe capital of Mongolia is Ulaanbaatar.\nThe capital of Iceland is Reykjavik.\nThe capital of Australia is',
        return_tensors='pt',
    )
    inputs = inputs.to(model.device)

    pred = model.generate(**inputs, generation_config=generation_config)
    print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
    ```

## Results reproduction


We take Apollo2-7B or Apollo-MoE-0.5B as examples.

1. Download the dataset for the project:

    ```
    bash 0.download_data.sh
    ```

2. Prepare test and dev data for a specific model:

    - Create test data with the model's special tokens. The script name contains `&`, so it is quoted to stop the shell from treating it as a background operator.

    ```
    bash "1.data_process_test&dev.sh"
    ```

3. Prepare training data for a specific model (create tokenized data in advance):

    - You can adjust the data training order and the number of training epochs in this step

    ```
    bash 2.data_process_train.sh
    ```

4. Train the model

    - If you want to train on multiple nodes, please refer to ./src/sft/training_config/zero_multi.yaml

    ```
    bash 3.single_node_train.sh
    ```

5. Evaluate your model: generate scores for the benchmark

    ```
    bash 4.eval.sh
    ```
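The model card lists accuracy as its metric, and the benchmarks are multiple-choice. As a rough illustration of what the scoring step involves, the sketch below extracts an option letter from each generation and compares it with the gold answer. This is not the logic of `4.eval.sh`. The function names `extract_choice` and `accuracy` are hypothetical, and it assumes each generation holds only the newly generated text and that answers are single letters from A to E.

```python
import re
from typing import List, Optional


def extract_choice(generation: str, valid: str = "ABCDE") -> Optional[str]:
    """Return the first standalone option letter in the output, if any.

    Limitation: a standalone capital "A" used as an article would also match.
    """
    match = re.search(rf"\b([{valid}])\b", generation)
    return match.group(1) if match else None


def accuracy(generations: List[str], answers: List[str]) -> float:
    """Fraction of generations whose extracted letter equals the gold answer."""
    if len(generations) != len(answers):
        raise ValueError("generations and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        extract_choice(gen) == gold for gen, gold in zip(generations, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    outputs = ["B", "The answer is C", "D."]
    gold = ["B", "C", "A"]
    print(f"accuracy = {accuracy(outputs, gold):.3f}")
```

A generation with no recognizable letter is counted as wrong, which keeps the denominator equal to the number of questions.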

## Citation

Please use the following citation if you intend to use our dataset for training or evaluation:

```
@misc{zheng2024efficientlydemocratizingmedicalllms,
      title={Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts},
      author={Guorui Zheng and Xidong Wang and Juhao Liang and Nuo Chen and Yuping Zheng and Benyou Wang},
      year={2024},
      eprint={2410.10626},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.10626},
}
```