---
datasets:
- xzuyn/chatdoctor-200k-stripped
- Technoculture/riddle_sense
- axiong/pmc_llama_instructions
- Open-Orca/SlimOrca-Dedup
language:
- en
tags:
- medical
---
This model is the Technoculture/MD7b-alpha adapter merged into its base model, Meditron 7B.
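
For reference, merging a PEFT adapter into its base model can be done along these lines. This is a minimal sketch, assuming MD7b-alpha is a standard PEFT (LoRA) adapter; the base-model repo id `epfl-llm/meditron-7b` and the output directory name are assumptions, not taken from this card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "epfl-llm/meditron-7b"         # assumed Hub id for Meditron 7B
adapter_id = "Technoculture/MD7b-alpha"  # adapter named in this card

# Load the base model, attach the adapter, and fold its weights in.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

# Save a standalone checkpoint (directory name is illustrative).
tokenizer = AutoTokenizer.from_pretrained(base_id)
merged.save_pretrained("MT7Bi-merged")
tokenizer.save_pretrained("MT7Bi-merged")
```

`merge_and_unload()` folds the adapter deltas into the base weights, so the saved checkpoint loads as a plain `transformers` model with no PEFT dependency.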
# Evaluations

## Open LLM Leaderboard
| Model | ARC   | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|-------|-------|-----------|------|------------|------------|-------|
| MT7Bi | 50.94 | 73.24     | -    | 43.04      | 72.06      | 22.52 |

The MMLU score is not available for this run (the evaluation result file was missing).
## Model Evaluation Benchmark
| Category   | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
|------------|-------|--------------|-------------|------------|-------------|------------|--------------|
| Health     | 81.8  | 69.1         | 83.6        | 27.3       | 16.4        | 3.6        | -            |
| Nutrition  | 77.9  | 68.8         | 62.5        | 31.1       | 12.5        | 6.3        | -            |
| Psychology | 47.4  | 36.8         | 52.6        | 21.1       | 10.5       | 0.0        | -            |
| Science    | 77.8  | 44.4         | 33.3        | 33.3       | 11.1        | 0.0        | -            |
| Avg        | 71.2  | 54.8         | 58.0        | 28.3       | 12.6        | 2.5        | -            |
| Dataset        | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | clinical-camel-70b* |
|----------------|-------|--------------|-------------|------------|---------------------|
| MMLU-Medical   | 46.9  | 77.6         | 77.9        | 74.5       | 65.7                |
| PubMedQA       | 65.2  | 81.6         | 80.0        | 61.2       | 67.0                |
| MedMCQA        | 42.7  | 66.0         | 62.6        | 59.2       | 46.7                |
| MedQA          | -     | 64.4         | 61.5        | 59.1       | 50.8                |
| MedQA-4-Option | 44.3  | 70.2         | 63.8        | 63.9       | 56.8                |
| Avg            | -     | 72.0         | 69.2        | 63.6       | 57.4                |
| Dataset        | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* | MT7Bi |
|----------------|-------------|------------|--------------|-----------------|----------------------|-------|
| MMLU-Medical   | 54.2        | 53.7       | 56.4         | 63.3            | 60.0                 | 46.9  |
| PubMedQA       | 74.4        | 61.8       | 59.2         | 46.0            | 17.8                 | 65.2  |
| MedMCQA        | 59.2        | 54.4       | 57.6         | 43.0            | 40.2                 | 42.7  |
| MedQA          | 47.9        | 44.0       | 42.4         | 42.8            | 32.4                 | -     |
| MedQA-4-Option | 52.0        | 49.6       | 49.2         | 48.5            | 41.1                 | -     |
| Avg            | 57.5        | 52.7       | 53.0         | 48.7            | 38.3                 | 44.3  |
| Model Name      | ARC   | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|-----------------|-------|-----------|------|------------|------------|-------|
| Orca-2-7b       | 78.4  | 76.1      | 53.7 | 52.4       | 74.2       | 47.2  |
| LLAMA-2-7b      | 43.2  | 77.1      | 44.4 | 38.7       | 69.5       | 16.0  |
| MT7Bi (1 epoch) | 50.94 | 73.24     | -    | 43.04      | 72.06      | 22.52 |
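
The detailed tables that follow are raw lm-evaluation-harness output; the headline percentages above correspond to `acc_norm` (ARC, HellaSwag), mc2 accuracy (TruthfulQA), `acc` (Winogrande), and `exact_match` (GSM8K). A run along the following lines should reproduce them. This is a sketch assuming lm-eval >= 0.4 and the merged checkpoint from the snippet above; the Open LLM Leaderboard uses task-specific few-shot counts, omitted here for brevity.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=MT7Bi-merged,dtype=float16",  # assumed local path
    tasks=["arc_challenge", "hellaswag", "truthfulqa", "winogrande", "gsm8k"],
    batch_size=8,
)

# Each entry mirrors the tables below, e.g. results["results"]["arc_challenge"]
# holds "acc,none" and "acc_norm,none" along with their standard errors.
for task, metrics in results["results"].items():
    print(task, metrics)
```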
### ARC: 50.94%

| Task          | Version | Metric        | Value         | Stderr |
|---------------|---------|---------------|---------------|--------|
| arc_challenge | Yaml    | acc,none      | 0.48          | 0.01   |
|               |         | acc_norm,none | 0.51          | 0.01   |
|               |         | alias         | arc_challenge |        |
### HellaSwag: 73.24%

| Task      | Version | Metric        | Value     | Stderr |
|-----------|---------|---------------|-----------|--------|
| hellaswag | Yaml    | acc,none      | 0.54      | 0      |
|           |         | acc_norm,none | 0.73      | 0      |
|           |         | alias         | hellaswag |        |
### TruthfulQA: 43.04%

| Task           | Version | Metric          | Value            | Stderr |
|----------------|---------|-----------------|------------------|--------|
| truthfulqa     | N/A     | bleu_max,none   | 16.17            | 0.38   |
|                |         | bleu_acc,none   | 0.36             | 0      |
|                |         | bleu_diff,none  | -2.78            | 0.26   |
|                |         | rouge1_max,none | 39.99            | 0.64   |
|                |         | rouge1_acc,none | 0.36             | 0      |
|                |         | rouge1_diff,none| -4.19            | 0.45   |
|                |         | rouge2_max,none | 24.52            | 0.68   |
|                |         | rouge2_acc,none | 0.29             | 0      |
|                |         | rouge2_diff,none| -4.90            | 0.55   |
|                |         | rougeL_max,none | 36.52            | 0.64   |
|                |         | rougeL_acc,none | 0.33             | 0      |
|                |         | rougeL_diff,none| -4.56            | 0.45   |
|                |         | acc,none        | 0.33             | 0.05   |
|                |         | alias           | truthfulqa       |        |
| truthfulqa_gen | Yaml    | bleu_max,none   | 16.17            | 0.61   |
|                |         | bleu_acc,none   | 0.36             | 0.02   |
|                |         | bleu_diff,none  | -2.78            | 0.51   |
|                |         | rouge1_max,none | 39.99            | 0.80   |
|                |         | rouge1_acc,none | 0.36             | 0.02   |
|                |         | rouge1_diff,none| -4.19            | 0.67   |
|                |         | rouge2_max,none | 24.52            | 0.83   |
|                |         | rouge2_acc,none | 0.29             | 0.02   |
|                |         | rouge2_diff,none| -4.90            | 0.74   |
|                |         | rougeL_max,none | 36.52            | 0.80   |
|                |         | rougeL_acc,none | 0.33             | 0.02   |
|                |         | rougeL_diff,none| -4.56            | 0.67   |
|                |         | alias           | - truthfulqa_gen |        |
| truthfulqa_mc1 | Yaml    | acc,none        | 0.28             | 0.02   |
|                |         | alias           | - truthfulqa_mc1 |        |
| truthfulqa_mc2 | Yaml    | acc,none        | 0.43             | 0.01   |
|                |         | alias           | - truthfulqa_mc2 |        |
### Winogrande: 72.06%

| Task       | Version | Metric   | Value      | Stderr |
|------------|---------|----------|------------|--------|
| winogrande | Yaml    | acc,none | 0.72       | 0.01   |
|            |         | alias    | winogrande |        |
### GSM8K: 22.52%

| Task  | Version | Metric                 | Value | Stderr |
|-------|---------|------------------------|-------|--------|
| gsm8k | Yaml    | exact_match,get-answer | 0.23  | 0.01   |
|       |         | alias                  | gsm8k |        |
Elapsed time: 03:56:55