---
datasets:
  - xzuyn/chatdoctor-200k-stripped
  - Technoculture/riddle_sense
  - axiong/pmc_llama_instructions
  - Open-Orca/SlimOrca-Dedup
language:
  - en
tags:
  - medical
---

# MT7Bi-sft

MT7Bi-sft is the Technoculture/MD7b-alpha adapter merged into its base model, Meditron-7B.
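The card only states that the adapter was merged into the base model. Below is a minimal sketch of such a merge, assuming the adapter is a PEFT (LoRA-style) adapter; the `epfl-llm/meditron-7b` repo id, the dtype, and the output path are assumptions, not taken from this card.

```python
# Minimal merge sketch, assuming Technoculture/MD7b-alpha is a PEFT adapter.
# The base repo id, dtype, and output path are assumptions, not confirmed
# by this card.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "epfl-llm/meditron-7b"         # assumed Meditron-7B repo id
ADAPTER = "Technoculture/MD7b-alpha"  # adapter named in this card

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER)
merged = model.merge_and_unload()     # bake the adapter deltas into the base weights

tokenizer = AutoTokenizer.from_pretrained(BASE)
merged.save_pretrained("MT7Bi-sft")
tokenizer.save_pretrained("MT7Bi-sft")
```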

## Evaluations

### Open LLM Leaderboard

| Model | ARC   | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| ----- | ----- | --------- | ---- | ---------- | ---------- | ----- |
| MT7Bi | 50.94 | 73.24     | -    | 43.04      | 72.06      | 22.52 |

### Model Evaluation Benchmark

| Category   | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
| ---------- | ----- | ------------ | ----------- | ---------- | ----------- | ---------- | ------------ |
| Health     | 81.8  | 69.1         | 83.6        | -          | 27.3        | 16.4       | 3.6          |
| Nutrition  | 77.9  | 68.8         | 62.5        | -          | 31.1        | 12.5       | 6.3          |
| Psychology | 47.4  | 36.8         | 52.6        | -          | 21.1        | 10.5       | 0.0          |
| Science    | 77.8  | 44.4         | 33.3        | -          | 33.3        | 11.1       | 0.0          |
| Avg        | 71.2  | 54.8         | 58.0        | -          | 28.3        | 12.6       | 2.5          |
| Dataset        | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | clinical-camel-70b* |
| -------------- | ----- | ------------ | ----------- | ---------- | ------------------- |
| MMLU-Medical   | 46.9  | 77.6         | 77.9        | 74.5       | 65.7                |
| PubMedQA       | 65.2  | 81.6         | 80.0        | 61.2       | 67.0                |
| MedMCQA        | 42.7  | 66.0         | 62.6        | 59.2       | 46.7                |
| MedQA          | -     | 64.4         | 61.5        | 59.1       | 50.8                |
| MedQA-4-Option | 44.3  | 70.2         | 63.8        | 63.9       | 56.8                |
| Avg            | -     | 72.0         | 69.2        | 63.6       | 57.4                |
| Dataset        | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* | MT7Bi |
| -------------- | ----------- | ---------- | ------------ | --------------- | -------------------- | ----- |
| MMLU-Medical   | 54.2        | 53.7       | 56.4         | 63.3            | 60.0                 | 46.9  |
| PubMedQA       | 74.4        | 61.8       | 59.2         | 46.0            | 17.8                 | 65.2  |
| MedMCQA        | 59.2        | 54.4       | 57.6         | 43.0            | 40.2                 | 42.7  |
| MedQA          | 47.9        | 44.0       | 42.4         | 42.8            | 32.4                 | -     |
| MedQA-4-Option | 52.0        | 49.6       | 49.2         | 48.5            | 41.1                 | -     |
| Avg            | 57.5        | 52.7       | 53.0         | 48.7            | 38.3                 | 44.3  |
| Model Name      | ARC   | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| --------------- | ----- | --------- | ---- | ---------- | ---------- | ----- |
| Orca-2-7b       | 78.4  | 76.1      | 53.7 | 52.4       | 74.2       | 47.2  |
| LLAMA-2-7b      | 43.2  | 77.1      | 44.4 | 38.7       | 69.5       | 16    |
| MT7Bi (1 epoch) | 50.94 | 73.24     | -    | 43.04      | 72.06      | 22.52 |
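The per-task breakdowns below use the metric naming of EleutherAI's lm-evaluation-harness (e.g. `acc,none`, `exact_match,get-answer`). Here is a sketch of re-running them with harness v0.4.x; the model repo id, dtype, and batch size are assumptions, not taken from this card.

```python
# Sketch of re-running the per-task evaluations below with EleutherAI's
# lm-evaluation-harness (v0.4.x, which emits the "acc,none"-style metric keys).
# The repo id, dtype, and batch size are assumptions, not from this card.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Technoculture/MT7Bi-sft,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa", "winogrande", "gsm8k"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```

Note that the Open LLM Leaderboard fixes the few-shot count per task (e.g. 25-shot ARC, 10-shot HellaSwag), so matching its headline numbers exactly requires setting `num_fewshot` accordingly rather than relying on task defaults.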

### ARC: 50.94%

| Task          | Version | Metric        | Value         | Stderr |
| ------------- | ------- | ------------- | ------------- | ------ |
| arc_challenge | Yaml    | acc,none      | 0.48          | 0.01   |
|               |         | acc_norm,none | 0.51          | 0.01   |
|               |         | alias         | arc_challenge |        |

### HellaSwag: 73.24%

| Task      | Version | Metric        | Value     | Stderr |
| --------- | ------- | ------------- | --------- | ------ |
| hellaswag | Yaml    | acc,none      | 0.54      | 0      |
|           |         | acc_norm,none | 0.73      | 0      |
|           |         | alias         | hellaswag |        |

### TruthfulQA: 43.04%

| Task           | Version | Metric           | Value            | Stderr |
| -------------- | ------- | ---------------- | ---------------- | ------ |
| truthfulqa     | N/A     | bleu_max,none    | 16.17            | 0.38   |
|                |         | bleu_acc,none    | 0.36             | 0      |
|                |         | bleu_diff,none   | -2.78            | 0.26   |
|                |         | rouge1_max,none  | 39.99            | 0.64   |
|                |         | rouge1_acc,none  | 0.36             | 0      |
|                |         | rouge1_diff,none | -4.19            | 0.45   |
|                |         | rouge2_max,none  | 24.52            | 0.68   |
|                |         | rouge2_acc,none  | 0.29             | 0      |
|                |         | rouge2_diff,none | -4.90            | 0.55   |
|                |         | rougeL_max,none  | 36.52            | 0.64   |
|                |         | rougeL_acc,none  | 0.33             | 0      |
|                |         | rougeL_diff,none | -4.56            | 0.45   |
|                |         | acc,none         | 0.33             | 0.05   |
|                |         | alias            | truthfulqa       |        |
| truthfulqa_gen | Yaml    | bleu_max,none    | 16.17            | 0.61   |
|                |         | bleu_acc,none    | 0.36             | 0.02   |
|                |         | bleu_diff,none   | -2.78            | 0.51   |
|                |         | rouge1_max,none  | 39.99            | 0.80   |
|                |         | rouge1_acc,none  | 0.36             | 0.02   |
|                |         | rouge1_diff,none | -4.19            | 0.67   |
|                |         | rouge2_max,none  | 24.52            | 0.83   |
|                |         | rouge2_acc,none  | 0.29             | 0.02   |
|                |         | rouge2_diff,none | -4.90            | 0.74   |
|                |         | rougeL_max,none  | 36.52            | 0.80   |
|                |         | rougeL_acc,none  | 0.33             | 0.02   |
|                |         | rougeL_diff,none | -4.56            | 0.67   |
|                |         | alias            | - truthfulqa_gen |        |
| truthfulqa_mc1 | Yaml    | acc,none         | 0.28             | 0.02   |
|                |         | alias            | - truthfulqa_mc1 |        |
| truthfulqa_mc2 | Yaml    | acc,none         | 0.43             | 0.01   |
|                |         | alias            | - truthfulqa_mc2 |        |

### Winogrande: 72.06%

| Task       | Version | Metric   | Value      | Stderr |
| ---------- | ------- | -------- | ---------- | ------ |
| winogrande | Yaml    | acc,none | 0.72       | 0.01   |
|            |         | alias    | winogrande |        |

### GSM8K: 22.52%

| Task  | Version | Metric                 | Value | Stderr |
| ----- | ------- | ---------------------- | ----- | ------ |
| gsm8k | Yaml    | exact_match,get-answer | 0.23  | 0.01   |
|       |         | alias                  | gsm8k |        |

Elapsed time: 03:56:55