Tags: Text Generation, Transformers, PyTorch, English, llama, medical, Inference Endpoints, text-generation-inference

This model is the Technoculture/MT7Bi-alpha adapter merged with its base model, Meditron 7B.

Evaluations

Open LLM Leaderboard

| Model | ARC | HellaSwag | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|
| MT7Bi-sft (epoch 4) | 54.1 | 75.11 | 43.08 | 72.14 | 15.54 |
| MT7Bi-sft (epoch 1) | 50.94 | 73.24 | 43.04 | 72.06 | 22.52 |
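
To make the epoch 1 versus epoch 4 comparison easier to read, the following minimal sketch computes the per-benchmark change and an unweighted mean over the five reported benchmarks. The mean is an assumption of this sketch, not a figure from the leaderboard: MMLU is not reported for these checkpoints, so it is not the official leaderboard average.

```python
# Scores copied from the Open LLM Leaderboard table above.
BENCHMARKS = ["ARC", "HellaSwag", "TruthfulQA", "Winogrande", "GSM8K"]
epoch_4 = [54.1, 75.11, 43.08, 72.14, 15.54]
epoch_1 = [50.94, 73.24, 43.04, 72.06, 22.52]

# Change from epoch 1 to epoch 4, in percentage points.
deltas = {
    name: round(e4 - e1, 2)
    for name, e4, e1 in zip(BENCHMARKS, epoch_4, epoch_1)
}

# Unweighted mean over the five reported benchmarks only
# (an illustrative summary, not the official leaderboard average).
mean_epoch_4 = round(sum(epoch_4) / len(epoch_4), 2)
mean_epoch_1 = round(sum(epoch_1) / len(epoch_1), 2)

for name, delta in deltas.items():
    print(f"{name:<11}{delta:+.2f}")
print(f"mean: epoch 4 = {mean_epoch_4}, epoch 1 = {mean_epoch_1}")
```

Under these numbers, ARC and HellaSwag improve with further training while GSM8K drops by about 7 points, which leaves the five-benchmark mean slightly lower at epoch 4.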

Model Evaluation Benchmark

| Category | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
|---|---|---|---|---|---|---|---|
| Health | | 81.8 | 69.1 | 83.6 | 27.3 | 16.4 | 3.6 |
| Nutrition | | 77.9 | 68.8 | 62.5 | 31.1 | 12.5 | 6.3 |
| Psychology | | 47.4 | 36.8 | 52.6 | 21.1 | 10.5 | 0.0 |
| Science | | 77.8 | 44.4 | 33.3 | 33.3 | 11.1 | 0.0 |
| Avg | | 71.2 | 54.8 | 58.0 | 28.3 | 12.6 | 2.5 |

| Dataset | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | clinical-camel-70b* |
|---|---|---|---|---|---|
| MMLU-Medical | 46.9 | 77.6 | 77.9 | 74.5 | 65.7 |
| PubMedQA | 65.2 | 81.6 | 80.0 | 61.2 | 67.0 |
| MedMCQA | 42.7 | 66.0 | 62.6 | 59.2 | 46.7 |
| MedQA | | 64.4 | 61.5 | 59.1 | 50.8 |
| MedQA-4-Option | 44.3 | 70.2 | 63.8 | 63.9 | 56.8 |
| Avg | | 72.0 | 69.2 | 63.6 | 57.4 |

| Dataset | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* | MT7Bi |
|---|---|---|---|---|---|---|
| MMLU-Medical | 54.2 | 53.7 | 56.4 | 63.3 | 60.0 | 46.9 |
| PubMedQA | 74.4 | 61.8 | 59.2 | 46.0 | 17.8 | 65.2 |
| MedMCQA | 59.2 | 54.4 | 57.6 | 43.0 | 40.2 | 42.7 |
| MedQA | 47.9 | 44.0 | 42.4 | 42.8 | 32.4 | |
| MedQA-4-Option | 52.0 | 49.6 | 49.2 | 48.5 | 41.1 | 44.3 |
| Avg | 57.5 | 52.7 | 53.0 | 48.7 | 38.3 | |

| Model Name | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|
| Orca-2-7b | 78.4 | 76.1 | 53.7 | 52.4 | 74.2 | 47.2 |
| LLAMA-2-7b | 43.2 | 77.1 | 44.4 | 38.7 | 69.5 | 16 |
| MT7Bi-sft | 54.1 | 75.11 | - | 43.08 | 72.14 | 15.54 |
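
The Avg row in the 7B medical comparison appears to be the unweighted mean of the five dataset scores, rounded to one decimal. The sketch below recomputes it from the values above under that assumption. MT7Bi has no MedQA score, so its five-dataset average is left undefined. The four-dataset mean computed at the end is an illustrative figure introduced here; it is not comparable to the other models' Avg values.

```python
from statistics import mean

DATASETS = ["MMLU-Medical", "PubMedQA", "MedMCQA", "MedQA", "MedQA-4-Option"]

# Scores in DATASETS order; None marks a score the card leaves blank.
scores = {
    "meditron-7b": [54.2, 74.4, 59.2, 47.9, 52.0],
    "llama-2-7b": [53.7, 61.8, 54.4, 44.0, 49.6],
    "pmc-llama-7b": [56.4, 59.2, 57.6, 42.4, 49.2],
    "Zephyr-7B-beta*": [63.3, 46.0, 43.0, 42.8, 48.5],
    "Mistral-7B-instruct*": [60.0, 17.8, 40.2, 32.4, 41.1],
    "MT7Bi": [46.9, 65.2, 42.7, None, 44.3],
}


def average(values):
    """Return the mean rounded to one decimal, or None if any score is missing."""
    if any(v is None for v in values):
        return None
    return round(mean(values), 1)


averages = {model: average(vals) for model, vals in scores.items()}

# Illustrative only: mean over the four datasets MT7Bi does report.
mt7bi_reported = [v for v in scores["MT7Bi"] if v is not None]
mt7bi_partial = round(mean(mt7bi_reported), 1)

for model, avg in averages.items():
    print(f"{model:<22}{avg}")
print(f"MT7Bi (4 datasets, not comparable): {mt7bi_partial}")
```

The recomputed values match the Avg row for the five baseline models, which supports reading the row as a plain mean.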

ARC: 54.1%

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 1 | acc,none | 0.51 | |
| | | acc_stderr,none | 0.01 | |
| | | acc_norm,none | 0.54 | |
| | | acc_norm_stderr,none | 0.01 | |
| | | alias | arc_challenge | |

HellaSwag: 75.11%

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hellaswag | 1 | acc,none | 0.57 | |
| | | acc_stderr,none | 0 | |
| | | acc_norm,none | 0.75 | |
| | | acc_norm_stderr,none | 0 | |
| | | alias | hellaswag | |
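
The stderr of 0 reported for HellaSwag is most likely a rounding artifact rather than an exact zero. A rough check uses the binomial standard error sqrt(p(1-p)/n). The evaluation set sizes below (10,042 HellaSwag validation examples, 1,172 ARC-Challenge test examples) are assumptions taken from the public datasets, not from this card, and the harness may compute its stderr slightly differently.

```python
import math


def binomial_stderr(accuracy, n_examples):
    """Approximate standard error of an accuracy measured on n_examples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Assumed evaluation set sizes (not stated in this card).
N_HELLASWAG = 10042
N_ARC_CHALLENGE = 1172

hellaswag_se = binomial_stderr(0.57, N_HELLASWAG)       # acc
hellaswag_norm_se = binomial_stderr(0.75, N_HELLASWAG)  # acc_norm
arc_se = binomial_stderr(0.51, N_ARC_CHALLENGE)         # acc

print(f"HellaSwag acc stderr:      {hellaswag_se:.4f} -> {round(hellaswag_se, 2)}")
print(f"HellaSwag acc_norm stderr: {hellaswag_norm_se:.4f} -> {round(hellaswag_norm_se, 2)}")
print(f"ARC acc stderr:            {arc_se:.4f} -> {round(arc_se, 2)}")
```

Under these assumptions the HellaSwag standard errors are a little under 0.005, which rounds to 0 at two decimals, while the smaller ARC set gives roughly 0.015, which rounds to 0.01.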

TruthfulQA: 43.08%

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa | N/A | bleu_max,none | 18.31 | |
| | | bleu_max_stderr,none | 0.46 | |
| | | bleu_acc,none | 0.39 | |
| | | bleu_acc_stderr,none | 0 | |
| | | bleu_diff,none | -1.63 | |
| | | bleu_diff_stderr,none | 0.39 | |
| | | rouge1_max,none | 41.99 | |
| | | rouge1_max_stderr,none | 0.71 | |
| | | rouge1_acc,none | 0.39 | |
| | | rouge1_acc_stderr,none | 0 | |
| | | rouge1_diff,none | -2.88 | |
| | | rouge1_diff_stderr,none | 0.66 | |
| | | rouge2_max,none | 27.42 | |
| | | rouge2_max_stderr,none | 0.80 | |
| | | rouge2_acc,none | 0.32 | |
| | | rouge2_acc_stderr,none | 0 | |
| | | rouge2_diff,none | -3.11 | |
| | | rouge2_diff_stderr,none | 0.78 | |
| | | rougeL_max,none | 38.81 | |
| | | rougeL_max_stderr,none | 0.71 | |
| | | rougeL_acc,none | 0.38 | |
| | | rougeL_acc_stderr,none | 0 | |
| | | rougeL_diff,none | -3.01 | |
| | | rougeL_diff_stderr,none | 0.66 | |
| | | acc,none | 0.33 | |
| | | acc_stderr,none | 0.05 | |
| | | alias | truthfulqa | |
| truthfulqa_gen | 3 | bleu_max,none | 18.31 | |
| | | bleu_max_stderr,none | 0.68 | |
| | | bleu_acc,none | 0.39 | |
| | | bleu_acc_stderr,none | 0.02 | |
| | | bleu_diff,none | -1.63 | |
| | | bleu_diff_stderr,none | 0.62 | |
| | | rouge1_max,none | 41.99 | |
| | | rouge1_max_stderr,none | 0.84 | |
| | | rouge1_acc,none | 0.39 | |
| | | rouge1_acc_stderr,none | 0.02 | |
| | | rouge1_diff,none | -2.88 | |
| | | rouge1_diff_stderr,none | 0.81 | |
| | | rouge2_max,none | 27.42 | |
| | | rouge2_max_stderr,none | 0.89 | |
| | | rouge2_acc,none | 0.32 | |
| | | rouge2_acc_stderr,none | 0.02 | |
| | | rouge2_diff,none | -3.11 | |
| | | rouge2_diff_stderr,none | 0.88 | |
| | | rougeL_max,none | 38.81 | |
| | | rougeL_max_stderr,none | 0.84 | |
| | | rougeL_acc,none | 0.38 | |
| | | rougeL_acc_stderr,none | 0.02 | |
| | | rougeL_diff,none | -3.01 | |
| | | rougeL_diff_stderr,none | 0.82 | |
| | | alias | - truthfulqa_gen | |
| truthfulqa_mc1 | 2 | acc,none | 0.28 | |
| | | acc_stderr,none | 0.02 | |
| | | alias | - truthfulqa_mc1 | |
| truthfulqa_mc2 | 2 | acc,none | 0.43 | |
| | | acc_stderr,none | 0.01 | |
| | | alias | - truthfulqa_mc2 | |

Winogrande: 72.14%

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| winogrande | 1 | acc,none | 0.72 | |
| | | acc_stderr,none | 0.01 | |
| | | alias | winogrande | |

GSM8K: 15.54%

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| gsm8k | 2 | exact_match,get-answer | 0.16 | |
| | | exact_match_stderr,get-answer | 0.01 | |
| | | alias | gsm8k | |
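
Each headline percentage above appears to correspond to one metric row in its harness table. The mapping in this sketch is inferred from the numbers themselves, not stated by the card: normalized accuracy for ARC and HellaSwag, the `truthfulqa_mc2` accuracy for TruthfulQA, plain accuracy for Winogrande, and exact match for GSM8K. The check confirms that each headline, rounded to two decimals, equals the tabulated value.

```python
# benchmark -> (headline percentage, assumed harness metric, tabulated value)
results = {
    "ARC": (54.1, "arc_challenge acc_norm,none", 0.54),
    "HellaSwag": (75.11, "hellaswag acc_norm,none", 0.75),
    "TruthfulQA": (43.08, "truthfulqa_mc2 acc,none", 0.43),
    "Winogrande": (72.14, "winogrande acc,none", 0.72),
    "GSM8K": (15.54, "gsm8k exact_match,get-answer", 0.16),
}


def matches(headline_pct, harness_value):
    """True if the headline, as a fraction rounded to 2 decimals, equals the table value."""
    return abs(round(headline_pct / 100, 2) - harness_value) < 1e-9


consistent = {
    name: matches(headline, value)
    for name, (headline, _metric, value) in results.items()
}

for name, (headline, metric, value) in results.items():
    status = "ok" if consistent[name] else "mismatch"
    print(f"{name:<11}{headline:>6}%  ~  {metric} = {value}  [{status}]")
```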

Elapsed time: 04:06:36
