---
language:
- en
tags:
- medical
datasets:
- xzuyn/chatdoctor-200k-stripped
- Technoculture/riddle_sense
- axiong/pmc_llama_instructions
- Open-Orca/SlimOrca-Dedup
model-index:
- name: MT7Bi-sft
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: AI2 Reasoning Challenge (25-Shot)
type: ai2_arc
config: ARC-Challenge
split: test
args:
num_few_shot: 25
metrics:
- type: acc_norm
value: 41.81
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag (10-Shot)
type: hellaswag
split: validation
args:
num_few_shot: 10
metrics:
- type: acc_norm
value: 56.83
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU (5-Shot)
type: cais/mmlu
config: all
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 41.4
name: accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: TruthfulQA (0-shot)
type: truthful_qa
config: multiple_choice
split: validation
args:
num_few_shot: 0
metrics:
- type: mc2
value: 44.61
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: Winogrande (5-shot)
type: winogrande
config: winogrande_xl
split: validation
args:
num_few_shot: 5
metrics:
- type: acc
value: 60.46
name: accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GSM8k (5-shot)
type: gsm8k
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 0
name: accuracy
source:
url: >-
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
name: Open LLM Leaderboard
---

# MT7Bi-sft

MT7Bi-sft is the Technoculture/MT7Bi-alpha adapter merged into its base model, Meditron-7B.
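The merge itself can be reproduced with PEFT's `merge_and_unload`. The sketch below is illustrative only: it assumes `Technoculture/MT7Bi-alpha` is a LoRA-style PEFT adapter and that the Meditron-7B base is available on the Hub as `epfl-llm/meditron-7b` (this card does not state the exact base repo id).

```python
# Minimal sketch of the adapter merge, assuming Technoculture/MT7Bi-alpha is a
# PEFT (LoRA) adapter and that Meditron-7B lives at epfl-llm/meditron-7b (assumed id).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "epfl-llm/meditron-7b"          # assumed Hub id for the Meditron-7B base
ADAPTER_ID = "Technoculture/MT7Bi-alpha"  # adapter named in this card

# Load the base model, attach the adapter, then fold the LoRA weights into the
# base weights so the result behaves like a plain causal LM checkpoint.
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, ADAPTER_ID).merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
merged.save_pretrained("MT7Bi-sft")
tokenizer.save_pretrained("MT7Bi-sft")
```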
## Evaluations
### Open LLM Leaderboard
Model | ARC | HellaSwag | TruthfulQA | Winogrande | GSM8K |
---|---|---|---|---|---|
MT7Bi-sft (epoch 4) | 54.1 | 75.11 | 43.08 | 72.14 | 15.54 |
MT7Bi-sft (epoch 1) | 50.94 | 73.24 | 43.04 | 72.06 | 22.52 |
### Model Evaluation Benchmark
Category | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
---|---|---|---|---|---|---|---|
Health | 81.8 | 69.1 | 83.6 | | 27.3 | 16.4 | 3.6 |
Nutrition | 77.9 | 68.8 | 62.5 | | 31.1 | 12.5 | 6.3 |
Psychology | 47.4 | 36.8 | 52.6 | | 21.1 | 10.5 | 0.0 |
Science | 77.8 | 44.4 | 33.3 | | 33.3 | 11.1 | 0.0 |
Avg | 71.2 | 54.8 | 58.0 | | 28.3 | 12.6 | 2.5 |
Dataset | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | clinical-camel-70b* |
---|---|---|---|---|---|
MMLU-Medical | 46.9 | 77.6 | 77.9 | 74.5 | 65.7 |
PubMedQA | 65.2 | 81.6 | 80.0 | 61.2 | 67.0 |
MedMCQA | 42.7 | 66.0 | 62.6 | 59.2 | 46.7 |
MedQA | | 64.4 | 61.5 | 59.1 | 50.8 |
MedQA-4-Option | 44.3 | 70.2 | 63.8 | 63.9 | 56.8 |
Avg | | 72.0 | 69.2 | 63.6 | 57.4 |
Dataset | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* | MT7Bi |
---|---|---|---|---|---|---|
MMLU-Medical | 54.2 | 53.7 | 56.4 | 63.3 | 60.0 | 46.9 |
PubMedQA | 74.4 | 61.8 | 59.2 | 46.0 | 17.8 | 65.2 |
MedMCQA | 59.2 | 54.4 | 57.6 | 43.0 | 40.2 | 42.7 |
MedQA | 47.9 | 44.0 | 42.4 | 42.8 | 32.4 | |
MedQA-4-Option | 52.0 | 49.6 | 49.2 | 48.5 | 41.1 | 44.3 |
Avg | 57.5 | 52.7 | 53.0 | 48.7 | 38.3 | |
Model Name | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
---|---|---|---|---|---|---|
Orca-2-7b | 78.4 | 76.1 | 53.7 | 52.4 | 74.2 | 47.2 |
LLAMA-2-7b | 43.2 | 77.1 | 44.4 | 38.7 | 69.5 | 16 |
MT7Bi-sft | 54.1 | 75.11 | - | 43.08 | 72.14 | 15.54 |
### ARC: 54.1%
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 1 | acc,none | 0.51 | 0.01 |
| | | acc_norm,none | 0.54 | 0.01 |
| | | alias | arc_challenge | |
### HellaSwag: 75.11%
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hellaswag | 1 | acc,none | 0.57 | 0 |
| | | acc_norm,none | 0.75 | 0 |
| | | alias | hellaswag | |
### TruthfulQA: 43.08%
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa | N/A | bleu_max,none | 18.31 | 0.46 |
| | | bleu_acc,none | 0.39 | 0 |
| | | bleu_diff,none | -1.63 | 0.39 |
| | | rouge1_max,none | 41.99 | 0.71 |
| | | rouge1_acc,none | 0.39 | 0 |
| | | rouge1_diff,none | -2.88 | 0.66 |
| | | rouge2_max,none | 27.42 | 0.80 |
| | | rouge2_acc,none | 0.32 | 0 |
| | | rouge2_diff,none | -3.11 | 0.78 |
| | | rougeL_max,none | 38.81 | 0.71 |
| | | rougeL_acc,none | 0.38 | 0 |
| | | rougeL_diff,none | -3.01 | 0.66 |
| | | acc,none | 0.33 | 0.05 |
| | | alias | truthfulqa | |
| truthfulqa_gen | 3 | bleu_max,none | 18.31 | 0.68 |
| | | bleu_acc,none | 0.39 | 0.02 |
| | | bleu_diff,none | -1.63 | 0.62 |
| | | rouge1_max,none | 41.99 | 0.84 |
| | | rouge1_acc,none | 0.39 | 0.02 |
| | | rouge1_diff,none | -2.88 | 0.81 |
| | | rouge2_max,none | 27.42 | 0.89 |
| | | rouge2_acc,none | 0.32 | 0.02 |
| | | rouge2_diff,none | -3.11 | 0.88 |
| | | rougeL_max,none | 38.81 | 0.84 |
| | | rougeL_acc,none | 0.38 | 0.02 |
| | | rougeL_diff,none | -3.01 | 0.82 |
| | | alias | - truthfulqa_gen | |
| truthfulqa_mc1 | 2 | acc,none | 0.28 | 0.02 |
| | | alias | - truthfulqa_mc1 | |
| truthfulqa_mc2 | 2 | acc,none | 0.43 | 0.01 |
| | | alias | - truthfulqa_mc2 | |
### Winogrande: 72.14%
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| winogrande | 1 | acc,none | 0.72 | 0.01 |
| | | alias | winogrande | |
### GSM8K: 15.54%
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| gsm8k | 2 | exact_match,get-answer | 0.16 | 0.01 |
| | | alias | gsm8k | |
Elapsed time: 04:06:36
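The per-task tables above follow the output format of EleutherAI's lm-evaluation-harness (metric keys such as `acc,none` and `exact_match,get-answer`). A rough sketch of how comparable numbers could be re-run is shown below; the few-shot settings mirror the leaderboard configuration in the metadata, while the harness version, batch size, and dtype are assumptions not stated in this card.

```python
# Sketch of re-running the benchmarks with EleutherAI's lm-evaluation-harness.
# Assumes lm-eval >= 0.4 (pip install lm-eval); the exact version, batch size,
# and dtype used for the numbers in this card are not stated and are guesses.
import lm_eval

# (task, num_fewshot) pairs matching the leaderboard settings in the metadata.
TASKS = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, shots in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Technoculture/MT7Bi-sft,dtype=float16",
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,
    )
    # results["results"][task] holds keys like "acc,none" / "acc_norm,none",
    # which is where the values in the tables above come from.
    print(task, results["results"][task])
```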
## Open LLM Leaderboard Evaluation Results
Detailed results can be found on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft).
Metric | Value |
---|---|
Avg. | 40.85 |
AI2 Reasoning Challenge (25-Shot) | 41.81 |
HellaSwag (10-Shot) | 56.83 |
MMLU (5-Shot) | 41.40 |
TruthfulQA (0-shot) | 44.61 |
Winogrande (5-shot) | 60.46 |
GSM8k (5-shot) | 0.00 |