---
language:
  - en
tags:
  - medical
datasets:
  - xzuyn/chatdoctor-200k-stripped
  - Technoculture/riddle_sense
  - axiong/pmc_llama_instructions
  - Open-Orca/SlimOrca-Dedup
model-index:
  - name: MT7Bi-sft
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 41.81
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 56.83
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 41.4
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 44.61
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 60.46
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 0
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
---


MT7Bi-sft is the Technoculture/MT7Bi-alpha adapter merged with its base model, Meditron-7B.
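
For readers who want to reproduce a merge like this, below is a minimal sketch using the usual PEFT recipe for folding a LoRA adapter into its base model. The Meditron-7B repo id (`epfl-llm/meditron-7b`) and all merge settings are assumptions; the card does not document the exact procedure used.

```python
# Minimal sketch of a LoRA adapter merge with PEFT (assumed procedure;
# the card does not state the exact settings used to produce MT7Bi-sft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("epfl-llm/meditron-7b")  # assumed base repo id
model = PeftModel.from_pretrained(base, "Technoculture/MT7Bi-alpha")  # adapter named by this card
model = model.merge_and_unload()  # fold the adapter weights into the base weights

tokenizer = AutoTokenizer.from_pretrained("epfl-llm/meditron-7b")
model.save_pretrained("MT7Bi-sft")
tokenizer.save_pretrained("MT7Bi-sft")
```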

# Evaluations

## Open LLM Leaderboard

| Model               | ARC   | HellaSwag | TruthfulQA | Winogrande | GSM8K |
| ------------------- | ----- | --------- | ---------- | ---------- | ----- |
| MT7Bi-sft (epoch 4) | 54.1  | 75.11     | 43.08      | 72.14      | 15.54 |
| MT7Bi-sft (epoch 1) | 50.94 | 73.24     | 43.04      | 72.06      | 22.52 |

## Model Evaluation Benchmark

| Category   | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
| ---------- | ----- | ------------ | ----------- | ---------- | ----------- | ---------- | ------------ |
| Health     | 81.8  | 69.1         | 83.6        | -          | 27.3        | 16.4       | 3.6          |
| Nutrition  | 77.9  | 68.8         | 62.5        | -          | 31.1        | 12.5       | 6.3          |
| Psychology | 47.4  | 36.8         | 52.6        | -          | 21.1        | 10.5       | 0.0          |
| Science    | 77.8  | 44.4         | 33.3        | -          | 33.3        | 11.1       | 0.0          |
| Avg        | 71.2  | 54.8         | 58.0        | -          | 28.3        | 12.6       | 2.5          |

| Dataset        | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | clinical-camel-70b* |
| -------------- | ----- | ------------ | ----------- | ---------- | ------------------- |
| MMLU-Medical   | 46.9  | 77.6         | 77.9        | 74.5       | 65.7                |
| PubMedQA       | 65.2  | 81.6         | 80.0        | 61.2       | 67.0                |
| MedMCQA        | 42.7  | 66.0         | 62.6        | 59.2       | 46.7                |
| MedQA          | -     | 64.4         | 61.5        | 59.1       | 50.8                |
| MedQA-4-Option | 44.3  | 70.2         | 63.8        | 63.9       | 56.8                |
| Avg            | -     | 72.0         | 69.2        | 63.6       | 57.4                |

| Dataset        | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* | MT7Bi |
| -------------- | ----------- | ---------- | ------------ | --------------- | -------------------- | ----- |
| MMLU-Medical   | 54.2        | 53.7       | 56.4         | 63.3            | 60.0                 | 46.9  |
| PubMedQA       | 74.4        | 61.8       | 59.2         | 46.0            | 17.8                 | 65.2  |
| MedMCQA        | 59.2        | 54.4       | 57.6         | 43.0            | 40.2                 | 42.7  |
| MedQA          | 47.9        | 44.0       | 42.4         | 42.8            | 32.4                 | -     |
| MedQA-4-Option | 52.0        | 49.6       | 49.2         | 48.5            | 41.1                 | 44.3  |
| Avg            | 57.5        | 52.7       | 53.0         | 48.7            | 38.3                 | -     |

| Model Name | ARC  | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| ---------- | ---- | --------- | ---- | ---------- | ---------- | ----- |
| Orca-2-7b  | 78.4 | 76.1      | 53.7 | 52.4       | 74.2       | 47.2  |
| LLAMA-2-7b | 43.2 | 77.1      | 44.4 | 38.7       | 69.5       | 16    |
| MT7Bi-sft  | 54.1 | 75.11     | -    | 43.08      | 72.14      | 15.54 |

### ARC: 54.1%

| Task          | Version | Metric       | Value         | Stderr |
| ------------- | ------- | ------------ | ------------- | ------ |
| arc_challenge | 1       | acc,none     | 0.51          | 0.01   |
|               |         | acc_norm,none| 0.54          | 0.01   |
|               |         | alias        | arc_challenge |        |

### HellaSwag: 75.11%

| Task      | Version | Metric       | Value     | Stderr |
| --------- | ------- | ------------ | --------- | ------ |
| hellaswag | 1       | acc,none     | 0.57      | 0      |
|           |         | acc_norm,none| 0.75      | 0      |
|           |         | alias        | hellaswag |        |

### TruthfulQA: 43.08%

| Task           | Version | Metric          | Value            | Stderr |
| -------------- | ------- | --------------- | ---------------- | ------ |
| truthfulqa     | N/A     | bleu_max,none   | 18.31            | 0.46   |
|                |         | bleu_acc,none   | 0.39             | 0      |
|                |         | bleu_diff,none  | -1.63            | 0.39   |
|                |         | rouge1_max,none | 41.99            | 0.71   |
|                |         | rouge1_acc,none | 0.39             | 0      |
|                |         | rouge1_diff,none| -2.88            | 0.66   |
|                |         | rouge2_max,none | 27.42            | 0.80   |
|                |         | rouge2_acc,none | 0.32             | 0      |
|                |         | rouge2_diff,none| -3.11            | 0.78   |
|                |         | rougeL_max,none | 38.81            | 0.71   |
|                |         | rougeL_acc,none | 0.38             | 0      |
|                |         | rougeL_diff,none| -3.01            | 0.66   |
|                |         | acc,none        | 0.33             | 0.05   |
|                |         | alias           | truthfulqa       |        |
| truthfulqa_gen | 3       | bleu_max,none   | 18.31            | 0.68   |
|                |         | bleu_acc,none   | 0.39             | 0.02   |
|                |         | bleu_diff,none  | -1.63            | 0.62   |
|                |         | rouge1_max,none | 41.99            | 0.84   |
|                |         | rouge1_acc,none | 0.39             | 0.02   |
|                |         | rouge1_diff,none| -2.88            | 0.81   |
|                |         | rouge2_max,none | 27.42            | 0.89   |
|                |         | rouge2_acc,none | 0.32             | 0.02   |
|                |         | rouge2_diff,none| -3.11            | 0.88   |
|                |         | rougeL_max,none | 38.81            | 0.84   |
|                |         | rougeL_acc,none | 0.38             | 0.02   |
|                |         | rougeL_diff,none| -3.01            | 0.82   |
|                |         | alias           | - truthfulqa_gen |        |
| truthfulqa_mc1 | 2       | acc,none        | 0.28             | 0.02   |
|                |         | alias           | - truthfulqa_mc1 |        |
| truthfulqa_mc2 | 2       | acc,none        | 0.43             | 0.01   |
|                |         | alias           | - truthfulqa_mc2 |        |

### Winogrande: 72.14%

| Task       | Version | Metric   | Value      | Stderr |
| ---------- | ------- | -------- | ---------- | ------ |
| winogrande | 1       | acc,none | 0.72       | 0.01   |
|            |         | alias    | winogrande |        |

### GSM8K: 15.54%

| Task  | Version | Metric                 | Value | Stderr |
| ----- | ------- | ---------------------- | ----- | ------ |
| gsm8k | 2       | exact_match,get-answer | 0.16  | 0.01   |
|       |         | alias                  | gsm8k |        |

Elapsed time: 04:06:36
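
The per-task tables above are in lm-evaluation-harness output format. As a rough sketch of how numbers in this shape are produced (the harness version, few-shot settings, and exact flags used for this card are not stated, so treat the defaults below as assumptions):

```python
# Hedged sketch: run harness-style evaluations for this model.
# Task names and settings are assumptions; the card does not record
# the harness version or the exact flags that were used.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Technoculture/MT7Bi-sft",
    tasks=["arc_challenge", "hellaswag", "truthfulqa", "winogrande", "gsm8k"],
    batch_size="auto",
)
print(results["results"])  # per-task metrics, values, and stderrs
```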

# Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Technoculture__MT7Bi-sft).
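
The leaderboard publishes per-sample results as a dataset, so they can also be inspected programmatically. A hedged sketch follows; the repo id is derived from the leaderboard's naming convention, and the config and split names are assumptions based on the leaderboard's usual layout:

```python
# Hedged sketch: pull the leaderboard's published results for this model.
# Config name ("harness_winogrande_5") and split ("latest") follow the
# leaderboard's typical layout and are assumptions, not taken from this card.
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_Technoculture__MT7Bi-sft",
    "harness_winogrande_5",  # one per-task config among several
    split="latest",
)
print(details[0])
```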

| Metric                            | Value |
| --------------------------------- | ----- |
| Avg.                              | 40.85 |
| AI2 Reasoning Challenge (25-Shot) | 41.81 |
| HellaSwag (10-Shot)               | 56.83 |
| MMLU (5-Shot)                     | 41.40 |
| TruthfulQA (0-shot)               | 44.61 |
| Winogrande (5-shot)               | 60.46 |
| GSM8k (5-shot)                    | 0.00  |
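
As a quick sanity check, the reported average is the plain arithmetic mean of the six benchmark scores:

```python
# The leaderboard "Avg." is the mean of the six benchmark scores above.
scores = [41.81, 56.83, 41.40, 44.61, 60.46, 0.00]
print(round(sum(scores) / len(scores), 2))  # 40.85
```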