---
language:
  - en
tags:
  - medical
datasets:
  - xzuyn/chatdoctor-200k-stripped
  - Technoculture/riddle_sense
  - axiong/pmc_llama_instructions
  - Open-Orca/SlimOrca-Dedup
model-index:
  - name: MT7Bi-sft
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 41.81
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 56.83
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 41.4
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 44.61
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 60.46
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 0
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MT7Bi-sft
          name: Open LLM Leaderboard
---


MT7Bi-sft is the Technoculture/MT7Bi-alpha adapter merged with its base model, Meditron-7B.
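
For readers who want to reproduce a merge like this, below is a minimal sketch using the usual PEFT recipe for folding a LoRA adapter into its base model. The Meditron-7B repo id (`epfl-llm/meditron-7b`) and all merge settings are assumptions; the card does not document the exact procedure used.

```python
# Minimal sketch of a LoRA adapter merge with PEFT (assumed procedure;
# the card does not state the exact settings used to produce MT7Bi-sft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("epfl-llm/meditron-7b")  # assumed base repo id
model = PeftModel.from_pretrained(base, "Technoculture/MT7Bi-alpha")  # adapter named by this card
model = model.merge_and_unload()  # fold the adapter weights into the base weights

tokenizer = AutoTokenizer.from_pretrained("epfl-llm/meditron-7b")
model.save_pretrained("MT7Bi-sft")
tokenizer.save_pretrained("MT7Bi-sft")
```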

# Evaluations

## Open LLM Leaderboard

| Model               | ARC   | HellaSwag | TruthfulQA | Winogrande | GSM8K |
| ------------------- | ----- | --------- | ---------- | ---------- | ----- |
| MT7Bi-sft (epoch 4) | 54.1  | 75.11     | 43.08      | 72.14      | 15.54 |
| MT7Bi-sft (epoch 1) | 50.94 | 73.24     | 43.04      | 72.06      | 22.52 |

## Model Evaluation Benchmark

| Category   | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
| ---------- | ----- | ------------ | ----------- | ---------- | ----------- | ---------- | ------------ |
| Health     | 81.8  | 69.1         | 83.6        | -          | 27.3        | 16.4       | 3.6          |
| Nutrition  | 77.9  | 68.8         | 62.5        | -          | 31.1        | 12.5       | 6.3          |
| Psychology | 47.4  | 36.8         | 52.6        | -          | 21.1        | 10.5       | 0.0          |
| Science    | 77.8  | 44.4         | 33.3        | -          | 33.3        | 11.1       | 0.0          |
| Avg        | 71.2  | 54.8         | 58.0        | -          | 28.3        | 12.6       | 2.5          |

| Dataset        | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | clinical-camel-70b* |
| -------------- | ----- | ------------ | ----------- | ---------- | ------------------- |
| MMLU-Medical   | 46.9  | 77.6         | 77.9        | 74.5       | 65.7                |
| PubMedQA       | 65.2  | 81.6         | 80.0        | 61.2       | 67.0                |
| MedMCQA        | 42.7  | 66.0         | 62.6        | 59.2       | 46.7                |
| MedQA          | -     | 64.4         | 61.5        | 59.1       | 50.8                |
| MedQA-4-Option | 44.3  | 70.2         | 63.8        | 63.9       | 56.8                |
| Avg            | -     | 72.0         | 69.2        | 63.6       | 57.4                |

| Dataset        | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* | MT7Bi |
| -------------- | ----------- | ---------- | ------------ | --------------- | -------------------- | ----- |
| MMLU-Medical   | 54.2        | 53.7       | 56.4         | 63.3            | 60.0                 | 46.9  |
| PubMedQA       | 74.4        | 61.8       | 59.2         | 46.0            | 17.8                 | 65.2  |
| MedMCQA        | 59.2        | 54.4       | 57.6         | 43.0            | 40.2                 | 42.7  |
| MedQA          | 47.9        | 44.0       | 42.4         | 42.8            | 32.4                 | -     |
| MedQA-4-Option | 52.0        | 49.6       | 49.2         | 48.5            | 41.1                 | 44.3  |
| Avg            | 57.5        | 52.7       | 53.0         | 48.7            | 38.3                 | -     |

| Model Name | ARC  | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| ---------- | ---- | --------- | ---- | ---------- | ---------- | ----- |
| Orca-2-7b  | 78.4 | 76.1      | 53.7 | 52.4       | 74.2       | 47.2  |
| LLAMA-2-7b | 43.2 | 77.1      | 44.4 | 38.7       | 69.5       | 16    |
| MT7Bi-sft  | 54.1 | 75.11     | -    | 43.08      | 72.14      | 15.54 |

### ARC: 54.1%

| Task          | Version | Metric       | Value         | Stderr |
| ------------- | ------- | ------------ | ------------- | ------ |
| arc_challenge | 1       | acc,none     | 0.51          | 0.01   |
|               |         | acc_norm,none| 0.54          | 0.01   |
|               |         | alias        | arc_challenge |        |

### HellaSwag: 75.11%

| Task      | Version | Metric       | Value     | Stderr |
| --------- | ------- | ------------ | --------- | ------ |
| hellaswag | 1       | acc,none     | 0.57      | 0      |
|           |         | acc_norm,none| 0.75      | 0      |
|           |         | alias        | hellaswag |        |

### TruthfulQA: 43.08%

| Task           | Version | Metric          | Value            | Stderr |
| -------------- | ------- | --------------- | ---------------- | ------ |
| truthfulqa     | N/A     | bleu_max,none   | 18.31            | 0.46   |
|                |         | bleu_acc,none   | 0.39             | 0      |
|                |         | bleu_diff,none  | -1.63            | 0.39   |
|                |         | rouge1_max,none | 41.99            | 0.71   |
|                |         | rouge1_acc,none | 0.39             | 0      |
|                |         | rouge1_diff,none| -2.88            | 0.66   |
|                |         | rouge2_max,none | 27.42            | 0.80   |
|                |         | rouge2_acc,none | 0.32             | 0      |
|                |         | rouge2_diff,none| -3.11            | 0.78   |
|                |         | rougeL_max,none | 38.81            | 0.71   |
|                |         | rougeL_acc,none | 0.38             | 0      |
|                |         | rougeL_diff,none| -3.01            | 0.66   |
|                |         | acc,none        | 0.33             | 0.05   |
|                |         | alias           | truthfulqa       |        |
| truthfulqa_gen | 3       | bleu_max,none   | 18.31            | 0.68   |
|                |         | bleu_acc,none   | 0.39             | 0.02   |
|                |         | bleu_diff,none  | -1.63            | 0.62   |
|                |         | rouge1_max,none | 41.99            | 0.84   |
|                |         | rouge1_acc,none | 0.39             | 0.02   |
|                |         | rouge1_diff,none| -2.88            | 0.81   |
|                |         | rouge2_max,none | 27.42            | 0.89   |
|                |         | rouge2_acc,none | 0.32             | 0.02   |
|                |         | rouge2_diff,none| -3.11            | 0.88   |
|                |         | rougeL_max,none | 38.81            | 0.84   |
|                |         | rougeL_acc,none | 0.38             | 0.02   |
|                |         | rougeL_diff,none| -3.01            | 0.82   |
|                |         | alias           | - truthfulqa_gen |        |
| truthfulqa_mc1 | 2       | acc,none        | 0.28             | 0.02   |
|                |         | alias           | - truthfulqa_mc1 |        |
| truthfulqa_mc2 | 2       | acc,none        | 0.43             | 0.01   |
|                |         | alias           | - truthfulqa_mc2 |        |

### Winogrande: 72.14%

| Task       | Version | Metric   | Value      | Stderr |
| ---------- | ------- | -------- | ---------- | ------ |
| winogrande | 1       | acc,none | 0.72       | 0.01   |
|            |         | alias    | winogrande |        |

### GSM8K: 15.54%

| Task  | Version | Metric                 | Value | Stderr |
| ----- | ------- | ---------------------- | ----- | ------ |
| gsm8k | 2       | exact_match,get-answer | 0.16  | 0.01   |
|       |         | alias                  | gsm8k |        |

Elapsed time: 04:06:36
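
The per-task tables above are in lm-evaluation-harness output format. As a rough sketch of how numbers in this shape are produced (the harness version, few-shot settings, and exact flags used for this card are not stated, so treat the defaults below as assumptions):

```python
# Hedged sketch: run harness-style evaluations for this model.
# Task names and settings are assumptions; the card does not record
# the harness version or the exact flags that were used.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Technoculture/MT7Bi-sft",
    tasks=["arc_challenge", "hellaswag", "truthfulqa", "winogrande", "gsm8k"],
    batch_size="auto",
)
print(results["results"])  # per-task metrics, values, and stderrs
```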

# Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Technoculture__MT7Bi-sft).
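
The leaderboard publishes per-sample results as a dataset, so they can also be inspected programmatically. A hedged sketch follows; the repo id is derived from the leaderboard's naming convention, and the config and split names are assumptions based on the leaderboard's usual layout:

```python
# Hedged sketch: pull the leaderboard's published results for this model.
# Config name ("harness_winogrande_5") and split ("latest") follow the
# leaderboard's typical layout and are assumptions, not taken from this card.
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_Technoculture__MT7Bi-sft",
    "harness_winogrande_5",  # one per-task config among several
    split="latest",
)
print(details[0])
```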

| Metric                            | Value |
| --------------------------------- | ----- |
| Avg.                              | 40.85 |
| AI2 Reasoning Challenge (25-Shot) | 41.81 |
| HellaSwag (10-Shot)               | 56.83 |
| MMLU (5-Shot)                     | 41.40 |
| TruthfulQA (0-shot)               | 44.61 |
| Winogrande (5-shot)               | 60.46 |
| GSM8k (5-shot)                    | 0.00  |
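
As a quick sanity check, the reported average is the plain arithmetic mean of the six benchmark scores:

```python
# The leaderboard "Avg." is the mean of the six benchmark scores above.
scores = [41.81, 56.83, 41.40, 44.61, 60.46, 0.00]
print(round(sum(scores) / len(scores), 2))  # 40.85
```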