---
license: cc-by-nc-4.0
library_name: transformers
model-index:
  - name: SOLAR-10.7b-Instruct-dpo
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 71.76
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 88.08
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 66.06
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 71.98
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 82.32
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 61.03
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
---

# SOLAR-10.7b-Instruct-dpo


This model is a fine-tune of upstage/SOLAR-10.7B-Instruct-v1.0, trained with DPO on the Intel/orca_dpo_pairs dataset.
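
For context, a preference-tuning run of this kind can be set up with the `trl` library's `DPOTrainer`. The sketch below is illustrative only, not the exact recipe used for this model: the hyperparameters, the prompt join, and the trl version (~0.7-style API) are assumptions.

```python
# Illustrative DPO sketch on Intel/orca_dpo_pairs with trl.
# NOT the exact recipe for this model; hyperparameters and prompt formatting are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "upstage/SOLAR-10.7B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

# Intel/orca_dpo_pairs ships system/question/chosen/rejected columns;
# DPOTrainer expects prompt/chosen/rejected.
def to_dpo_format(row):
    return {
        "prompt": f"{row['system']}\n{row['question']}",
        "chosen": row["chosen"],
        "rejected": row["rejected"],
    }

train_dataset = load_dataset("Intel/orca_dpo_pairs", split="train").map(to_dpo_format)

trainer = DPOTrainer(
    model,
    ref_model=None,  # trl keeps a frozen copy of the policy as the reference model
    beta=0.1,        # assumed; strength of the penalty pulling the policy toward the reference
    args=TrainingArguments(
        output_dir="solar-10.7b-instruct-dpo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```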

## Chat Template

This model follows the ChatML chat template.
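
ChatML wraps each turn in `<|im_start|>{role}` / `<|im_end|>` markers. Below is a minimal inference sketch; it assumes the tokenizer ships a ChatML `chat_template` (otherwise the prompt can be assembled by hand in that format), and the generation settings are arbitrary examples.

```python
# Minimal inference sketch (assumes the tokenizer provides a ChatML chat_template).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "macadeliccc/SOLAR-10.7b-Instruct-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO fine-tuning in one sentence."},
]

# apply_chat_template renders the ChatML prompt and tokenizes it
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```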

## Evaluations

### EQ-Bench comparison with the base model

These scores are the average of 3 iterations.

```
----Benchmark Complete----
2024-01-25 04:41:01
Time taken: 236.1 mins
Prompt Format: ChatML
Model: macadeliccc/SOLAR-10.7b-Instruct-dpo
Score (v2): 72.79
Parseable: 165.67
Batch completed
Time taken: 236.1 mins
```

Compared to the base model:

```
----Benchmark Complete----
2024-01-25 08:45:02
Time taken: 244.0 mins
Prompt Format: ChatML
Model: upstage/SOLAR-10.7B-Instruct-v1.0
Score (v2): 71.03
Parseable: 165.67
Batch completed
Time taken: 480.1 mins
```

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|-------|---------|---------|------------|----------|---------|
| SOLAR-10.7b-Instruct-dpo | 47.57 | 74.3 | 72.73 | 45.76 | 60.09 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| agieval_aqua_rat | 0 | acc | 27.56 | ± 2.81 |
| | | acc_norm | 26.77 | ± 2.78 |
| agieval_logiqa_en | 0 | acc | 41.63 | ± 1.93 |
| | | acc_norm | 41.32 | ± 1.93 |
| agieval_lsat_ar | 0 | acc | 25.22 | ± 2.87 |
| | | acc_norm | 24.35 | ± 2.84 |
| agieval_lsat_lr | 0 | acc | 54.12 | ± 2.21 |
| | | acc_norm | 54.31 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 68.77 | ± 2.83 |
| | | acc_norm | 69.14 | ± 2.82 |
| agieval_sat_en | 0 | acc | 79.13 | ± 2.84 |
| | | acc_norm | 79.13 | ± 2.84 |
| agieval_sat_en_without_passage | 0 | acc | 44.66 | ± 3.47 |
| | | acc_norm | 44.66 | ± 3.47 |
| agieval_sat_math | 0 | acc | 40.45 | ± 3.32 |
| | | acc_norm | 40.91 | ± 3.32 |

Average: 47.57%

### GPT4All

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| arc_challenge | 0 | acc | 60.49 | ± 1.43 |
| | | acc_norm | 63.74 | ± 1.40 |
| arc_easy | 0 | acc | 82.07 | ± 0.79 |
| | | acc_norm | 79.92 | ± 0.82 |
| boolq | 1 | acc | 88.56 | ± 0.56 |
| hellaswag | 0 | acc | 68.47 | ± 0.46 |
| | | acc_norm | 86.06 | ± 0.35 |
| openbookqa | 0 | acc | 36.20 | ± 2.15 |
| | | acc_norm | 46.60 | ± 2.23 |
| piqa | 0 | acc | 79.38 | ± 0.94 |
| | | acc_norm | 79.71 | ± 0.94 |
| winogrande | 0 | acc | 75.53 | ± 1.21 |

Average: 74.3%

### TruthfulQA

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| truthfulqa_mc | 1 | mc1 | 57.77 | ± 1.73 |
| | | mc2 | 72.73 | ± 1.49 |

Average: 72.73%

### Bigbench

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 55.26 | ± 3.62 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 62.87 | ± 2.52 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 46.51 | ± 3.11 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 25.63 | ± 2.31 |
| | | exact_str_match | 0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 28.00 | ± 2.01 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.57 | ± 1.53 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 46.67 | ± 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 41.80 | ± 2.21 |
| bigbench_navigate | 0 | multiple_choice_grade | 64.00 | ± 1.52 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 60.00 | ± 1.10 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 39.96 | ± 2.32 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 47.90 | ± 1.58 |
| bigbench_snarks | 0 | multiple_choice_grade | 64.09 | ± 3.58 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 71.10 | ± 1.44 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 59.90 | ± 1.55 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 24.96 | ± 1.22 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.89 | ± 0.92 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 46.67 | ± 2.89 |

Average: 45.76%

Average score: 60.09%

Elapsed time: 02:10:16

## Open LLM Leaderboard Evaluation Results

Detailed results can be found on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo).

| Metric | Value |
|--------|-------|
| Avg. | 73.54 |
| AI2 Reasoning Challenge (25-Shot) | 71.76 |
| HellaSwag (10-Shot) | 88.08 |
| MMLU (5-Shot) | 66.06 |
| TruthfulQA (0-shot) | 71.98 |
| Winogrande (5-shot) | 82.32 |
| GSM8k (5-shot) | 61.03 |
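
These scores come from the leaderboard's lm-evaluation-harness runs. A rough local reproduction of a single task could look like the sketch below; it assumes lm-eval 0.4.x, and the leaderboard pins its own harness revision and prompts, so exact numbers may differ.

```python
# Rough local reproduction of one leaderboard task with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Version and settings are
# assumptions; the leaderboard pins its own, so scores may not match exactly.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=macadeliccc/SOLAR-10.7b-Instruct-dpo,dtype=bfloat16",
    tasks=["hellaswag"],   # HellaSwag (10-shot) row in the table above
    num_fewshot=10,
    batch_size=8,
)
print(results["results"]["hellaswag"])
```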