---
license: cc-by-nc-4.0
library_name: transformers
model-index:
  - name: SOLAR-10.7b-Instruct-dpo
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 71.76
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 88.08
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 66.06
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 71.98
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 82.32
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 61.03
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo
          name: Open LLM Leaderboard
---

# SOLAR-10.7b-Instruct-dpo


This model is a fine-tune of upstage/SOLAR-10.7B-Instruct-v1.0, trained with DPO on the Intel/orca_dpo_pairs dataset.
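
For context, a preference-tuning run of this kind can be set up with the `trl` library's `DPOTrainer`. The sketch below is illustrative only, not the exact recipe used for this model: the hyperparameters, the prompt join, and the trl version (~0.7-style API) are assumptions.

```python
# Illustrative DPO sketch on Intel/orca_dpo_pairs with trl.
# NOT the exact recipe for this model; hyperparameters and prompt formatting are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "upstage/SOLAR-10.7B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

# Intel/orca_dpo_pairs ships system/question/chosen/rejected columns;
# DPOTrainer expects prompt/chosen/rejected.
def to_dpo_format(row):
    return {
        "prompt": f"{row['system']}\n{row['question']}",
        "chosen": row["chosen"],
        "rejected": row["rejected"],
    }

train_dataset = load_dataset("Intel/orca_dpo_pairs", split="train").map(to_dpo_format)

trainer = DPOTrainer(
    model,
    ref_model=None,  # trl keeps a frozen copy of the policy as the reference model
    beta=0.1,        # assumed; strength of the penalty pulling the policy toward the reference
    args=TrainingArguments(
        output_dir="solar-10.7b-instruct-dpo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```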

## Chat Template

This model follows the ChatML chat template.
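
ChatML wraps each turn in `<|im_start|>{role}` / `<|im_end|>` markers. Below is a minimal inference sketch; it assumes the tokenizer ships a ChatML `chat_template` (otherwise the prompt can be assembled by hand in that format), and the generation settings are arbitrary examples.

```python
# Minimal inference sketch (assumes the tokenizer provides a ChatML chat_template).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "macadeliccc/SOLAR-10.7b-Instruct-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO fine-tuning in one sentence."},
]

# apply_chat_template renders the ChatML prompt and tokenizes it
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```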

## Evaluations

### EQ-Bench comparison with the base model

These scores are the average of 3 iterations.

```
----Benchmark Complete----
2024-01-25 04:41:01
Time taken: 236.1 mins
Prompt Format: ChatML
Model: macadeliccc/SOLAR-10.7b-Instruct-dpo
Score (v2): 72.79
Parseable: 165.67
Batch completed
Time taken: 236.1 mins
```

Compared to the base model:

```
----Benchmark Complete----
2024-01-25 08:45:02
Time taken: 244.0 mins
Prompt Format: ChatML
Model: upstage/SOLAR-10.7B-Instruct-v1.0
Score (v2): 71.03
Parseable: 165.67
Batch completed
Time taken: 480.1 mins
```

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|-------|---------|---------|------------|----------|---------|
| SOLAR-10.7b-Instruct-dpo | 47.57 | 74.3 | 72.73 | 45.76 | 60.09 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| agieval_aqua_rat | 0 | acc | 27.56 | ± 2.81 |
| | | acc_norm | 26.77 | ± 2.78 |
| agieval_logiqa_en | 0 | acc | 41.63 | ± 1.93 |
| | | acc_norm | 41.32 | ± 1.93 |
| agieval_lsat_ar | 0 | acc | 25.22 | ± 2.87 |
| | | acc_norm | 24.35 | ± 2.84 |
| agieval_lsat_lr | 0 | acc | 54.12 | ± 2.21 |
| | | acc_norm | 54.31 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 68.77 | ± 2.83 |
| | | acc_norm | 69.14 | ± 2.82 |
| agieval_sat_en | 0 | acc | 79.13 | ± 2.84 |
| | | acc_norm | 79.13 | ± 2.84 |
| agieval_sat_en_without_passage | 0 | acc | 44.66 | ± 3.47 |
| | | acc_norm | 44.66 | ± 3.47 |
| agieval_sat_math | 0 | acc | 40.45 | ± 3.32 |
| | | acc_norm | 40.91 | ± 3.32 |

Average: 47.57%

### GPT4All

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| arc_challenge | 0 | acc | 60.49 | ± 1.43 |
| | | acc_norm | 63.74 | ± 1.40 |
| arc_easy | 0 | acc | 82.07 | ± 0.79 |
| | | acc_norm | 79.92 | ± 0.82 |
| boolq | 1 | acc | 88.56 | ± 0.56 |
| hellaswag | 0 | acc | 68.47 | ± 0.46 |
| | | acc_norm | 86.06 | ± 0.35 |
| openbookqa | 0 | acc | 36.20 | ± 2.15 |
| | | acc_norm | 46.60 | ± 2.23 |
| piqa | 0 | acc | 79.38 | ± 0.94 |
| | | acc_norm | 79.71 | ± 0.94 |
| winogrande | 0 | acc | 75.53 | ± 1.21 |

Average: 74.3%

### TruthfulQA

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| truthfulqa_mc | 1 | mc1 | 57.77 | ± 1.73 |
| | | mc2 | 72.73 | ± 1.49 |

Average: 72.73%

### Bigbench

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 55.26 | ± 3.62 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 62.87 | ± 2.52 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 46.51 | ± 3.11 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 25.63 | ± 2.31 |
| | | exact_str_match | 0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 28.00 | ± 2.01 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.57 | ± 1.53 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 46.67 | ± 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 41.80 | ± 2.21 |
| bigbench_navigate | 0 | multiple_choice_grade | 64.00 | ± 1.52 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 60.00 | ± 1.10 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 39.96 | ± 2.32 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 47.90 | ± 1.58 |
| bigbench_snarks | 0 | multiple_choice_grade | 64.09 | ± 3.58 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 71.10 | ± 1.44 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 59.90 | ± 1.55 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 24.96 | ± 1.22 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.89 | ± 0.92 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 46.67 | ± 2.89 |

Average: 45.76%

Average score: 60.09%

Elapsed time: 02:10:16

## Open LLM Leaderboard Evaluation Results

Detailed results can be found on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-dpo).

| Metric | Value |
|--------|-------|
| Avg. | 73.54 |
| AI2 Reasoning Challenge (25-Shot) | 71.76 |
| HellaSwag (10-Shot) | 88.08 |
| MMLU (5-Shot) | 66.06 |
| TruthfulQA (0-shot) | 71.98 |
| Winogrande (5-shot) | 82.32 |
| GSM8k (5-shot) | 61.03 |
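
These scores come from the leaderboard's lm-evaluation-harness runs. A rough local reproduction of a single task could look like the sketch below; it assumes lm-eval 0.4.x, and the leaderboard pins its own harness revision and prompts, so exact numbers may differ.

```python
# Rough local reproduction of one leaderboard task with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Version and settings are
# assumptions; the leaderboard pins its own, so scores may not match exactly.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=macadeliccc/SOLAR-10.7b-Instruct-dpo,dtype=bfloat16",
    tasks=["hellaswag"],   # HellaSwag (10-shot) row in the table above
    num_fewshot=10,
    batch_size=8,
)
print(results["results"]["hellaswag"])
```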