---
license: cc-by-nc-4.0
model-index:
- name: Lelantos-DPO-7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 71.08
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Lelantos-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 87.22
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Lelantos-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.0
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Lelantos-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 67.77
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Lelantos-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 80.03
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Lelantos-DPO-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 68.46
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Lelantos-DPO-7B
      name: Open LLM Leaderboard
---
| Model           | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|-----------------|--------:|--------:|-----------:|---------:|--------:|
| Lelantos-DPO-7B |   45.47 |   75.00 |      67.05 |    46.64 |   58.54 |
| Lelantos-7B     |   46.01 |   75.00 |      64.93 |    46.21 |   58.04 |
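The per-task tables below follow the output format of EleutherAI's lm-evaluation-harness; the card does not state the exact command or harness version used, so the following is only a minimal sketch of how scores like these could be re-run (task names, few-shot settings, and result keys can differ between harness releases).

```python
# Minimal sketch, assuming a recent lm-evaluation-harness release (pip install lm-eval).
# The exact harness version and task set behind the tables below are not stated in the card.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=SanjiWatsuki/Lelantos-DPO-7B,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag"],  # two of the tasks reported below
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metrics (acc / acc_norm) produced by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```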
### AGIEval
| Task                           | Version | Metric   | Value |    | Stderr |
|--------------------------------|--------:|----------|------:|----|-------:|
| agieval_aqua_rat               |       0 | acc      | 25.20 | ±  |   2.73 |
|                                |         | acc_norm | 24.02 | ±  |   2.69 |
| agieval_logiqa_en              |       0 | acc      | 40.71 | ±  |   1.93 |
|                                |         | acc_norm | 40.25 | ±  |   1.92 |
| agieval_lsat_ar                |       0 | acc      | 24.35 | ±  |   2.84 |
|                                |         | acc_norm | 23.04 | ±  |   2.78 |
| agieval_lsat_lr                |       0 | acc      | 55.69 | ±  |   2.20 |
|                                |         | acc_norm | 55.49 | ±  |   2.20 |
| agieval_lsat_rc                |       0 | acc      | 65.06 | ±  |   2.91 |
|                                |         | acc_norm | 65.43 | ±  |   2.91 |
| agieval_sat_en                 |       0 | acc      | 76.70 | ±  |   2.95 |
|                                |         | acc_norm | 76.70 | ±  |   2.95 |
| agieval_sat_en_without_passage |       0 | acc      | 47.09 | ±  |   3.49 |
|                                |         | acc_norm | 45.63 | ±  |   3.48 |
| agieval_sat_math               |       0 | acc      | 36.36 | ±  |   3.25 |
|                                |         | acc_norm | 33.18 | ±  |   3.18 |

Average: 45.47%
### GPT4All
| Task          | Version | Metric   | Value |    | Stderr |
|---------------|--------:|----------|------:|----|-------:|
| arc_challenge |       0 | acc      | 62.12 | ±  |   1.42 |
|               |         | acc_norm | 63.23 | ±  |   1.41 |
| arc_easy      |       0 | acc      | 85.40 | ±  |   0.72 |
|               |         | acc_norm | 81.02 | ±  |   0.80 |
| boolq         |       1 | acc      | 87.25 | ±  |   0.58 |
| hellaswag     |       0 | acc      | 67.97 | ±  |   0.47 |
|               |         | acc_norm | 85.48 | ±  |   0.35 |
| openbookqa    |       0 | acc      | 36.80 | ±  |   2.16 |
|               |         | acc_norm | 47.20 | ±  |   2.23 |
| piqa          |       0 | acc      | 81.88 | ±  |   0.90 |
|               |         | acc_norm | 83.57 | ±  |   0.86 |
| winogrande    |       0 | acc      | 77.27 | ±  |   1.18 |

Average: 75.0%
### TruthfulQA
| Task          | Version | Metric | Value |    | Stderr |
|---------------|--------:|--------|------:|----|-------:|
| truthfulqa_mc |       1 | mc1    | 49.94 | ±  |   1.75 |
|               |         | mc2    | 67.05 | ±  |   1.53 |

Average: 67.05%
### Bigbench
| Task                                             | Version | Metric                | Value |    | Stderr |
|--------------------------------------------------|--------:|-----------------------|------:|----|-------:|
| bigbench_causal_judgement                        |       0 | multiple_choice_grade | 58.95 | ±  |   3.58 |
| bigbench_date_understanding                      |       0 | multiple_choice_grade | 64.23 | ±  |   2.50 |
| bigbench_disambiguation_qa                       |       0 | multiple_choice_grade | 36.43 | ±  |   3.00 |
| bigbench_geometric_shapes                        |       0 | multiple_choice_grade | 23.68 | ±  |   2.25 |
|                                                  |         | exact_str_match       |  3.90 | ±  |   1.02 |
| bigbench_logical_deduction_five_objects          |       0 | multiple_choice_grade | 33.40 | ±  |   2.11 |
| bigbench_logical_deduction_seven_objects         |       0 | multiple_choice_grade | 24.43 | ±  |   1.63 |
| bigbench_logical_deduction_three_objects         |       0 | multiple_choice_grade | 54.33 | ±  |   2.88 |
| bigbench_movie_recommendation                    |       0 | multiple_choice_grade | 52.20 | ±  |   2.24 |
| bigbench_navigate                                |       0 | multiple_choice_grade | 52.70 | ±  |   1.58 |
| bigbench_reasoning_about_colored_objects         |       0 | multiple_choice_grade | 69.65 | ±  |   1.03 |
| bigbench_ruin_names                              |       0 | multiple_choice_grade | 50.22 | ±  |   2.36 |
| bigbench_salient_translation_error_detection     |       0 | multiple_choice_grade | 40.98 | ±  |   1.56 |
| bigbench_snarks                                  |       0 | multiple_choice_grade | 72.38 | ±  |   3.33 |
| bigbench_sports_understanding                    |       0 | multiple_choice_grade | 73.23 | ±  |   1.41 |
| bigbench_temporal_sequences                      |       0 | multiple_choice_grade | 39.90 | ±  |   1.55 |
| bigbench_tracking_shuffled_objects_five_objects  |       0 | multiple_choice_grade | 20.88 | ±  |   1.15 |
| bigbench_tracking_shuffled_objects_seven_objects |       0 | multiple_choice_grade | 17.60 | ±  |   0.91 |
| bigbench_tracking_shuffled_objects_three_objects |       0 | multiple_choice_grade | 54.33 | ±  |   2.88 |

Average: 46.64%
Average score: 58.54%
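As a quick sanity check (my own sketch, not from the original card), the suite averages quoted above are simple means of the per-task values, using acc_norm where it is reported and the single reported metric otherwise (the exact_str_match row is not counted toward the Bigbench average):

```python
# Sketch (not from the card): reproduce the suite averages from the per-task scores above,
# taking acc_norm where reported and the sole reported metric otherwise.
agieval    = [24.02, 40.25, 23.04, 55.49, 65.43, 76.70, 45.63, 33.18]
gpt4all    = [63.23, 81.02, 87.25, 85.48, 47.20, 83.57, 77.27]
truthfulqa = [67.05]
bigbench   = [58.95, 64.23, 36.43, 23.68, 33.40, 24.43, 54.33, 52.20, 52.70,
              69.65, 50.22, 40.98, 72.38, 73.23, 39.90, 20.88, 17.60, 54.33]

def mean(xs):
    return sum(xs) / len(xs)

suites = {"AGIEval": agieval, "GPT4All": gpt4all,
          "TruthfulQA": truthfulqa, "Bigbench": bigbench}
for name, scores in suites.items():
    print(f"{name}: {mean(scores):.2f}")  # 45.47, 75.00, 67.05, 46.64
print(f"Average: {mean([mean(s) for s in suites.values()]):.2f}")  # 58.54
```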
## Open LLM Leaderboard Evaluation Results
Detailed results can be found here.

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 73.09 |
| AI2 Reasoning Challenge (25-Shot) | 71.08 |
| HellaSwag (10-Shot)               | 87.22 |
| MMLU (5-Shot)                     | 64.00 |
| TruthfulQA (0-shot)               | 67.77 |
| Winogrande (5-shot)               | 80.03 |
| GSM8k (5-shot)                    | 68.46 |