
SOLAR-10.7b-Instruct-truthy-dpo


This model is a DPO finetune of upstage/SOLAR-10.7B-Instruct-v1.0 (see Process below).

Process

  1. I finetuned upstage/SOLAR-10.7B-Instruct-v1.0 with 1 epoch of Intel/orca_dpo_pairs (12.4k samples).
  2. I further finetuned that model with 3 epochs of jondurbin/truthy-dpo-v0.1 (1.04k samples); a hedged sketch of one such DPO pass is shown below.
  3. This process is experimental, and the base model linked above has been tested more thoroughly at this time.
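
For illustration only, here is a minimal sketch of a single DPO pass with Hugging Face trl, roughly matching step 1. The hyperparameters, the prompt/chosen/rejected column mapping, and the trl version are assumptions, not the author's actual configuration (newer trl releases take `processing_class` instead of `tokenizer`).

```python
# Hedged sketch only: not the author's training script. Hyperparameters,
# the column mapping, and the trl version are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "upstage/SOLAR-10.7B-Instruct-v1.0"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Intel/orca_dpo_pairs provides system/question/chosen/rejected columns;
# DPOTrainer expects prompt/chosen/rejected.
ds = load_dataset("Intel/orca_dpo_pairs", split="train")
ds = ds.map(
    lambda r: {
        "prompt": (r["system"] + "\n" if r["system"] else "") + r["question"],
        "chosen": r["chosen"],
        "rejected": r["rejected"],
    },
    remove_columns=ds.column_names,
)

args = DPOConfig(
    output_dir="solar-orca-dpo",
    num_train_epochs=1,            # step 1 above used a single epoch
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # assumed; not reported in the card
    learning_rate=5e-7,             # assumed; not reported in the card
    beta=0.1,                       # DPO KL weight, library default
)

# With no explicit ref_model, DPOTrainer builds a frozen reference copy itself.
trainer = DPOTrainer(model=model, args=args, train_dataset=ds, tokenizer=tokenizer)
trainer.train()
trainer.save_model("solar-orca-dpo")
```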

GGUF

Available here
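
A quick way to try a GGUF quant locally is llama-cpp-python. The sketch below is an assumption-laden example; the filename is a placeholder rather than a file guaranteed to exist in the repo.

```python
# Hedged sketch: run a GGUF quant with llama-cpp-python using ChatML,
# the prompt format reported in the evaluations below.
from llama_cpp import Llama

llm = Llama(
    model_path="SOLAR-10.7b-Instruct-truthy-dpo.Q4_K_M.gguf",  # placeholder name
    n_ctx=4096,
    chat_format="chatml",
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful, truthful assistant."},
        {"role": "user", "content": "Explain DPO in two sentences."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```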

Evaluations

----Benchmark Complete----

  • 2024-01-26 20:57:38
  • Time taken: 25.4 mins
  • Prompt Format: ChatML
  • Model: macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo-GGUF
  • Score (v2): 74.11
  • Parseable: 171.0

Batch completed Time taken: 25.5 mins
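
The ChatML prompt format listed above wraps each turn in `<|im_start|>` / `<|im_end|>` markers; a minimal prompt (the system and user messages here are only examples) looks like:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
How many planets are in the solar system?<|im_end|>
<|im_start|>assistant
```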


| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|-------|--------:|--------:|-----------:|---------:|--------:|
| SOLAR-10.7b-Instruct-truthy-dpo | 48.69 | 73.82 | 76.81 | 45.71 | 61.26 |

AGIEval

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| agieval_aqua_rat | 0 | acc | 27.95 | ± 2.82 |
| | | acc_norm | 27.95 | ± 2.82 |
| agieval_logiqa_en | 0 | acc | 42.40 | ± 1.94 |
| | | acc_norm | 42.24 | ± 1.94 |
| agieval_lsat_ar | 0 | acc | 25.65 | ± 2.89 |
| | | acc_norm | 23.91 | ± 2.82 |
| agieval_lsat_lr | 0 | acc | 54.12 | ± 2.21 |
| | | acc_norm | 54.51 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 69.89 | ± 2.80 |
| | | acc_norm | 69.89 | ± 2.80 |
| agieval_sat_en | 0 | acc | 80.10 | ± 2.79 |
| | | acc_norm | 80.10 | ± 2.79 |
| agieval_sat_en_without_passage | 0 | acc | 50.00 | ± 3.49 |
| | | acc_norm | 49.51 | ± 3.49 |
| agieval_sat_math | 0 | acc | 42.27 | ± 3.34 |
| | | acc_norm | 41.36 | ± 3.33 |

Average: 48.69%

GPT4All

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| arc_challenge | 0 | acc | 59.90 | ± 1.43 |
| | | acc_norm | 63.91 | ± 1.40 |
| arc_easy | 0 | acc | 80.85 | ± 0.81 |
| | | acc_norm | 78.16 | ± 0.85 |
| boolq | 1 | acc | 88.20 | ± 0.56 |
| hellaswag | 0 | acc | 68.34 | ± 0.46 |
| | | acc_norm | 86.39 | ± 0.34 |
| openbookqa | 0 | acc | 37.60 | ± 2.17 |
| | | acc_norm | 46.80 | ± 2.23 |
| piqa | 0 | acc | 78.84 | ± 0.95 |
| | | acc_norm | 78.78 | ± 0.95 |
| winogrande | 0 | acc | 74.51 | ± 1.22 |

Average: 73.82%

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| truthfulqa_mc | 1 | mc1 | 61.81 | ± 1.70 |
| | | mc2 | 76.81 | ± 1.42 |

Average: 76.81%

Bigbench

| Task | Version | Metric | Value | Stderr |
|------|--------:|--------|------:|-------:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 50.53 | ± 3.64 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 63.14 | ± 2.51 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 47.67 | ± 3.12 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 26.18 | ± 2.32 |
| | | exact_str_match | 0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 28.60 | ± 2.02 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 21.29 | ± 1.55 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 47.33 | ± 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 39.80 | ± 2.19 |
| bigbench_navigate | 0 | multiple_choice_grade | 63.80 | ± 1.52 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 59.05 | ± 1.10 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 40.18 | ± 2.32 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 46.69 | ± 1.58 |
| bigbench_snarks | 0 | multiple_choice_grade | 65.19 | ± 3.55 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 72.41 | ± 1.42 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 60.30 | ± 1.55 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 25.76 | ± 1.24 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.43 | ± 0.91 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 47.33 | ± 2.89 |

Average: 45.71%

Average score: 61.26%

Elapsed time: 02:16:03

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|--------|------:|
| Avg. | 74.11 |
| AI2 Reasoning Challenge (25-Shot) | 72.10 |
| HellaSwag (10-Shot) | 88.44 |
| MMLU (5-Shot) | 65.45 |
| TruthfulQA (0-shot) | 76.75 |
| Winogrande (5-shot) | 82.72 |
| GSM8k (5-shot) | 59.21 |
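
As a rough local approximation of one of these leaderboard tasks, EleutherAI's lm-evaluation-harness can be driven from Python. The sketch below is hedged: the harness version, task name, and settings are assumptions and may not match the leaderboard's pinned configuration.

```python
# Hedged sketch: approximate a single leaderboard task locally with lm-eval.
# Task names and few-shot settings may differ from the leaderboard's harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo,dtype=float16",
    tasks=["arc_challenge"],   # AI2 Reasoning Challenge
    num_fewshot=25,            # 25-shot, as in the table above
    batch_size=4,
)
print(results["results"]["arc_challenge"])
```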