Edit model card

SOLAR-10.7b-Instruct-truthy-dpo

orca-bagel

This model is a finetune of macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo

Process

  1. I finetuned upstageai/Solar-10.7b-Instruct-v0.1 with 1 epoch of Intel/orca_dpo_pairs (12.4k samples)
  2. I futher finetuned that model with 3 epochs of jondurbin/truthy-dpo-v0.1 (1.04k samples)
  3. This process is experimental and the base model linked above is more tested at this time.

GGUF

Available here

Evaluations

----Benchmark Complete---- + 2024-01-26 20:57:38 + Time taken: 25.4 mins + Prompt Format: ChatML + Model: macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo-GGUF + Score (v2): 74.11 + Parseable: 171.0

Batch completed Time taken: 25.5 mins

Model AGIEval GPT4All TruthfulQA Bigbench Average
SOLAR-10.7b-Instruct-truthy-dpo 48.69 73.82 76.81 45.71 61.26

AGIEval

Task Version Metric Value Stderr
agieval_aqua_rat 0 acc 27.95 ± 2.82
acc_norm 27.95 ± 2.82
agieval_logiqa_en 0 acc 42.40 ± 1.94
acc_norm 42.24 ± 1.94
agieval_lsat_ar 0 acc 25.65 ± 2.89
acc_norm 23.91 ± 2.82
agieval_lsat_lr 0 acc 54.12 ± 2.21
acc_norm 54.51 ± 2.21
agieval_lsat_rc 0 acc 69.89 ± 2.80
acc_norm 69.89 ± 2.80
agieval_sat_en 0 acc 80.10 ± 2.79
acc_norm 80.10 ± 2.79
agieval_sat_en_without_passage 0 acc 50.00 ± 3.49
acc_norm 49.51 ± 3.49
agieval_sat_math 0 acc 42.27 ± 3.34
acc_norm 41.36 ± 3.33

Average: 48.69%

GPT4All

Task Version Metric Value Stderr
arc_challenge 0 acc 59.90 ± 1.43
acc_norm 63.91 ± 1.40
arc_easy 0 acc 80.85 ± 0.81
acc_norm 78.16 ± 0.85
boolq 1 acc 88.20 ± 0.56
hellaswag 0 acc 68.34 ± 0.46
acc_norm 86.39 ± 0.34
openbookqa 0 acc 37.60 ± 2.17
acc_norm 46.80 ± 2.23
piqa 0 acc 78.84 ± 0.95
acc_norm 78.78 ± 0.95
winogrande 0 acc 74.51 ± 1.22

Average: 73.82%

TruthfulQA

Task Version Metric Value Stderr
truthfulqa_mc 1 mc1 61.81 ± 1.70
mc2 76.81 ± 1.42

Average: 76.81%

Bigbench

Task Version Metric Value Stderr
bigbench_causal_judgement 0 multiple_choice_grade 50.53 ± 3.64
bigbench_date_understanding 0 multiple_choice_grade 63.14 ± 2.51
bigbench_disambiguation_qa 0 multiple_choice_grade 47.67 ± 3.12
bigbench_geometric_shapes 0 multiple_choice_grade 26.18 ± 2.32
exact_str_match 0.00 ± 0.00
bigbench_logical_deduction_five_objects 0 multiple_choice_grade 28.60 ± 2.02
bigbench_logical_deduction_seven_objects 0 multiple_choice_grade 21.29 ± 1.55
bigbench_logical_deduction_three_objects 0 multiple_choice_grade 47.33 ± 2.89
bigbench_movie_recommendation 0 multiple_choice_grade 39.80 ± 2.19
bigbench_navigate 0 multiple_choice_grade 63.80 ± 1.52
bigbench_reasoning_about_colored_objects 0 multiple_choice_grade 59.05 ± 1.10
bigbench_ruin_names 0 multiple_choice_grade 40.18 ± 2.32
bigbench_salient_translation_error_detection 0 multiple_choice_grade 46.69 ± 1.58
bigbench_snarks 0 multiple_choice_grade 65.19 ± 3.55
bigbench_sports_understanding 0 multiple_choice_grade 72.41 ± 1.42
bigbench_temporal_sequences 0 multiple_choice_grade 60.30 ± 1.55
bigbench_tracking_shuffled_objects_five_objects 0 multiple_choice_grade 25.76 ± 1.24
bigbench_tracking_shuffled_objects_seven_objects 0 multiple_choice_grade 17.43 ± 0.91
bigbench_tracking_shuffled_objects_three_objects 0 multiple_choice_grade 47.33 ± 2.89

Average: 45.71%

Average score: 61.26%

Elapsed time: 02:16:03

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 74.11
AI2 Reasoning Challenge (25-Shot) 72.10
HellaSwag (10-Shot) 88.44
MMLU (5-Shot) 65.45
TruthfulQA (0-shot) 76.75
Winogrande (5-shot) 82.72
GSM8k (5-shot) 59.21
Downloads last month
357
Safetensors
Model size
10.7B params
Tensor type
FP16
·
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Space using macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo 1

Collection including macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo

Evaluation results