| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|--------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[gemma-2b-orpo](https://huggingface.co/anakin87/gemma-2b-orpo)| 23.76| 58.25| 44.47| 31.32| 39.45|

### AGIEval

| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |15.35|± | 2.27|
| | |acc_norm|17.32|± | 2.38|
|agieval_logiqa_en | 0|acc |25.96|± | 1.72|
| | |acc_norm|29.34|± | 1.79|
|agieval_lsat_ar | 0|acc |19.57|± | 2.62|
| | |acc_norm|20.00|± | 2.64|
|agieval_lsat_lr | 0|acc |23.14|± | 1.87|
| | |acc_norm|21.96|± | 1.83|
|agieval_lsat_rc | 0|acc |24.16|± | 2.61|
| | |acc_norm|24.54|± | 2.63|
|agieval_sat_en | 0|acc |29.61|± | 3.19|
| | |acc_norm|27.18|± | 3.11|
|agieval_sat_en_without_passage| 0|acc |30.58|± | 3.22|
| | |acc_norm|24.76|± | 3.01|
|agieval_sat_math | 0|acc |23.64|± | 2.87|
| | |acc_norm|25.00|± | 2.93|

Average: 23.76%

### GPT4All

| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |37.97|± | 1.42|
| | |acc_norm|40.61|± | 1.44|
|arc_easy | 0|acc |67.63|± | 0.96|
| | |acc_norm|65.82|± | 0.97|
|boolq | 1|acc |69.85|± | 0.80|
|hellaswag | 0|acc |52.39|± | 0.50|
| | |acc_norm|67.70|± | 0.47|
|openbookqa | 0|acc |25.40|± | 1.95|
| | |acc_norm|37.40|± | 2.17|
|piqa | 0|acc |71.71|± | 1.05|
| | |acc_norm|72.74|± | 1.04|
|winogrande | 0|acc |53.59|± | 1.40|

Average: 58.25%

### TruthfulQA

| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |28.76|± | 1.58|
| | |mc2 |44.47|± | 1.61|

Average: 44.47%

### Bigbench

| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|51.58|± | 3.64|
|bigbench_date_understanding | 0|multiple_choice_grade|43.63|± | 2.59|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|37.21|± | 3.02|
|bigbench_geometric_shapes | 0|multiple_choice_grade|10.03|± | 1.59|
| | |exact_str_match | 0.00|± | 0.00|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|23.80|± | 1.91|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|18.00|± | 1.45|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|38.67|± | 2.82|
|bigbench_movie_recommendation | 0|multiple_choice_grade|22.60|± | 1.87|
|bigbench_navigate | 0|multiple_choice_grade|50.00|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|32.80|± | 1.05|
|bigbench_ruin_names | 0|multiple_choice_grade|25.67|± | 2.07|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|19.24|± | 1.25|
|bigbench_snarks | 0|multiple_choice_grade|44.75|± | 3.71|
|bigbench_sports_understanding | 0|multiple_choice_grade|49.70|± | 1.59|
|bigbench_temporal_sequences | 0|multiple_choice_grade|24.60|± | 1.36|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|19.20|± | 1.11|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|13.60|± | 0.82|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|38.67|± | 2.82|

Average: 31.32%

Average score: 39.45%

Elapsed time: 02:46:40
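
The tables above follow the output format of EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). As a rough, unofficial sketch of how a subset of these numbers could be re-run with the current upstream harness (v0.4+), the snippet below evaluates the GPT4All group; the zero-shot setting is an assumption inferred from the version column, and scores may not match exactly if the original run used a fork or different harness version:

```python
# Unofficial reproduction sketch, not the card author's exact pipeline.
# Assumes lm-evaluation-harness v0.4+ (pip install lm-eval) and a CUDA GPU.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=anakin87/gemma-2b-orpo",
    tasks=[
        "arc_challenge", "arc_easy", "boolq", "hellaswag",
        "openbookqa", "piqa", "winogrande",
    ],
    num_fewshot=0,  # assumed zero-shot, matching the tables above
    batch_size=8,
)

# Print per-task metrics (acc, acc_norm, stderr, ...) as a flat report.
for task, metrics in results["results"].items():
    print(task, metrics)
```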