# gemma-2b-orpo: Nous benchmark results
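
Evaluation of [gemma-2b-orpo](https://huggingface.co/anakin87/gemma-2b-orpo) on the Nous benchmark suite (AGIEval, GPT4All, TruthfulQA, Bigbench). All values are scores in percent; the Stderr columns report the standard error of each estimate.
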
| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|--------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[gemma-2b-orpo](https://huggingface.co/anakin87/gemma-2b-orpo)| 23.76| 58.25| 44.47| 31.32| 39.45|
### AGIEval
| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |15.35|± | 2.27|
| | |acc_norm|17.32|± | 2.38|
|agieval_logiqa_en | 0|acc |25.96|± | 1.72|
| | |acc_norm|29.34|± | 1.79|
|agieval_lsat_ar | 0|acc |19.57|± | 2.62|
| | |acc_norm|20.00|± | 2.64|
|agieval_lsat_lr | 0|acc |23.14|± | 1.87|
| | |acc_norm|21.96|± | 1.83|
|agieval_lsat_rc | 0|acc |24.16|± | 2.61|
| | |acc_norm|24.54|± | 2.63|
|agieval_sat_en | 0|acc |29.61|± | 3.19|
| | |acc_norm|27.18|± | 3.11|
|agieval_sat_en_without_passage| 0|acc |30.58|± | 3.22|
| | |acc_norm|24.76|± | 3.01|
|agieval_sat_math | 0|acc |23.64|± | 2.87|
| | |acc_norm|25.00|± | 2.93|

Average: 23.76%
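
The section average appears to be the plain unweighted mean of the `acc_norm` column; the same rule reproduces the Bigbench average from its `multiple_choice_grade` column. A minimal check in Python, with the values copied from the table above (the aggregation rule itself is an assumption, not stated in the output):

```python
# AGIEval acc_norm scores copied from the table above.
# Assumption: the section average is the unweighted mean of these values.
agieval_acc_norm = [17.32, 29.34, 20.00, 21.96, 24.54, 27.18, 24.76, 25.00]

average = sum(agieval_acc_norm) / len(agieval_acc_norm)
print(f"AGIEval average: {average:.2f}%")  # prints 23.76%
```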
### GPT4All
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |37.97|± | 1.42|
| | |acc_norm|40.61|± | 1.44|
|arc_easy | 0|acc |67.63|± | 0.96|
| | |acc_norm|65.82|± | 0.97|
|boolq | 1|acc |69.85|± | 0.80|
|hellaswag | 0|acc |52.39|± | 0.50|
| | |acc_norm|67.70|± | 0.47|
|openbookqa | 0|acc |25.40|± | 1.95|
| | |acc_norm|37.40|± | 2.17|
|piqa | 0|acc |71.71|± | 1.05|
| | |acc_norm|72.74|± | 1.04|
|winogrande | 0|acc |53.59|± | 1.40|

Average: 58.25%
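
This average seems to mix metrics: `acc_norm` where it is reported, plain `acc` for boolq and winogrande (which only report `acc`). A quick check with the rounded table values; the rule is an assumption, and the small gap to the reported 58.25 presumably comes from averaging the unrounded scores:

```python
# GPT4All scores copied from the table above.
# Assumption: acc_norm is used where available, otherwise acc (boolq, winogrande).
gpt4all_scores = {
    "arc_challenge": 40.61,  # acc_norm
    "arc_easy": 65.82,       # acc_norm
    "boolq": 69.85,          # acc (no acc_norm reported)
    "hellaswag": 67.70,      # acc_norm
    "openbookqa": 37.40,     # acc_norm
    "piqa": 72.74,           # acc_norm
    "winogrande": 53.59,     # acc (no acc_norm reported)
}

average = sum(gpt4all_scores.values()) / len(gpt4all_scores)
print(f"GPT4All average: {average:.2f}%")  # prints 58.24 (reported: 58.25)
```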
### TruthfulQA
| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |28.76|± | 1.58|
| | |mc2 |44.47|± | 1.61|

Average: 44.47%
### Bigbench
| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|51.58|± | 3.64|
|bigbench_date_understanding | 0|multiple_choice_grade|43.63|± | 2.59|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|37.21|± | 3.02|
|bigbench_geometric_shapes | 0|multiple_choice_grade|10.03|± | 1.59|
| | |exact_str_match | 0.00|± | 0.00|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|23.80|± | 1.91|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|18.00|± | 1.45|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|38.67|± | 2.82|
|bigbench_movie_recommendation | 0|multiple_choice_grade|22.60|± | 1.87|
|bigbench_navigate | 0|multiple_choice_grade|50.00|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|32.80|± | 1.05|
|bigbench_ruin_names | 0|multiple_choice_grade|25.67|± | 2.07|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|19.24|± | 1.25|
|bigbench_snarks | 0|multiple_choice_grade|44.75|± | 3.71|
|bigbench_sports_understanding | 0|multiple_choice_grade|49.70|± | 1.59|
|bigbench_temporal_sequences | 0|multiple_choice_grade|24.60|± | 1.36|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|19.20|± | 1.11|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|13.60|± | 0.82|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|38.67|± | 2.82|

Average: 31.32%

Average score: 39.45%

Elapsed time: 02:46:40
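
The overall score works out to be the unweighted mean of the four per-benchmark averages from the summary table at the top. A minimal sketch of that arithmetic:

```python
# Per-benchmark averages from the summary table above.
nous_averages = {
    "AGIEval": 23.76,
    "GPT4All": 58.25,
    "TruthfulQA": 44.47,
    "Bigbench": 31.32,
}

overall = sum(nous_averages.values()) / len(nous_averages)
print(f"Average score: {overall:.2f}%")  # prints 39.45%
```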