dreamgen/llama3-8b-instruct-align-test2-kto

What is this? Nothing interesting, just an experiment.
License: CC-BY-NC

|                         Task                         |Version|    Metric    |Value |   |Stderr|
|------------------------------------------------------|------:|--------------|-----:|---|-----:|
|all                                                   |       |acc           |0.6502|±  |0.0327|
|                                                      |       |acc_norm      |0.6414|±  |0.0095|
|                                                      |       |truthfulqa_mc1|0.3696|±  |0.0169|
|                                                      |       |truthfulqa_mc2|0.5305|±  |0.0159|
|                                                      |       |qem           |0.4670|±  |0.0137|
|leaderboard:arc:challenge:25                          |      0|acc           |0.5555|±  |0.0145|
|                                                      |       |acc_norm      |0.5623|±  |0.0145|
|leaderboard:gsm8k:5                                   |      0|qem           |0.4670|±  |0.0137|
|leaderboard:hellaswag:10                              |      0|acc           |0.5598|±  |0.0050|
|                                                      |       |acc_norm      |0.7205|±  |0.0045|
|leaderboard:mmlu:_average:5                           |       |acc           |0.6527|±  |0.0338|
|leaderboard:mmlu:abstract_algebra:5                   |      0|acc           |0.3300|±  |0.0473|
|leaderboard:mmlu:anatomy:5                            |      0|acc           |0.6593|±  |0.0409|
|leaderboard:mmlu:astronomy:5                          |      0|acc           |0.7303|±  |0.0361|
|leaderboard:mmlu:business_ethics:5                    |      0|acc           |0.6700|±  |0.0473|
|leaderboard:mmlu:clinical_knowledge:5                 |      0|acc           |0.7321|±  |0.0273|
|leaderboard:mmlu:college_biology:5                    |      0|acc           |0.7708|±  |0.0351|
|leaderboard:mmlu:college_chemistry:5                  |      0|acc           |0.4900|±  |0.0502|
|leaderboard:mmlu:college_computer_science:5           |      0|acc           |0.4600|±  |0.0501|
|leaderboard:mmlu:college_mathematics:5                |      0|acc           |0.3900|±  |0.0490|
|leaderboard:mmlu:college_medicine:5                   |      0|acc           |0.6069|±  |0.0372|
|leaderboard:mmlu:college_physics:5                    |      0|acc           |0.4706|±  |0.0497|
|leaderboard:mmlu:computer_security:5                  |      0|acc           |0.7800|±  |0.0416|
|leaderboard:mmlu:conceptual_physics:5                 |      0|acc           |0.5830|±  |0.0322|
|leaderboard:mmlu:econometrics:5                       |      0|acc           |0.5000|±  |0.0470|
|leaderboard:mmlu:electrical_engineering:5             |      0|acc           |0.5862|±  |0.0410|
|leaderboard:mmlu:elementary_mathematics:5             |      0|acc           |0.4630|±  |0.0257|
|leaderboard:mmlu:formal_logic:5                       |      0|acc           |0.5238|±  |0.0447|
|leaderboard:mmlu:global_facts:5                       |      0|acc           |0.4300|±  |0.0498|
|leaderboard:mmlu:high_school_biology:5                |      0|acc           |0.7581|±  |0.0244|
|leaderboard:mmlu:high_school_chemistry:5              |      0|acc           |0.5271|±  |0.0351|
|leaderboard:mmlu:high_school_computer_science:5       |      0|acc           |0.6600|±  |0.0476|
|leaderboard:mmlu:high_school_european_history:5       |      0|acc           |0.7212|±  |0.0350|
|leaderboard:mmlu:high_school_geography:5              |      0|acc           |0.7929|±  |0.0289|
|leaderboard:mmlu:high_school_government_and_politics:5|      0|acc           |0.8756|±  |0.0238|
|leaderboard:mmlu:high_school_macroeconomics:5         |      0|acc           |0.6590|±  |0.0240|
|leaderboard:mmlu:high_school_mathematics:5            |      0|acc           |0.3407|±  |0.0289|
|leaderboard:mmlu:high_school_microeconomics:5         |      0|acc           |0.7563|±  |0.0279|
|leaderboard:mmlu:high_school_physics:5                |      0|acc           |0.4503|±  |0.0406|
|leaderboard:mmlu:high_school_psychology:5             |      0|acc           |0.8294|±  |0.0161|
|leaderboard:mmlu:high_school_statistics:5             |      0|acc           |0.4954|±  |0.0341|
|leaderboard:mmlu:high_school_us_history:5             |      0|acc           |0.8039|±  |0.0279|
|leaderboard:mmlu:high_school_world_history:5          |      0|acc           |0.8186|±  |0.0251|
|leaderboard:mmlu:human_aging:5                        |      0|acc           |0.6951|±  |0.0309|
|leaderboard:mmlu:human_sexuality:5                    |      0|acc           |0.7863|±  |0.0360|
|leaderboard:mmlu:international_law:5                  |      0|acc           |0.8017|±  |0.0364|
|leaderboard:mmlu:jurisprudence:5                      |      0|acc           |0.8056|±  |0.0383|
|leaderboard:mmlu:logical_fallacies:5                  |      0|acc           |0.7362|±  |0.0346|
|leaderboard:mmlu:machine_learning:5                   |      0|acc           |0.4911|±  |0.0475|
|leaderboard:mmlu:management:5                         |      0|acc           |0.8252|±  |0.0376|
|leaderboard:mmlu:marketing:5                          |      0|acc           |0.8718|±  |0.0219|
|leaderboard:mmlu:medical_genetics:5                   |      0|acc           |0.6900|±  |0.0465|
|leaderboard:mmlu:miscellaneous:5                      |      0|acc           |0.8225|±  |0.0137|
|leaderboard:mmlu:moral_disputes:5                     |      0|acc           |0.7052|±  |0.0245|
|leaderboard:mmlu:moral_scenarios:5                    |      0|acc           |0.4190|±  |0.0165|
|leaderboard:mmlu:nutrition:5                          |      0|acc           |0.7353|±  |0.0253|
|leaderboard:mmlu:philosophy:5                         |      0|acc           |0.7203|±  |0.0255|
|leaderboard:mmlu:prehistory:5                         |      0|acc           |0.6975|±  |0.0256|
|leaderboard:mmlu:professional_accounting:5            |      0|acc           |0.5035|±  |0.0298|
|leaderboard:mmlu:professional_law:5                   |      0|acc           |0.4576|±  |0.0127|
|leaderboard:mmlu:professional_medicine:5              |      0|acc           |0.7132|±  |0.0275|
|leaderboard:mmlu:professional_psychology:5            |      0|acc           |0.6879|±  |0.0187|
|leaderboard:mmlu:public_relations:5                   |      0|acc           |0.6545|±  |0.0455|
|leaderboard:mmlu:security_studies:5                   |      0|acc           |0.7388|±  |0.0281|
|leaderboard:mmlu:sociology:5                          |      0|acc           |0.8159|±  |0.0274|
|leaderboard:mmlu:us_foreign_policy:5                  |      0|acc           |0.8500|±  |0.0359|
|leaderboard:mmlu:virology:5                           |      0|acc           |0.5000|±  |0.0389|
|leaderboard:mmlu:world_religions:5                    |      0|acc           |0.8129|±  |0.0299|
|leaderboard:truthfulqa:mc:0                           |      0|truthfulqa_mc1|0.3696|±  |0.0169|
|                                                      |       |truthfulqa_mc2|0.5305|±  |0.0159|
|leaderboard:winogrande:5                              |      0|acc           |0.6938|±  |0.0130|

Baseline:

|                         Task                         |Version|    Metric    |Value |   |Stderr|
|------------------------------------------------------|------:|--------------|-----:|---|-----:|
|all                                                   |       |acc           |0.6635|±  |0.0322|
|                                                      |       |acc_norm      |0.6569|±  |0.0094|
|                                                      |       |truthfulqa_mc1|0.3745|±  |0.0169|
|                                                      |       |truthfulqa_mc2|0.5338|±  |0.0160|
|                                                      |       |qem           |0.6808|±  |0.0128|
|leaderboard:arc:challenge:25                          |      0|acc           |0.5742|±  |0.0144|
|                                                      |       |acc_norm      |0.5828|±  |0.0144|
|leaderboard:gsm8k:5                                   |      0|qem           |0.6808|±  |0.0128|
|leaderboard:hellaswag:10                              |      0|acc           |0.5707|±  |0.0049|
|                                                      |       |acc_norm      |0.7310|±  |0.0044|
|leaderboard:mmlu:_average:5                           |       |acc           |0.6662|±  |0.0333|
|leaderboard:mmlu:abstract_algebra:5                   |      0|acc           |0.3300|±  |0.0473|
|leaderboard:mmlu:anatomy:5                            |      0|acc           |0.6815|±  |0.0402|
|leaderboard:mmlu:astronomy:5                          |      0|acc           |0.7500|±  |0.0352|
|leaderboard:mmlu:business_ethics:5                    |      0|acc           |0.7000|±  |0.0461|
|leaderboard:mmlu:clinical_knowledge:5                 |      0|acc           |0.7472|±  |0.0267|
|leaderboard:mmlu:college_biology:5                    |      0|acc           |0.7917|±  |0.0340|
|leaderboard:mmlu:college_chemistry:5                  |      0|acc           |0.4500|±  |0.0500|
|leaderboard:mmlu:college_computer_science:5           |      0|acc           |0.5200|±  |0.0502|
|leaderboard:mmlu:college_mathematics:5                |      0|acc           |0.3900|±  |0.0490|
|leaderboard:mmlu:college_medicine:5                   |      0|acc           |0.6590|±  |0.0361|
|leaderboard:mmlu:college_physics:5                    |      0|acc           |0.4314|±  |0.0493|
|leaderboard:mmlu:computer_security:5                  |      0|acc           |0.7900|±  |0.0409|
|leaderboard:mmlu:conceptual_physics:5                 |      0|acc           |0.5872|±  |0.0322|
|leaderboard:mmlu:econometrics:5                       |      0|acc           |0.5439|±  |0.0469|
|leaderboard:mmlu:electrical_engineering:5             |      0|acc           |0.6138|±  |0.0406|
|leaderboard:mmlu:elementary_mathematics:5             |      0|acc           |0.4683|±  |0.0257|
|leaderboard:mmlu:formal_logic:5                       |      0|acc           |0.5317|±  |0.0446|
|leaderboard:mmlu:global_facts:5                       |      0|acc           |0.4600|±  |0.0501|
|leaderboard:mmlu:high_school_biology:5                |      0|acc           |0.8065|±  |0.0225|
|leaderboard:mmlu:high_school_chemistry:5              |      0|acc           |0.5419|±  |0.0351|
|leaderboard:mmlu:high_school_computer_science:5       |      0|acc           |0.6800|±  |0.0469|
|leaderboard:mmlu:high_school_european_history:5       |      0|acc           |0.7394|±  |0.0343|
|leaderboard:mmlu:high_school_geography:5              |      0|acc           |0.8131|±  |0.0278|
|leaderboard:mmlu:high_school_government_and_politics:5|      0|acc           |0.8964|±  |0.0220|
|leaderboard:mmlu:high_school_macroeconomics:5         |      0|acc           |0.6769|±  |0.0237|
|leaderboard:mmlu:high_school_mathematics:5            |      0|acc           |0.3259|±  |0.0286|
|leaderboard:mmlu:high_school_microeconomics:5         |      0|acc           |0.7563|±  |0.0279|
|leaderboard:mmlu:high_school_physics:5                |      0|acc           |0.4106|±  |0.0402|
|leaderboard:mmlu:high_school_psychology:5             |      0|acc           |0.8477|±  |0.0154|
|leaderboard:mmlu:high_school_statistics:5             |      0|acc           |0.4769|±  |0.0341|
|leaderboard:mmlu:high_school_us_history:5             |      0|acc           |0.7892|±  |0.0286|
|leaderboard:mmlu:high_school_world_history:5          |      0|acc           |0.8397|±  |0.0239|
|leaderboard:mmlu:human_aging:5                        |      0|acc           |0.7265|±  |0.0299|
|leaderboard:mmlu:human_sexuality:5                    |      0|acc           |0.7939|±  |0.0355|
|leaderboard:mmlu:international_law:5                  |      0|acc           |0.7686|±  |0.0385|
|leaderboard:mmlu:jurisprudence:5                      |      0|acc           |0.7593|±  |0.0413|
|leaderboard:mmlu:logical_fallacies:5                  |      0|acc           |0.7607|±  |0.0335|
|leaderboard:mmlu:machine_learning:5                   |      0|acc           |0.5268|±  |0.0474|
|leaderboard:mmlu:management:5                         |      0|acc           |0.8155|±  |0.0384|
|leaderboard:mmlu:marketing:5                          |      0|acc           |0.9060|±  |0.0191|
|leaderboard:mmlu:medical_genetics:5                   |      0|acc           |0.7900|±  |0.0409|
|leaderboard:mmlu:miscellaneous:5                      |      0|acc           |0.8238|±  |0.0136|
|leaderboard:mmlu:moral_disputes:5                     |      0|acc           |0.7399|±  |0.0236|
|leaderboard:mmlu:moral_scenarios:5                    |      0|acc           |0.4358|±  |0.0166|
|leaderboard:mmlu:nutrition:5                          |      0|acc           |0.7549|±  |0.0246|
|leaderboard:mmlu:philosophy:5                         |      0|acc           |0.7331|±  |0.0251|
|leaderboard:mmlu:prehistory:5                         |      0|acc           |0.7469|±  |0.0242|
|leaderboard:mmlu:professional_accounting:5            |      0|acc           |0.5177|±  |0.0298|
|leaderboard:mmlu:professional_law:5                   |      0|acc           |0.4648|±  |0.0127|
|leaderboard:mmlu:professional_medicine:5              |      0|acc           |0.7279|±  |0.0270|
|leaderboard:mmlu:professional_psychology:5            |      0|acc           |0.6928|±  |0.0187|
|leaderboard:mmlu:public_relations:5                   |      0|acc           |0.6636|±  |0.0453|
|leaderboard:mmlu:security_studies:5                   |      0|acc           |0.7306|±  |0.0284|
|leaderboard:mmlu:sociology:5                          |      0|acc           |0.8557|±  |0.0248|
|leaderboard:mmlu:us_foreign_policy:5                  |      0|acc           |0.8600|±  |0.0349|
|leaderboard:mmlu:virology:5                           |      0|acc           |0.5361|±  |0.0388|
|leaderboard:mmlu:world_religions:5                    |      0|acc           |0.7953|±  |0.0309|
|leaderboard:truthfulqa:mc:0                           |      0|truthfulqa_mc1|0.3745|±  |0.0169|
|                                                      |       |truthfulqa_mc2|0.5338|±  |0.0160|
|leaderboard:winogrande:5                              |      0|acc           |0.6930|±  |0.0130|
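
To read the two tables side by side, a rough sketch like the one below can help. It computes the tuned-minus-baseline difference for the headline rows, along with a combined standard error and a z-like ratio. The values are copied from the tables above. The `compare` helper and the `rows` selection are hypothetical names introduced here for illustration. The combined error treats the two runs as independent, which is only an approximation since both models are scored on the same test items.

```python
import math

def compare(tuned, tuned_se, base, base_se):
    """Return (delta, combined_se, z) for tuned minus baseline.

    Assumes the two estimates are independent, which is a simplification.
    """
    delta = tuned - base
    combined_se = math.sqrt(tuned_se ** 2 + base_se ** 2)
    return delta, combined_se, delta / combined_se

# (tuned value, tuned stderr, baseline value, baseline stderr), from the tables
rows = {
    "gsm8k:5 qem": (0.4670, 0.0137, 0.6808, 0.0128),
    "arc:challenge:25 acc_norm": (0.5623, 0.0145, 0.5828, 0.0144),
    "hellaswag:10 acc_norm": (0.7205, 0.0045, 0.7310, 0.0044),
    "mmlu:_average:5 acc": (0.6527, 0.0338, 0.6662, 0.0333),
    "truthfulqa mc2": (0.5305, 0.0159, 0.5338, 0.0160),
    "winogrande:5 acc": (0.6938, 0.0130, 0.6930, 0.0130),
}

results = {name: compare(*vals) for name, vals in rows.items()}
for name, (delta, se, z) in results.items():
    print(f"{name:28s} delta={delta:+.4f}  se={se:.4f}  z={z:+.2f}")
```

Under these assumptions, the GSM8K drop is the only headline difference that is large relative to its standard error. The other rows differ by amounts comparable to or smaller than the reported uncertainty.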
