Cannot reproduce some MMLU task results of tiiuae/falcon-180B

#372
by cody-bosonai - opened

Hi there,

I'm trying to reproduce the leaderboard MMLU results for tiiuae/falcon-180B. Specifically, I'm comparing against https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/tiiuae/falcon-180B/results_2023-08-31T01%3A32%3A36.577851.json. The weird thing is that some tasks score much worse in my local run while others match. Here is a side-by-side comparison of all MMLU tasks. Each row shows the acc_norm of one MMLU task, and the table is sorted by absolute difference, largest first. Does anyone have an idea what might be causing this? Thanks.

Task                                                 Ours    Public        Diff
-------------------------------------------------  ------  --------  ----------
hendrycksTest-high_school_us_history               0.25      0.8922  0.642157
hendrycksTest-high_school_european_history         0.2182    0.8121  0.593939
hendrycksTest-jurisprudence                        0.2593    0.8519  0.592593
hendrycksTest-high_school_computer_science         0.25      0.72    0.47
hendrycksTest-business_ethics                      0.3       0.76    0.46
hendrycksTest-electrical_engineering               0.2414    0.6345  0.393103
hendrycksTest-college_computer_science             0.26      0.58    0.32
hendrycksTest-high_school_world_history            0.7806    0.8861  0.105485
hendrycksTest-professional_medicine                0.6691    0.7463  0.0772059
hendrycksTest-college_mathematics                  0.3       0.37    0.07
hendrycksTest-sociology                            0.8259    0.8955  0.0696517
hendrycksTest-astronomy                            0.7237    0.7895  0.0657895
hendrycksTest-abstract_algebra                     0.25      0.31    0.06
hendrycksTest-high_school_chemistry                0.6059    0.5714  0.0344828
hendrycksTest-econometrics                         0.4386    0.4649  0.0263158
hendrycksTest-conceptual_physics                   0.6596    0.6851  0.0255319
hendrycksTest-security_studies                     0.7469    0.7714  0.0244898
hendrycksTest-clinical_knowledge                   0.717     0.7396  0.0226415
hendrycksTest-computer_security                    0.79      0.81    0.02
hendrycksTest-medical_genetics                     0.78      0.8     0.02
hendrycksTest-miscellaneous                        0.8685    0.8876  0.0191571
hendrycksTest-international_law                    0.8099    0.8264  0.0165289
hendrycksTest-high_school_biology                  0.8613    0.8452  0.016129
hendrycksTest-philosophy                           0.7878    0.8039  0.0160772
hendrycksTest-formal_logic                         0.4841    0.4683  0.015873
hendrycksTest-high_school_mathematics              0.363     0.3778  0.0148148
hendrycksTest-professional_accounting              0.5532    0.539   0.0141844
hendrycksTest-college_biology                      0.8125    0.7986  0.0138889
hendrycksTest-high_school_microeconomics           0.7605    0.7731  0.012605
hendrycksTest-elementary_mathematics               0.4921    0.4815  0.010582
hendrycksTest-college_chemistry                    0.49      0.5     0.01
hendrycksTest-global_facts                         0.49      0.5     0.01
hendrycksTest-us_foreign_policy                    0.91      0.92    0.01
hendrycksTest-college_physics                      0.3922    0.402   0.00980392
hendrycksTest-professional_law                     0.5352    0.545   0.00977836
hendrycksTest-management                           0.8447    0.835   0.00970874
hendrycksTest-high_school_statistics               0.6296    0.6204  0.00925926
hendrycksTest-human_aging                          0.8296    0.8206  0.00896861
hendrycksTest-machine_learning                     0.5625    0.5536  0.00892857
hendrycksTest-marketing                            0.9103    0.9188  0.00854701
hendrycksTest-anatomy                              0.637     0.6296  0.00740741
hendrycksTest-prehistory                           0.8117    0.8179  0.00617284
hendrycksTest-virology                             0.5482    0.5542  0.0060241
hendrycksTest-college_medicine                     0.6936    0.6994  0.00578035
hendrycksTest-high_school_psychology               0.9046    0.8991  0.00550459
hendrycksTest-high_school_government_and_politics  0.9482    0.943   0.00518135
hendrycksTest-moral_scenarios                      0.5285    0.533   0.00446927
hendrycksTest-nutrition                            0.7843    0.781   0.00326797
hendrycksTest-moral_disputes                       0.8006    0.8035  0.00289017
hendrycksTest-high_school_macroeconomics           0.7103    0.7128  0.0025641
hendrycksTest-high_school_geography                0.8636    0.8636  0
hendrycksTest-high_school_physics                  0.4305    0.4305  0
hendrycksTest-human_sexuality                      0.8702    0.8702  0
hendrycksTest-logical_fallacies                    0.7975    0.7975  0
hendrycksTest-professional_psychology              0.7565    0.7565  0
hendrycksTest-public_relations                     0.7364    0.7364  0
hendrycksTest-world_religions                      0.848     0.848   0
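
For anyone who wants to build the same kind of comparison, here is a minimal sketch of how such a table could be produced. It assumes the acc_norm values have already been pulled out of each results file into plain dicts keyed by task name; the function name `compare_scores` and the dict layout are illustrative, not part of the leaderboard tooling. The three sample entries are copied from the table above.

```python
def compare_scores(ours, public):
    """Return (task, ours, public, abs_diff) rows, largest gap first.

    Only tasks present in both dicts are compared.
    """
    shared = sorted(set(ours) & set(public))
    rows = [(t, ours[t], public[t], abs(ours[t] - public[t])) for t in shared]
    # Sort by the absolute difference, descending.
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows


# Sample acc_norm values taken from three rows of the table.
ours = {
    "hendrycksTest-high_school_us_history": 0.25,
    "hendrycksTest-high_school_world_history": 0.7806,
    "hendrycksTest-high_school_geography": 0.8636,
}
public = {
    "hendrycksTest-high_school_us_history": 0.8922,
    "hendrycksTest-high_school_world_history": 0.8861,
    "hendrycksTest-high_school_geography": 0.8636,
}

rows = compare_scores(ours, public)
for task, o, p, diff in rows:
    print(f"{task:<50} {o:<8.4f} {p:<8.4f} {diff:.4f}")
```

Sorting by absolute difference puts the tasks that disagree most at the top, which makes it easy to see whether a gap is spread across all tasks or concentrated in a few.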

Resolved. There was a tokenizer issue on my end...

cody-bosonai changed discussion status to closed
