Clémentine committed on
Commit 698f471
1 Parent(s): ead4c96

removed drop

Files changed (2)
  1. src/display/about.py +4 -20
  2. src/display/utils.py +0 -4
src/display/about.py CHANGED
@@ -36,7 +36,6 @@ If there is no icon, we have not uploaded the information on the model yet, feel
  - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually at minimum a 6-shot task, as 6 examples are systematically prepended, even when it is launched with 0 as the number of few-shot examples.
  - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
  - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
- - <a href="https://arxiv.org/abs/1903.00161" target="_blank"> DROP </a> (3-shot) - English reading comprehension benchmark requiring Discrete Reasoning Over the content of Paragraphs.
 
  For all these evaluations, a higher score is a better score.
  We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
@@ -49,10 +48,10 @@ You can find:
 
  ## Reproducibility
  To reproduce our results, here are the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
- `python main.py --model=hf-causal --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
- ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=2 --output_path=<output_path>`
+ `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
+ ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
 
- The total batch size we get for models which fit on one A100 node is 16 (8 GPUs * 2). If you don't use parallelism, adapt your batch size to fit.
+ The total batch size we get for models which fit on one A100 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit.
  *You can expect results to vary slightly for different batch sizes because of padding.*
 
  The tasks and few-shot parameters are:
@@ -62,11 +61,9 @@ The tasks and few-shot parameters are:
  - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
  - Winogrande: 5-shot, *winogrande* (`acc`)
  - GSM8k: 5-shot, *gsm8k* (`acc`)
- - DROP: 3-shot, *drop* (`f1`)
 
  Side note on the baseline scores:
  - for log-likelihood evaluation, we select the random baseline
- - for DROP, we select the best submission score according to [their leaderboard](https://leaderboard.allenai.org/drop/submissions/public) when the paper came out (NAQANet score)
  - for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
 
  ## Quantization
@@ -203,17 +200,4 @@ CITATION_BUTTON_TEXT = r"""
    archivePrefix={arXiv},
    primaryClass={cs.CL}
  }
- @misc{DBLP:journals/corr/abs-1903-00161,
-   title={{DROP:} {A} Reading Comprehension Benchmark Requiring Discrete Reasoning
-          Over Paragraphs},
-   author={Dheeru Dua and
-           Yizhong Wang and
-           Pradeep Dasigi and
-           Gabriel Stanovsky and
-           Sameer Singh and
-           Matt Gardner},
-   year={2019},
-   eprinttype={arXiv},
-   eprint={1903.00161},
-   primaryClass={cs.CL}
- }"""
+ """
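For reference, the updated harness invocation above can also be scripted. The sketch below is illustrative only and is not part of this commit: the `run_harness` wrapper and the example call are hypothetical, while the flags and values mirror the command shown in the diff.

```python
# Minimal sketch (assumption: run from the root of the pinned lm-evaluation-harness checkout).
import shlex
import subprocess


def run_harness(model: str, revision: str, tasks: str, num_fewshot: int, output_path: str) -> None:
    """Assemble and launch the harness command described in the README text above."""
    cmd = [
        "python", "main.py",
        "--model=hf-causal-experimental",
        f"--model_args=pretrained={model},use_accelerate=True,revision={revision}",
        f"--tasks={tasks}",
        f"--num_fewshot={num_fewshot}",
        "--batch_size=1",  # per-GPU batch size; 8 GPUs * 1 = total batch size of 8 on one A100 node
        f"--output_path={output_path}",
    ]
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)


# Hypothetical example: 5-shot GSM8k on a model revision of your choice.
# run_harness("<your_model>", "<your_model_revision>", "gsm8k", 5, "results/gsm8k")
```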
src/display/utils.py CHANGED
@@ -20,7 +20,6 @@ class Tasks(Enum):
      truthfulqa = Task("truthfulqa:mc", "mc2", "TruthfulQA")
      winogrande = Task("winogrande", "acc", "Winogrande")
      gsm8k = Task("gsm8k", "acc", "GSM8K")
-     drop = Task("drop", "f1", "DROP")
 
  # These classes are for user facing column names,
  # to avoid having to change them all around the code
@@ -79,7 +78,6 @@ baseline_row = {
      AutoEvalColumn.truthfulqa.name: 25.0,
      AutoEvalColumn.winogrande.name: 50.0,
      AutoEvalColumn.gsm8k.name: 0.21,
-     AutoEvalColumn.drop.name: 0.47,
      AutoEvalColumn.dummy.name: "baseline",
      AutoEvalColumn.model_type.name: "",
  }
@@ -89,7 +87,6 @@ baseline_row = {
  # HellaSwag human baseline is 0.95 (source: https://deepgram.com/learn/hellaswag-llm-benchmark-guide)
  # MMLU human baseline is 0.898 (source: https://openreview.net/forum?id=d7KBjmI3GmQ)
  # TruthfulQA human baseline is 0.94(source: https://arxiv.org/pdf/2109.07958.pdf)
- # Drop: https://leaderboard.allenai.org/drop/submissions/public
  # Winogrande: https://leaderboard.allenai.org/winogrande/submissions/public
  # GSM8K: paper
  # Define the human baselines
@@ -104,7 +101,6 @@ human_baseline_row = {
      AutoEvalColumn.truthfulqa.name: 94.0,
      AutoEvalColumn.winogrande.name: 94.0,
      AutoEvalColumn.gsm8k.name: 100,
-     AutoEvalColumn.drop.name: 96.42,
      AutoEvalColumn.dummy.name: "human_baseline",
      AutoEvalColumn.model_type.name: "",
  }
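After this change, DROP no longer appears in the task registry or the baseline rows of `src/display/utils.py`. Below is a minimal, self-contained sketch of the resulting `Tasks` enum; the `Task` dataclass is an assumption standing in for the project's own definition, which this diff does not show, and earlier enum members are elided.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Task:
    benchmark: str  # task name passed to the evaluation harness
    metric: str     # metric key read from the harness results
    col_name: str   # column name displayed on the leaderboard


class Tasks(Enum):
    # ... earlier members unchanged by this commit and not shown in the diff ...
    truthfulqa = Task("truthfulqa:mc", "mc2", "TruthfulQA")
    winogrande = Task("winogrande", "acc", "Winogrande")
    gsm8k = Task("gsm8k", "acc", "GSM8K")
    # drop = Task("drop", "f1", "DROP")  # removed by this commit
```

The matching `AutoEvalColumn.drop` entries are likewise removed from `baseline_row` and `human_baseline_row`, as the hunks above show.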