lvwerra (HF staff) committed
Commit e347d8a
1 Parent(s): f9591cd

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +110 -5
  2. app.py +6 -0
  3. competition_math.py +95 -0
  4. requirements.txt +4 -0
README.md CHANGED
@@ -1,12 +1,117 @@
  ---
- title: Competition_math
- emoji: 👀
- colorFrom: red
- colorTo: pink
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: Competition MATH
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

+ # Metric Card for Competition MATH
+
+ ## Metric description
+
+ This metric is used to assess performance on the [Mathematics Aptitude Test of Heuristics (MATH) dataset](https://huggingface.co/datasets/competition_math).
+
+ It first canonicalizes the inputs (e.g., converting `1/2` to `\\frac{1}{2}`) and then computes accuracy.
+
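+ Under the hood, the equivalence check comes from the `math_equivalence` package (see `competition_math.py` in this Space). A minimal sketch of calling it directly, assuming the package is installed as described under "How to use" below:
+
+ ```python
+ # Sketch only: math_equivalence is installed via
+ #   pip install git+https://github.com/hendrycks/math.git
+ import math_equivalence
+
+ prediction, reference = "1/2", "\\frac{1}{2}"
+ # is_equiv is truthy when the two expressions match after canonicalization
+ if math_equivalence.is_equiv(prediction, reference):
+     print("equivalent")
+ ```
+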
+ ## How to use
+
+ This metric takes two arguments:
+
+ `predictions`: a list of predictions to score. Each prediction is a string that contains natural language and LaTeX.
+
+ `references`: a list of references, one for each prediction. Each reference is a string that contains natural language and LaTeX.
+
+
+ ```python
+ >>> from evaluate import load
+ >>> math = load("competition_math")
+ >>> references = ["\\frac{1}{2}"]
+ >>> predictions = ["1/2"]
+ >>> results = math.compute(references=references, predictions=predictions)
+ ```
+
+ N.B. To be able to use Competition MATH, you need to install the `math_equivalence` dependency using `pip install git+https://github.com/hendrycks/math.git`.
+
+
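+ Like other `evaluate` modules, predictions can also be accumulated batch by batch before computing the final score. A minimal sketch, assuming the generic `add_batch`/`compute` interface and purely illustrative inputs:
+
+ ```python
+ >>> from evaluate import load
+ >>> math = load("competition_math")
+ >>> # Illustrative batches of (predictions, references); in practice these
+ >>> # would come from a model and the dataset, respectively.
+ >>> batches = [(["1/2"], ["\\frac{1}{2}"]), (["3/4"], ["\\frac{3}{4}"])]
+ >>> for preds, refs in batches:
+ ...     math.add_batch(predictions=preds, references=refs)
+ >>> results = math.compute()
+ ```
+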
+ ## Output values
+
+ This metric returns a dictionary that contains the [accuracy](https://huggingface.co/metrics/accuracy) after canonicalizing inputs, on a scale between 0.0 and 1.0.
+
+ ### Values from popular papers
+ The [original MATH dataset paper](https://arxiv.org/abs/2103.03874) reported accuracies ranging from 3.0% to 6.9%, achieved by different large language models.
+
+ More recent progress on the dataset can be found on the [dataset leaderboard](https://paperswithcode.com/sota/math-word-problem-solving-on-math).
+
+ ## Examples
+
+ Maximal values (full match):
+
+ ```python
+ >>> from evaluate import load
+ >>> math = load("competition_math")
+ >>> references = ["\\frac{1}{2}"]
+ >>> predictions = ["1/2"]
+ >>> results = math.compute(references=references, predictions=predictions)
+ >>> print(results)
+ {'accuracy': 1.0}
+ ```
+
+ Minimal values (no match):
+
+ ```python
+ >>> from evaluate import load
+ >>> math = load("competition_math")
+ >>> references = ["\\frac{1}{2}"]
+ >>> predictions = ["3/4"]
+ >>> results = math.compute(references=references, predictions=predictions)
+ >>> print(results)
+ {'accuracy': 0.0}
+ ```
+
+ Partial match:
+
+ ```python
+ >>> from evaluate import load
+ >>> math = load("competition_math")
+ >>> references = ["\\frac{1}{2}", "\\frac{3}{4}"]
+ >>> predictions = ["1/5", "3/4"]
+ >>> results = math.compute(references=references, predictions=predictions)
+ >>> print(results)
+ {'accuracy': 0.5}
+ ```
+
+ ## Limitations and bias
+
+ This metric is limited to datasets with the same format as the [Mathematics Aptitude Test of Heuristics (MATH) dataset](https://huggingface.co/datasets/competition_math), and is meant to evaluate the performance of large language models at solving mathematical problems.
+
+ N.B. The MATH dataset also assigns levels of difficulty to different problems, so disaggregating model performance by difficulty level (similarly to what was done in the [original paper](https://arxiv.org/abs/2103.03874)) can give a better indication of how a given model performs at each level of difficulty than overall accuracy alone.
+
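+ A minimal sketch of such a per-level breakdown, assuming the `competition_math` dataset exposes a `level` field (e.g. "Level 1" through "Level 5") and that model predictions are already aligned with the references:
+
+ ```python
+ from collections import defaultdict
+ from evaluate import load
+
+ math = load("competition_math")
+
+ def accuracy_by_level(predictions, references, levels):
+     """Score aligned predictions/references separately for each difficulty level."""
+     grouped = defaultdict(lambda: ([], []))
+     for pred, ref, level in zip(predictions, references, levels):
+         grouped[level][0].append(pred)
+         grouped[level][1].append(ref)
+     return {
+         level: math.compute(predictions=preds, references=refs)["accuracy"]
+         for level, (preds, refs) in grouped.items()
+     }
+ ```
+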
+ ## Citation
+
+ ```bibtex
+ @article{hendrycksmath2021,
+ title={Measuring Mathematical Problem Solving With the MATH Dataset},
+ author={Dan Hendrycks
+ and Collin Burns
+ and Saurav Kadavath
+ and Akul Arora
+ and Steven Basart
+ and Eric Tang
+ and Dawn Song
+ and Jacob Steinhardt},
+ journal={arXiv preprint arXiv:2103.03874},
+ year={2021}
+ }
+ ```
+
+ ## Further References
+ - [MATH dataset](https://huggingface.co/datasets/competition_math)
+ - [MATH leaderboard](https://paperswithcode.com/sota/math-word-problem-solving-on-math)
+ - [MATH paper](https://arxiv.org/abs/2103.03874)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("competition_math")
+ launch_gradio_widget(module)
competition_math.py ADDED
@@ -0,0 +1,95 @@
+ # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """Accuracy metric for the Mathematics Aptitude Test of Heuristics (MATH) dataset."""
+
+ import datasets
+ import math_equivalence  # From: git+https://github.com/hendrycks/math.git
+
+ import evaluate
+
+
+ _CITATION = """\
+ @article{hendrycksmath2021,
+ title={Measuring Mathematical Problem Solving With the MATH Dataset},
+ author={Dan Hendrycks
+ and Collin Burns
+ and Saurav Kadavath
+ and Akul Arora
+ and Steven Basart
+ and Eric Tang
+ and Dawn Song
+ and Jacob Steinhardt},
+ journal={arXiv preprint arXiv:2103.03874},
+ year={2021}
+ }
+ """
+
+
+ _DESCRIPTION = """\
+ This metric is used to assess performance on the Mathematics Aptitude Test of Heuristics (MATH) dataset.
+ It first canonicalizes the inputs (e.g., converting "1/2" to "\\frac{1}{2}") and then computes accuracy.
+ """
+
+
+ _KWARGS_DESCRIPTION = r"""
+ Calculates accuracy after canonicalizing inputs.
+
+ Args:
+     predictions: list of predictions to score. Each prediction
+         is a string that contains natural language and LaTeX.
+     references: list of references, one for each prediction. Each
+         reference is a string that contains natural language
+         and LaTeX.
+ Returns:
+     accuracy: accuracy after canonicalizing inputs
+         (e.g., converting "1/2" to "\\frac{1}{2}")
+
+ Examples:
+     >>> metric = evaluate.load("competition_math")
+     >>> results = metric.compute(references=["\\frac{1}{2}"], predictions=["1/2"])
+     >>> print(results)
+     {'accuracy': 1.0}
+ """
+
+
+ @datasets.utils.file_utils.add_end_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class CompetitionMathMetric(evaluate.EvaluationModule):
+     """Accuracy metric for the MATH dataset."""
+
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string"),
+                     "references": datasets.Value("string"),
+                 }
+             ),
+             # Homepage of the metric for documentation
+             homepage="https://github.com/hendrycks/math",
+             # Additional links to the codebase or references
+             codebase_urls=["https://github.com/hendrycks/math"],
+         )
+
+     def _compute(self, predictions, references):
+         """Returns the scores."""
+         n_correct = 0.0
+         for pred, ref in zip(predictions, references):
+             # A prediction counts as correct if it is mathematically
+             # equivalent to its reference after canonicalization.
+             n_correct += 1.0 if math_equivalence.is_equiv(pred, ref) else 0.0
+         accuracy = n_correct / len(predictions)
+         return {
+             "accuracy": accuracy,
+         }
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ git+https://github.com/hendrycks/math.git