Muennighoff committed · Commit 87bb18f · 1 Parent(s): 250a18a

Update README.md

Files changed (1)
  1. README.md +20 -52
README.md CHANGED
@@ -11,10 +11,8 @@ tags:
11
  - evaluate
12
  - metric
13
  description: >-
14
- This metric implements the evaluation harness for the HumanEval problem
15
- solving dataset described in the paper "Evaluating Large Language Models
16
- Trained on Code" (https://arxiv.org/abs/2107.03374).
17
- duplicated_from: evaluate-metric/code_eval
18
  ---
19
 
20
  # Metric Card for Code Eval
@@ -23,7 +21,7 @@ duplicated_from: evaluate-metric/code_eval
23
 
24
  The CodeEval metric estimates the pass@k metric for code synthesis.
25
 
26
- It implements the evaluation harness for the HumanEval problem solving dataset described in the paper ["Evaluating Large Language Models Trained on Code"](https://arxiv.org/abs/2107.03374).
27
 
28
 
29
  ## How to use
@@ -40,12 +38,16 @@ The Code Eval metric calculates how good are predictions given a set of referenc
40
 
41
  `timeout`: The maximum time taken to produce a prediction before it is considered a "timeout". The default value is `3.0` (i.e. 3 seconds).
42
 
43
  ```python
44
  from evaluate import load
45
- code_eval = load("code_eval")
46
  test_cases = ["assert add(2,3)==5"]
47
  candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
48
- pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])
49
  ```
50
 
51
  N.B.
@@ -63,21 +65,16 @@ The Code Eval metric outputs two things:
63
 
64
  `results`: a dictionary with granular results of each unit test.
65
 
66
- ### Values from popular papers
67
- The [original CODEX paper](https://arxiv.org/pdf/2107.03374.pdf) reported that the CODEX-12B model had a pass@k score of 28.8% at `k=1`, 46.8% at `k=10` and 72.3% at `k=100`. However, since the CODEX model is not open source, it is hard to verify these numbers.
68
-
69
-
70
-
71
  ## Examples
72
 
73
  Full match at `k=1`:
74
 
75
  ```python
76
  from evaluate import load
77
- code_eval = load("code_eval")
78
  test_cases = ["assert add(2,3)==5"]
79
  candidates = [["def add(a, b): return a+b"]]
80
- pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
81
  print(pass_at_k)
82
  {'pass@1': 1.0}
83
  ```
@@ -86,10 +83,10 @@ No match for k = 1:
86
 
87
  ```python
88
  from evaluate import load
89
- code_eval = load("code_eval")
90
  test_cases = ["assert add(2,3)==5"]
91
  candidates = [["def add(a,b): return a*b"]]
92
- pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
93
  print(pass_at_k)
94
  {'pass@1': 0.0}
95
  ```
@@ -98,50 +95,21 @@ Partial match at k=1, full match at k=2:
98
 
99
  ```python
100
  from evaluate import load
101
- code_eval = load("code_eval")
102
  test_cases = ["assert add(2,3)==5"]
103
  candidates = [["def add(a, b): return a+b", "def add(a,b): return a*b"]]
104
- pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])
105
  print(pass_at_k)
106
  {'pass@1': 0.5, 'pass@2': 1.0}
107
  ```
108
 
109
- ## Limitations and bias
110
-
111
- As per the warning included in the metric code itself:
112
- > This program exists to execute untrusted model-generated code. Although it is highly unlikely that model-generated code will do something overtly malicious in response to this test suite, model-generated code may act destructively due to a lack of model capability or alignment. Users are strongly encouraged to sandbox this evaluation suite so that it does not perform destructive actions on their host or network. For more information on how OpenAI sandboxes its code, see the accompanying paper. Once you have read this disclaimer and taken appropriate precautions, uncomment the following line and proceed at your own risk:
113
-
114
- More information about the limitations of the code can be found on the [Human Eval Github repository](https://github.com/openai/human-eval).
115
-
116
  ## Citation
117
 
118
  ```bibtex
119
- @misc{chen2021evaluating,
120
- title={Evaluating Large Language Models Trained on Code},
121
- author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan \
122
- and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards \
123
- and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray \
124
- and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf \
125
- and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray \
126
- and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser \
127
- and Mohammad Bavarian and Clemens Winter and Philippe Tillet \
128
- and Felipe Petroski Such and Dave Cummings and Matthias Plappert \
129
- and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss \
130
- and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak \
131
- and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain \
132
- and William Saunders and Christopher Hesse and Andrew N. Carr \
133
- and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa \
134
- and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati \
135
- and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei \
136
- and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
137
- year={2021},
138
- eprint={2107.03374},
139
- archivePrefix={arXiv},
140
- primaryClass={cs.LG}
141
  }
142
  ```
143
-
144
- ## Further References
145
-
146
- - [Human Eval Github repository](https://github.com/openai/human-eval)
147
- - [OpenAI Codex website](https://openai.com/blog/openai-codex/)
 
11
  - evaluate
12
  - metric
13
  description: >-
14
+ This metric implements code evaluation with execution across multiple languages as used in the paper "OctoPack: Instruction Tuning
15
+ Code Large Language Models" (https://arxiv.org/abs/2308.07124).
16
  ---
17
 
18
  # Metric Card for Code Eval
 
21
 
22
  The CodeEval metric estimates the pass@k metric for code synthesis.
23
 
24
+ It implements the code execution for HumanEvalPack as described in the paper ["OctoPack: Instruction Tuning Code Large Language Models"](https://arxiv.org/abs/2308.07124).
25
 
26
 
27
  ## How to use
 
38
 
39
  `timeout`: The maximum time taken to produce a prediction before it is considered a "timeout". The default value is `3.0` (i.e. 3 seconds).
40
 
41
+ `language`: The language in which to execute the code. The default value is `python`; the alternatives are `javascript`, `java`, `go`, `cpp`, and `rust`.
42
+
43
+ `cargo_string`: The cargo installations to perform for Rust. Defaults to some basic packages, see `code_eval_octopack.py`.
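
As a rough illustration of what a `timeout` like the one described above means in practice, the sketch below runs a candidate program together with its test in a separate Python process and reports a timeout if it does not finish in time. This is not the metric's own implementation: the helper name `run_with_timeout` and the returned status strings are hypothetical and chosen only for this example.

```python
import subprocess
import sys


def run_with_timeout(program: str, timeout: float = 3.0) -> str:
    """Run `program` in a fresh Python process (hypothetical helper).

    Returns "passed" if the process exits cleanly, "failed" if it exits
    with a non-zero status (e.g. a failing assert), and "timed out" if it
    does not finish within `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timed out"
    return "passed" if proc.returncode == 0 else "failed"


# Candidate and test case joined into one program, as in the card's examples.
candidate = "def add(a, b): return a+b"
test_case = "assert add(2,3)==5"
print(run_with_timeout(candidate + "\n" + test_case))
```

A separate process is used here so that an infinite loop in generated code can be killed from outside. This gives no sandboxing, so untrusted code should still be isolated by other means.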
44
+
45
  ```python
46
  from evaluate import load
47
+ code_eval = load("Muennighoff/code_eval_octopack")
48
  test_cases = ["assert add(2,3)==5"]
49
  candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
50
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2], language="python")
51
  ```
52
 
53
  N.B.
 
65
 
66
  `results`: a dictionary with granular results of each unit test.
67
 
68
  ## Examples
69
 
70
  Full match at `k=1`:
71
 
72
  ```python
73
  from evaluate import load
74
+ code_eval = load("Muennighoff/code_eval_octopack")
75
  test_cases = ["assert add(2,3)==5"]
76
  candidates = [["def add(a, b): return a+b"]]
77
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1], language="python")
78
  print(pass_at_k)
79
  {'pass@1': 1.0}
80
  ```
 
83
 
84
  ```python
85
  from evaluate import load
86
+ code_eval = load("Muennighoff/code_eval_octopack")
87
  test_cases = ["assert add(2,3)==5"]
88
  candidates = [["def add(a,b): return a*b"]]
89
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1], language="python")
90
  print(pass_at_k)
91
  {'pass@1': 0.0}
92
  ```
 
95
 
96
  ```python
97
  from evaluate import load
98
+ code_eval = load("Muennighoff/code_eval_octopack")
99
  test_cases = ["assert add(2,3)==5"]
100
  candidates = [["def add(a, b): return a+b", "def add(a,b): return a*b"]]
101
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2], language="python")
102
  print(pass_at_k)
103
  {'pass@1': 0.5, 'pass@2': 1.0}
104
  ```
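
The `0.5` and `1.0` values in this example are consistent with the unbiased pass@k estimator from the HumanEval paper, `1 - C(n - c, k) / C(n, k)`, where `n` is the number of candidates for a problem and `c` is how many of them pass. The sketch below is a minimal standalone version of that formula. It assumes the metric follows this estimator, and the function name `pass_at_k` is chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total number of candidates generated
    c: number of candidates that pass the test
    k: number of candidates drawn
    """
    if k > n:
        raise ValueError("k must not exceed the number of candidates n")
    if n - c < k:
        # Fewer than k failing candidates: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Two candidates, one of which passes, as in the example above.
print(pass_at_k(n=2, c=1, k=1))  # 0.5
print(pass_at_k(n=2, c=1, k=2))  # 1.0
```

With several problems, the per-problem estimates would be averaged to give the reported score.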
105
 
106
  ## Citation
107
 
108
  ```bibtex
109
+ @article{muennighoff2023octopack,
110
+ title={OctoPack: Instruction Tuning Code Large Language Models},
111
+ author={Niklas Muennighoff and Qian Liu and Armel Zebaze and Qinkai Zheng and Binyuan Hui and Terry Yue Zhuo and Swayam Singh and Xiangru Tang and Leandro von Werra and Shayne Longpre},
112
+ journal={arXiv preprint arXiv:2308.07124},
113
+ year={2023}
114
  }
115
  ```