---
title: BabelCode Eval
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric implements the evaluation harness for datasets translated with the
  BabelCode framework as described in the paper "Measuring The Impact Of
  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
---

# Metric Card for bc_eval


## Metric Description
This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).

## How to Use
1. Generate predictions for a BabelCode-supported dataset.
2. Aggregate the predictions by their question, so that each question maps to a list of candidate programs (see the sketch after the example below).
3. For each question, pair the aggregated predictions with the `question_info` from the original BabelCode dataset.
4. Run the metric on the `predictions`, `languages`, and `question_infos`.
5. The metric returns a tuple: the first value is a dictionary of metrics and the second is the per-prediction results.

```Python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])
    # Replace this with however you generate and postprocess predictions;
    # each entry of `predictions` must be a list of candidate program strings.
    predictions.append(model.generate(row['signature_with_docstring']))
metric = evaluate.load("gabeorlanski/bc_eval")
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
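
If you draw several samples per question (for example, to report `pass@10`), aggregate them before calling `compute` so that each inner list of `predictions` holds every candidate for its question. A minimal sketch, continuing from the example above (reusing `ds` and `metric`) and using a hypothetical `generate_one` stand-in for your own sampling and postprocessing:

```python
NUM_SAMPLES = 10  # draw at least max(k) candidates per question

predictions, languages, question_infos = [], [], []
for row in ds:
    languages.append(row["language"])
    question_infos.append(row["question_info"])
    # One inner list per question, containing all of its candidate programs.
    predictions.append(
        [generate_one(row["signature_with_docstring"]) for _ in range(NUM_SAMPLES)]
    )

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1, 10]
)
```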

### Inputs
* `predictions` (`List[List[str]]`): The list of candidate programs to execute for each question.
* `languages` (`List[str]`): The language to use for each question.
* `question_dicts` (`List[Dict]`): The `question_info` dictionary for each question.
* `k` (`List[int]`): The number of code candidates to consider in the evaluation (default: `[1, 10, 100]`).
* `num_workers` (`int`): The number of workers used to evaluate the candidate programs (default: `4`).
* `language_timeout` (`Dict[str, int]`): Timeouts to use for each language. If not set, the timeout from each question dict is used (default: `None`).
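
For example, a minimal sketch of overriding the defaults (assuming the keys of `language_timeout` match the names passed in `languages`):

```python
# Continuing from the usage example above: report pass@1 and pass@10, use more
# workers, and override the per-question timeout for Python submissions.
metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1, 10],
    num_workers=8,
    language_timeout={"Python": 15},
)
```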

### Output Values

The `bc_eval` metric outputs two things:

`metrics`: a dictionary with the pass rate for each `k` value passed in the arguments, along with the mean percentage of test cases passed per question. The keys are formatted as `{LANGUAGE NAME}/{METRIC NAME}`, e.g. `Python/pass@1` and `Python/mean_pct_pass`.

`results`: a list of dictionaries with the execution results for each individual prediction.
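
The `pass@k` values are presumably computed with the standard unbiased estimator popularized by the Codex paper (Chen et al., 2021); a minimal sketch of that estimator for a single question with `n` generated candidates, `c` of which pass:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n candidates (c of which pass) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(pass_at_k(n=10, c=3, k=1))  # 0.3
```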

#### Values from Popular Papers
[PaLM-2](https://arxiv.org/pdf/2305.10403.pdf) Performance on BC-HumanEval (`pass@1` with greedy decoding):

| Language   | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
|------------|-----------|-----------|-----------------|
| C#         | 24.22     | 20.5      | **26.09**           |
| C++        | **34.16**     | 21.74     | 24.22           |
| Go         | 19.25     | 13.66     | **21.12**           |
| Haskell    | **8.7**       | 1.86      | 1.86            |
| Java       | **31.06**     | 20.5      | 25.47           |
| JavaScript | **32.3**      | 23.6      | 29.81           |
| Julia      | **16.77**     | 2.48      | 4.35            |
| Lua        | **26.09**     | 19.25     | 24.84           |
| PHP        | **26.09**     | 18.63     | 25.47           |
| Python     | **34.16**     | 17.39     | 26.71           |
| Rust       | **28.57**     | 16.15     | 22.98           |
| TypeScript | **32.3**      | 17.39     | 30.43           |


### Examples
Full examples with predictions that pass, fail tests, time out, and raise an error.

#### Passing Example
```Python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset("gabeorlanski/bc-humaneval", "Python", split="test")
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
```
`results` is:
```
[
    {
        "qid": 0,
        "idx": "0",
        "file_path": ".../tmpqt_p3dwn/0",
        "results": [
            {
                "return_code": 0,
                "runtime": 0.076369,
                "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n",
                "stderr": "",
                "timed_out": false,
            }
        ],
        "failed": false,
        "timed_out": false,
        "test_cases": {
            "0": "PASSED",
            "1": "PASSED",
            "2": "PASSED",
            "3": "PASSED",
            "4": "PASSED",
            "5": "PASSED",
            "6": "PASSED",
        },
        "outcome": "PASSED",
    }
]
```


#### Fails Test Example

```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
        "gabeorlanski/bc-humaneval", "Python", split="test"
    )
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
```

Note that the per-test-case outcomes are located in the `test_cases` field of each entry in `results`.
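
For instance, a quick sketch that pulls out the failing test cases for each prediction (field names taken from the sample output above):

```python
for res in results:
    failed_cases = [tc for tc, outcome in res["test_cases"].items() if outcome != "PASSED"]
    print(f"prediction {res['idx']} (question {res['qid']}): {res['outcome']}, "
          f"failed tests: {failed_cases}")
```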

#### Timeout Example

```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
        "gabeorlanski/bc-humaneval", "Python", split="test"
    )
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
    """
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
```

#### Error Example

```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
        "gabeorlanski/bc-humaneval", "Python", split="test"
    )
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
    """,
    """def add(a, b):
    return a+b"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmpjdn51aaa/0", "results": [{"return_code": 0, "runtime": 0.102855, "stdout": "TEST-0...ValueError\r\nTEST-1...ValueError\r\nTEST-2...ValueError\r\nTEST-3...ValueError\r\nTEST-4...ValueError\r\nTEST-5...ValueError\r\nTEST-6...ValueError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "ValueError", "1": "ValueError", "2": "ValueError", "3": "ValueError", "4": "ValueError", "5": "ValueError", "6": "ValueError"}, "outcome": "HAD_ERROR"}, 
{"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}]
```

## Limitations and Bias
This metric requires that the dataset be BabelCode compatible, i.e., translated with the BabelCode framework so that each question provides a `question_info`.

## Citation
```
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
```