---
title: code_eval_outputs
datasets: 
- giulio98/xlcost-single-prompt
tags:
- evaluate
- metric
description:
- This metric implements the evaluation harness for the HumanEval problem solving dataset
  described in the paper "Evaluating Large Language Models Trained on Code" 
  (https://arxiv.org/abs/2107.03374). Instead of evaluating assertions, it compares the output of the generated code with the expected output.
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

# Metric Card for code_eval_outputs


## Metric Description
This metric is based on [code_eval](https://huggingface.co/spaces/evaluate-metric/code_eval). Instead of evaluating the functional correctness of a generated program through assertions in the form of unit tests, it compares the output of the generated program with the expected output. For more details, please refer to [code_eval](https://huggingface.co/spaces/evaluate-metric/code_eval).
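The idea of comparing program output rather than running assertions can be sketched as follows. This is a minimal illustration, not the metric's actual implementation: the helper name `run_and_compare` is hypothetical, and it assumes that a non-zero exit code counts as a failure and that surrounding whitespace is ignored when comparing outputs. Like the metric itself, it executes arbitrary code and should only be run inside a sandbox.

```python
import subprocess
import sys


def run_and_compare(candidate: str, reference: str, expected: str,
                    timeout: float = 30.0) -> bool:
    """Run candidate code followed by the reference call in a fresh
    interpreter and compare what it prints with the expected output.

    Hypothetical helper for illustration only.
    """
    program = candidate + "\n\n" + reference
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The candidate ran longer than the allowed time.
        return False
    if completed.returncode != 0:
        # Assumption: a crash or uncaught exception counts as a failure.
        return False
    # Assumption: leading/trailing whitespace is ignored in the comparison.
    return completed.stdout.strip() == expected.strip()


reference = 'if __name__ == "__main__":\n    print(add(2, 3))'
print(run_and_compare("def add(a, b):\n    return a+b", reference, "5"))
print(run_and_compare("def add(a,b):\n    return a*b", reference, "5"))
```

The first call prints `True` because the program writes `5` to stdout; the second prints `False` because it writes `6`.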

## How to Use
The metric measures how well the predictions perform against a set of references and their expected outputs. Its arguments are:

`predictions`: a list of candidates to evaluate. Each candidate should be a list of strings with several code candidates to solve the problem.

`references`: a list with a **function call** for each prediction. Each **function call** should print a string to stdout.

`output`: a list of the expected output for each prediction.

`k`: number of code candidates to consider in the evaluation. The default value is `[1, 10, 100]`.

`num_workers`: the number of workers used to evaluate the candidate programs. The default value is `4`.

`timeout`: the maximum time a candidate program is allowed to run before it is considered a "timeout". The default value is `30.0` (i.e. 30 seconds).

```python
from evaluate import load
code_eval_outputs = load("giulio98/code_eval_outputs")
references = ["if __name__ == \"__main__\":\n    print(add(2, 3))"]
expected_outputs = ["5"]
candidates = [["def add(a,b):\n    return a*b", "def add(a, b):\n    return a+b"]]
pass_at_k, results = code_eval_outputs.compute(references=references, predictions=candidates, output=expected_outputs, k=[1, 2])
print(pass_at_k)
print(results)
```

Output:
```python
{'pass@1': 0.5, 'pass@2': 1.0}
defaultdict(list,
            {0: [(0,
               {'task_id': 0,
                'passed': False,
                'result': 'not passed',
                'completion_id': 0}),
              (1,
               {'task_id': 0,
                'passed': True,
                'result': 'passed',
                'completion_id': 1})]})
```
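To pick out the failing candidates from `results`, a small helper like the one below may be enough. It is a sketch that assumes the structure shown in the example output (a mapping from task index to a list of `(completion_id, outcome)` pairs); the function name `failed_completions` is hypothetical and not part of the metric.

```python
from collections import defaultdict

# Sample data copied from the example output above.
results = defaultdict(list, {
    0: [
        (0, {"task_id": 0, "passed": False,
             "result": "not passed", "completion_id": 0}),
        (1, {"task_id": 0, "passed": True,
             "result": "passed", "completion_id": 1}),
    ]
})


def failed_completions(results):
    """Return (task_id, completion_id, result) for every candidate
    that did not pass. Hypothetical helper for illustration."""
    failures = []
    for task_id, entries in sorted(results.items()):
        for completion_id, outcome in entries:
            if not outcome["passed"]:
                failures.append((task_id, completion_id, outcome["result"]))
    return failures


print(failed_completions(results))  # [(0, 0, 'not passed')]
```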

N.B.
This metric exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. Before running this metric and once you've taken the necessary precautions, you will need to set the `HF_ALLOW_CODE_EVAL` environment variable. Use it at your own risk:
```python
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
```

### Output Values

The Code Eval metric outputs two things:

`pass_at_k`: a dictionary with the pass rates for each k value defined in the arguments.

`results`: a dictionary with granular results for each candidate program, keyed by task index.
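For reference, the pass@k estimator defined in the cited paper is 1 - C(n-c, k) / C(n, k), where n is the number of candidates for a problem and c is the number that passed. The sketch below computes it for a single problem; the function name `pass_at_k` is chosen for illustration, and the metric's own code may compute the same quantity in a different, numerically equivalent way.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of candidates generated for the problem
    c: number of candidates that passed
    k: number of candidates considered
    """
    if n - c < k:
        # Fewer than k failing candidates: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# The usage example has n=2 candidates, of which c=1 passes.
print(pass_at_k(2, 1, 1))  # 0.5
print(pass_at_k(2, 1, 2))  # 1.0
```

These values agree with the `{'pass@1': 0.5, 'pass@2': 1.0}` reported in the usage example. Across several problems, the per-problem estimates are averaged.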

## Limitations and Bias
Refer to [code_eval](https://huggingface.co/spaces/evaluate-metric/code_eval)

## Citation
```bibtex
@misc{chen2021evaluating,
      title={Evaluating Large Language Models Trained on Code},
      author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan \
and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards \
and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray \
and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf \
and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray \
and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser \
and Mohammad Bavarian and Clemens Winter and Philippe Tillet \
and Felipe Petroski Such and Dave Cummings and Matthias Plappert \
and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss \
and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak \
and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain \
and William Saunders and Christopher Hesse and Andrew N. Carr \
and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa \
and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati \
and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei \
and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
      year={2021},
      eprint={2107.03374},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

## Further References
Refer to [code_eval](https://huggingface.co/spaces/evaluate-metric/code_eval)