loubnabnl (HF staff) committed on
Commit 78b2b7f
1 Parent(s): cbac22d
Files changed (1)
  1. evaluation/intro.txt +19 -1
evaluation/intro.txt CHANGED
@@ -16,7 +16,25 @@ In most papers, 200 candidate program completions are sampled, and pass@1, pass@
  |GPT-neo (1.5B)| 4.79% | 7.47% | 16.30% |
  |GPT-J (6B)| 11.62% | 15.74% | 27.74% |

- To better understand how pass@k metric works, we will illustrate it with some examples. We select two problems from the HumanEval dataset and see how the model performs and which code completions pass the unit tests. We will use CodeParrot 🦜 (110M) with the two problems below:
+ We can load the HumanEval dataset and the pass@k metric from the Hugging Face Hub:
+
+ ```python
+ from datasets import load_dataset, load_metric
+
+ # HumanEval problems and the code_eval metric that scores generated solutions against unit tests
+ human_eval = load_dataset("openai_humaneval")
+ code_eval_metric = load_metric("code_eval")
+ ```
+
+ We can then easily compute pass@k for a problem that asks for the implementation of a function that sums two integers:
+
+ ```python
+ # One unit test and two candidate completions: the first is wrong, the second is correct
+ test_cases = ["assert add(2,3)==5"]
+ candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
+ pass_at_k, results = code_eval_metric.compute(references=test_cases, predictions=candidates, k=[1, 2])
+ print(pass_at_k)
+ # {'pass@1': 0.5, 'pass@2': 1.0}
+ ```
+
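+ These numbers follow from the unbiased pass@k estimator introduced in the Codex paper: with n sampled candidates of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). As a quick sanity check, here is a minimal sketch of that estimator (the helper name `estimate_pass_at_k` is illustrative, and we assume this is the formula `code_eval` implements):
+
+ ```python
+ from math import comb
+
+ def estimate_pass_at_k(n, c, k):
+     """Unbiased estimate of pass@k: 1 - C(n-c, k) / C(n, k)."""
+     if n - c < k:
+         # Fewer than k failing candidates: any sample of k must contain a correct one
+         return 1.0
+     return 1.0 - comb(n - c, k) / comb(n, k)
+
+ # For the example above: n=2 candidates, c=1 passes the test
+ print(estimate_pass_at_k(2, 1, 1))  # 0.5
+ print(estimate_pass_at_k(2, 1, 2))  # 1.0
+ ```
+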
+ To better understand how the pass@k metric works, we will illustrate it with some concrete examples. We select two problems from the HumanEval dataset and see how CodeParrot 🦜 (110M) performs and which of its code completions pass the unit tests of the two problems below:
  #### Problem 1:

  ```python