We can load the HumanEval dataset and the pass@k metric from 🤗 [`datasets`](https://huggingface.co/docs/datasets/index) and 🤗 [`evaluate`](https://huggingface.co/docs/evaluate/index):

```python
from datasets import load_dataset
from evaluate import load

human_eval = load_dataset("openai_humaneval")
code_eval_metric = load("code_eval")
```

We can easily compute the pass@k for a problem that asks for the implementation of a function that sums two integers:

```python
import os

# code_eval executes model-generated code, so it must be enabled explicitly
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
pass_at_k, results = code_eval_metric.compute(references=test_cases, predictions=candidates, k=[1, 2])
print(pass_at_k)
# {'pass@1': 0.5, 'pass@2': 1.0}
```
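These scores follow from the unbiased pass@k estimator introduced in the Codex paper, which `code_eval` implements: given \\(n\\) generated samples per problem, of which \\(c\\) pass the unit tests,

$$\text{pass@}k = \underset{\text{problems}}{\mathbb{E}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

Here \\(n=2\\) and \\(c=1\\), so pass@1 \\(= 1 - \binom{1}{1}/\binom{2}{1} = 0.5\\) and pass@2 \\(= 1 - \binom{1}{2}/\binom{2}{2} = 1\\) (since \\(\binom{1}{2}=0\\)).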

To better understand how the pass@k metric works, we will illustrate it with a concrete example from the HumanEval dataset. We select the problem below and see how CodeParrot 🦜 (110M) performs and which code completions pass the unit tests:

**Problem:**

```python

def truncate_number(number: float) -> float:
    """ Given a positive floating point number, it can be decomposed into
    an integer part (largest integer smaller than given number) and decimals
    (leftover part always smaller than 1).

    Return the decimal part of the number.
    >>> truncate_number(3.5)
    0.5
    """
```

Instead of 200 candidate solutions, we generate only 20 samples for illustration purposes. We use nucleus (top-p) sampling with `p=0.95` and `temperature=0.2`, and sample tokens from the model until we encounter a stop sequence indicating the end of a method: `\nclass`, `\ndef`, `\n#`, `\nif`, or `\nprint`. For more details about decoding strategies for language generation, we recommend this [blog post](https://huggingface.co/blog/how-to-generate).
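To make this concrete, here is a minimal sampling sketch with 🤗 Transformers. It assumes the `codeparrot/codeparrot-small` checkpoint for the 110M model, a `max_new_tokens` budget of 128 (not specified in the post), and the dataset index of the problem above; the stop sequences are applied by truncating each decoded completion:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "codeparrot/codeparrot-small"  # assumed 110M checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = human_eval["test"][2]["prompt"]  # assumed index of the truncate_number problem

STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

def truncate_at_stop(completion: str) -> str:
    """Cut the completion at the earliest stop sequence, if any occurs."""
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            completion = completion[:idx]
    return completion

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.2,
    max_new_tokens=128,  # assumption: generation budget not stated in the post
    num_return_sequences=20,
    pad_token_id=tokenizer.eos_token_id,
)
prompt_len = inputs["input_ids"].shape[1]
completions = [
    truncate_at_stop(tokenizer.decode(out[prompt_len:], skip_special_tokens=True))
    for out in outputs
]
```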

**Remark**:

Regarding the temperature parameter, the authors of the [CodeGen](https://github.com/salesforce/CodeGen) paper observed that the best-performing temperature increases as the number of permitted samples k increases. When a model is only allowed a few samples to pass the unit tests, it is beneficial to use a low temperature so that the learned distribution selects candidates that are likely to pass. But when the model is allowed more attempts with a high k, using a higher sampling temperature to tilt the learned distribution lets it explore diverse samples and thus have a greater chance of synthesizing a correct program.


For our experiment, we compute pass@1, pass@10, and pass@20, each corresponding to the unit test pass rate when selecting 1, 10, and 20 samples, respectively, from the candidate solutions, as sketched below.
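Here is a minimal sketch of this evaluation, reusing the `completions` from the generation snippet above (the dataset index and the `check(...)` call pattern are assumptions based on the HumanEval record layout):

```python
problem = human_eval["test"][2]  # assumed index of the truncate_number problem

# Each candidate program is the original prompt followed by a completion.
candidate_programs = [[problem["prompt"] + c for c in completions]]

# The reference code runs the problem's unit tests against the entry point.
references = [problem["test"] + f"\ncheck({problem['entry_point']})"]

pass_at_k, results = code_eval_metric.compute(
    references=references,
    predictions=candidate_programs,
    k=[1, 10, 20],
)
print(pass_at_k)
```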

```
Results: {'pass@1': 0.1, 'pass@10': 0.7631, 'pass@20': 1.0}
```

If we take a closer look at the unit test results for each candidate solution, we find that 2 of the 20 candidates passed the unit tests. This corresponds to our pass@1 value of `2/20 = 0.1`. The pass@10 and pass@20 scores are higher, because the more samples we select from the candidate completions, the more likely we are to include a correct implementation. As for pass@20, it is `1`: if we select all 20 candidates, the set is guaranteed to contain a correct solution, which gives a 100% success rate.
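
Plugging \\(n = 20\\) and \\(c = 2\\) into the estimator above confirms these numbers:

$$\text{pass@}1 = 1 - \frac{\binom{18}{1}}{\binom{20}{1}} = 0.1,\qquad \text{pass@}10 = 1 - \frac{\binom{18}{10}}{\binom{20}{10}} = 1 - \frac{43758}{184756} \approx 0.763,\qquad \text{pass@}20 = 1 - \frac{\binom{18}{20}}{\binom{20}{20}} = 1$$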