---
title: BabelCode Eval
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric implements the evaluation harness for datasets translated with the
  BabelCode framework as described in the paper "Measuring The Impact Of
  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
---

# Metric Card for bc_eval

## Metric Description

This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).

## How to Use

1. Generate predictions for BabelCode-supported datasets.
2. Aggregate the predictions by their question; a sketch of this step is included at the top of the Examples section below.
3. For each question, pair the aggregated predictions with the `question_info` from the original BabelCode dataset.
4. Run the metric on the `predictions`, `languages`, and `question_infos`.
5. The metric returns a tuple whose first value is a dict of metrics and whose second value is the list of results for each prediction.

```Python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])

    # Replace this with however you generate and postprocess predictions.
    predictions.append(model.generate(row['signature_with_docstring']))

metric = evaluate.load("gabeorlanski/bc_eval")
metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1]
)
```

### Inputs

* `predictions` (`List[List[str]]`): The list of predictions for each question to execute.
* `languages` (`List[str]`): The language to use for each question.
* `question_dicts` (`List[Dict]`): The information for each question.
* `k` (`List[int]`): Number of code candidates to consider in the evaluation (Default: `[1, 10, 100]`).
* `num_workers` (`int`): Number of workers used to evaluate the candidate programs (Default: `4`).
* `language_timeout` (`Dict[str, int]`): Timeouts to use for each language. If not set, the timeout from the question dict is used (Default: `None`).

### Output Values

The `bc_eval` metric outputs two things:

`metrics`: a dictionary with the pass rates for each k value defined in the arguments and the mean percent of tests passed per question. The keys are formatted as `{LANGUAGE NAME}/{METRIC NAME}`.

`results`: a list of dictionaries with the results from each individual prediction.

#### Values from Popular Papers

[PaLM-2](https://arxiv.org/pdf/2305.10403.pdf) performance on BC-HumanEval (`pass@1` with greedy decoding):

| Language   | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
|------------|-----------|-----------|-----------------|
| C#         | 24.22     | 20.5      | **26.09**       |
| C++        | **34.16** | 21.74     | 24.22           |
| Go         | 19.25     | 13.66     | **21.12**       |
| Haskell    | **8.7**   | 1.86      | 1.86            |
| Java       | **31.06** | 20.5      | 25.47           |
| JavaScript | **32.3**  | 23.6      | 29.81           |
| Julia      | **16.77** | 2.48      | 4.35            |
| Lua        | **26.09** | 19.25     | 24.84           |
| PHP        | **26.09** | 18.63     | 25.47           |
| Python     | **34.16** | 17.39     | 26.71           |
| Rust       | **28.57** | 16.15     | 22.98           |
| TypeScript | **32.3**  | 17.39     | 30.43           |

### Examples

Full examples with inputs that fail tests, time out, have an error, and pass.
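#### Aggregating Multiple Samples Per Question (Sketch)

The quickstart above collects a single prediction per question, which only exercises `pass@1`. The sketch below shows the same loop with several candidates per question plus the optional `num_workers` and `language_timeout` arguments. It is illustrative only: `generate_samples` is a hypothetical stand-in for your own sampling and postprocessing, and the `language_timeout` keys are assumed to be language names as they appear in `languages`.

```Python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
for row in ds:
    languages.append(row["language"])
    question_infos.append(row["question_info"])

    # Hypothetical helper: sample several completions per prompt with your own
    # model and postprocessing, e.g. 10 candidates per question so that pass@1
    # and pass@10 are both defined.
    predictions.append(generate_samples(row["signature_with_docstring"], n=10))

metric = evaluate.load("gabeorlanski/bc_eval")
metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1, 10],
    num_workers=4,
    # Assumed format: language name -> timeout in seconds; when omitted, the
    # timeout from each question dict is used.
    language_timeout={"Python": 10},
)
```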
#### Passing Example

```Python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
example = ds[0]

metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False"""
]]

metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
```

`results` is:
```
[
    {
        "qid": 0,
        "idx": "0",
        "file_path": ".../tmpqt_p3dwn/0",
        "results": [
            {
                "return_code": 0,
                "runtime": 0.076369,
                "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n",
                "stderr": "",
                "timed_out": false
            }
        ],
        "failed": false,
        "timed_out": false,
        "test_cases": {
            "0": "PASSED",
            "1": "PASSED",
            "2": "PASSED",
            "3": "PASSED",
            "4": "PASSED",
            "5": "PASSED",
            "6": "PASSED"
        },
        "outcome": "PASSED"
    }
]
```

#### Fails Test Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval",
    "Python",
    split="test"
)
example = ds[0]

metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True

    return False"""
]]

metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
```

`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
```

Note that the individual test-case results are located in the `test_cases` field of each entry in `results`.
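Because each entry stores its per-test verdicts in `test_cases` and its overall verdict in `outcome`, the `results` list is easy to inspect programmatically. A minimal sketch, using only the field names shown in the outputs above:

```python
from collections import Counter

# Tally the outcome of every individual test case across all predictions,
# e.g. Counter({"PASSED": 4, "FAILED": 3}) for the failing run above.
test_outcomes = Counter(
    status
    for result in results
    for status in result["test_cases"].values()
)

# Tally the overall outcome of each prediction
# (e.g. PASSED, FAILED, TIMED_OUT, HAD_ERROR).
prediction_outcomes = Counter(result["outcome"] for result in results)

print(test_outcomes)
print(prediction_outcomes)
```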
#### Timeout Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval",
    "Python",
    split="test"
)
example = ds[0]

metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
"""
]]

metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```

`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
```

#### Error Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval",
    "Python",
    split="test"
)
example = ds[0]

metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
""",
"""def add(a, b):
    return a+b"""
]]

metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```

`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmpjdn51aaa/0", "results": [{"return_code": 0, "runtime": 0.102855, "stdout": "TEST-0...ValueError\r\nTEST-1...ValueError\r\nTEST-2...ValueError\r\nTEST-3...ValueError\r\nTEST-4...ValueError\r\nTEST-5...ValueError\r\nTEST-6...ValueError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "ValueError", "1": "ValueError", "2": "ValueError", "3": "ValueError", "4": "ValueError", "5": "ValueError", "6": "ValueError"}, "outcome": "HAD_ERROR"}, {"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}]
```

## Limitations and Bias

This metric requires that the dataset be BabelCode compatible.

## Citation

```
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
```