---
title: BabelCode Eval
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  This metric implements the evaluation harness for datasets translated with the
  BabelCode framework as described in the paper "Measuring The Impact Of
  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).

---

# Metric Card for bc_eval

## Metric Description

This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).

## How to Use

1. Generate predictions for BabelCode-supported datasets.
2. Aggregate the predictions by their question.
3. For each question's aggregated predictions, add the `question_info` from the original BabelCode dataset.
4. Run the metric on the predictions, languages, and question infos.
5. The metric returns a tuple whose first value is the dict of aggregate metrics and whose second value is the list of per-prediction results.

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])
    # Replace this with however you generate and postprocess predictions.
    # Each entry of `predictions` must be a list of candidate programs
    # for the corresponding question.
    predictions.append(model.generate(row['signature_with_docstring']))

metric = evaluate.load("gabeorlanski/bc_eval")
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

### Inputs

- `predictions` (`List[List[str]]`): The list of candidate programs to execute for each question.
- `languages` (`List[str]`): The language to use for each question.
- `question_dicts` (`List[Dict]`): The `question_info` dict for each question.
- `k` (`List[int]`): The numbers of code candidates to consider in the evaluation (default: `[1, 10, 100]`).
- `num_workers` (`int`): The number of workers used to evaluate the candidate programs (default: `4`).
- `language_timeout` (`Dict[str, int]`): Timeouts to use for each language. If not set, the timeout from each question dict is used (default: `None`).
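
As a sketch of how the optional arguments fit together, the call below reuses the `predictions`, `languages`, and `question_infos` built in the example above; the specific values passed to `k`, `num_workers`, and `language_timeout` are purely illustrative.

```python
# Illustrative values only; reuses predictions/languages/question_infos from above.
metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1, 10],                                   # report pass@1 and pass@10
    num_workers=8,                               # evaluate candidates with 8 workers
    language_timeout={"Python": 15, "C++": 30},  # override the per-question timeouts
)
```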

### Output Values

The bc_eval metric outputs two things:

- `metrics`: a dictionary with the pass rates for each `k` value defined in the arguments, plus the mean percentage of tests passed per question. The keys are formatted as `{LANGUAGE NAME}/{METRIC NAME}`, e.g. `Python/pass@1`.

- `results`: a list of dictionaries with the results from each individual prediction.
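
As a small, illustrative sketch of reading these outputs (using the key names documented above and shown in the examples below):

```python
# Aggregate metrics are keyed as "{LANGUAGE NAME}/{METRIC NAME}".
print(metrics["Python/pass@1"])
print(metrics["Python/mean_pct_pass"])

# Each entry of `results` describes one executed prediction.
for result in results:
    print(result["qid"], result["outcome"])
```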

### Values from Popular Papers

PaLM-2 Performance on BC-HumanEval (pass@1 with greedy decoding):

| Language   | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
|------------|-----------|-----------|-----------------|
| C#         | 24.22     | 20.5      | 26.09           |
| C++        | 34.16     | 21.74     | 24.22           |
| Go         | 19.25     | 13.66     | 21.12           |
| Haskell    | 8.7       | 1.86      | 1.86            |
| Java       | 31.06     | 20.5      | 25.47           |
| JavaScript | 32.3      | 23.6      | 29.81           |
| Julia      | 16.77     | 2.48      | 4.35            |
| Lua        | 26.09     | 19.25     | 24.84           |
| PHP        | 26.09     | 18.63     | 25.47           |
| Python     | 34.16     | 17.39     | 26.71           |
| Rust       | 28.57     | 16.15     | 22.98           |
| TypeScript | 32.3      | 17.39     | 30.43           |

### Examples

Below are full examples with predictions that pass, fail tests, time out, and raise an error.

#### Passing Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset("gabeorlanski/bc-humaneval", "Python", split="test")
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:

```json
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
```

`results` is:

```json
[
    {
        "qid": 0,
        "idx": "0",
        "file_path": ".../tmpqt_p3dwn/0",
        "results": [
            {
                "return_code": 0,
                "runtime": 0.076369,
                "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n",
                "stderr": "",
                "timed_out": false
            }
        ],
        "failed": false,
        "timed_out": false,
        "test_cases": {
            "0": "PASSED",
            "1": "PASSED",
            "2": "PASSED",
            "3": "PASSED",
            "4": "PASSED",
            "5": "PASSED",
            "6": "PASSED"
        },
        "outcome": "PASSED"
    }
]
```

#### Fails Test Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset("gabeorlanski/bc-humaneval", "Python", split="test")
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:

```json
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
```

`results` is:

```json
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
```

Note that the individual test-case results are located in the `test_cases` field of each entry in `results`.
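
For example, a small sketch (not part of the metric's API) that tallies the per-test outcomes from each entry of `results`:

```python
from collections import Counter

# Count the outcome of every test case (PASSED, FAILED, etc.) per prediction.
for result in results:
    outcome_counts = Counter(result["test_cases"].values())
    print(result["qid"], result["outcome"], dict(outcome_counts))
```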

#### Timeout Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset("gabeorlanski/bc-humaneval", "Python", split="test")
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
    """
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:

```json
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```

`results` is:

```json
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
```

#### Error Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset("gabeorlanski/bc-humaneval", "Python", split="test")
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
    """,
    """def add(a, b):
    return a+b"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:

```json
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```

`results` is:

```json
{"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}
```

## Limitations and Bias

This metric requires that the dataset be BabelCode-compatible, i.e. that it provides a `question_info` for each question.

## Citation

```bibtex
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
```