---
title: BabelCode Eval
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric implements the evaluation harness for datasets translated with the
  BabelCode framework as described in the paper "Measuring The Impact Of
  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
---
# Metric Card for bc_eval

## Metric Description
This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
## How to Use
1. Generate predictions for a BabelCode-supported dataset.
2. Aggregate the predictions by their question, so that each question has a list of candidate programs (see the aggregation sketch after the example below).
3. For each question, add the `question_info` from the original BabelCode dataset.
4. Run the metric on the `predictions`, `languages`, and `question_infos`.
5. The metric returns a tuple whose first element is a dict of metrics and whose second element is the per-prediction results.
```Python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])
    # Replace this with however you generate and postprocess predictions.
    # Each entry must be a *list* of candidate programs for the question.
    predictions.append([model.generate(row['signature_with_docstring'])])

metric = evaluate.load("gabeorlanski/bc_eval")
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
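If your generations are stored flat (one record per sampled program) rather than already grouped, step 2 amounts to collecting them into per-question lists before calling the metric. A minimal sketch, assuming hypothetical `samples` records that carry a question id, the original dataset row, and the generated `completion`:

```python
from collections import defaultdict

# `samples` is a hypothetical flat list of generation records, one per sample:
# {"qid": ..., "row": <the BabelCode dataset row>, "completion": <generated program>}
samples = []  # fill with your own generation records

grouped = defaultdict(list)
rows = {}
for sample in samples:
    grouped[sample["qid"]].append(sample["completion"])
    rows[sample["qid"]] = sample["row"]

predictions, languages, question_infos = [], [], []
for qid, candidates in grouped.items():
    predictions.append(candidates)          # List[str] of candidates for this question
    languages.append(rows[qid]["language"])
    question_infos.append(rows[qid]["question_info"])
```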
### Inputs | |
* `predictions`(`List[List[str]]`): The list of predictions for each question to execute. | |
* `languages`(`List[str]`): The language to use for each question. | |
* `question_dicts`(`List[Dict]`): The information for each question. | |
* `k`(`List[int]`): number of code candidates to consider in the evaluation (Default: [1, 10, 100]) | |
* `num_workers`(`int`): number of workers used to evaluate the candidate programs (Default: 4). | |
* `language_timeout`(`Dict[str,int]`): Timeouts to use for each language. If it is not set, will default to the one in the question dict (Default: None). | |
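For example, a call that raises the worker count and overrides the timeouts for two languages might look like this (the specific values here are illustrative only):

```python
metrics, results = metric.compute(
    predictions=predictions,
    languages=languages,
    question_dicts=question_infos,
    k=[1, 10],
    num_workers=8,
    # Only these languages are overridden; others keep the timeout from their question dict.
    language_timeout={"Python": 10, "C++": 30},
)
```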
### Output Values
The `bc_eval` metric outputs two things:
* `metrics`: a dictionary with the pass rates for each `k` value defined in the arguments and the mean percent of tests passed per question. The keys are formatted as `{LANGUAGE NAME}/{METRIC NAME}`.
* `results`: a list of dictionaries with the results for each individual prediction.
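Since the keys of `metrics` follow the `{LANGUAGE NAME}/{METRIC NAME}` format, per-language scores can be read back out directly; a minimal sketch:

```python
# Split each "{LANGUAGE}/{METRIC}" key and print only the pass@k scores.
for key, value in sorted(metrics.items()):
    language, metric_name = key.split("/", 1)
    if metric_name.startswith("pass@"):
        print(f"{language:<12} {metric_name} = {value:.4f}")
```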
#### Values from Popular Papers
[PaLM-2](https://arxiv.org/pdf/2305.10403.pdf) performance on BC-HumanEval (`pass@1` with greedy decoding):

| Language   | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
|------------|-----------|-----------|-----------------|
| C#         | 24.22     | 20.5      | **26.09**       |
| C++        | **34.16** | 21.74     | 24.22           |
| Go         | 19.25     | 13.66     | **21.12**       |
| Haskell    | **8.7**   | 1.86      | 1.86            |
| Java       | **31.06** | 20.5      | 25.47           |
| JavaScript | **32.3**  | 23.6      | 29.81           |
| Julia      | **16.77** | 2.48      | 4.35            |
| Lua        | **26.09** | 19.25     | 24.84           |
| PHP        | **26.09** | 18.63     | 25.47           |
| Python     | **34.16** | 17.39     | 26.71           |
| Rust       | **28.57** | 16.15     | 22.98           |
| TypeScript | **32.3**  | 17.39     | 30.43           |
### Examples
Below are full examples with predictions that pass, fail tests, time out, and raise an error.
#### Passing Example
```Python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
```
`results` is:
```
[
  {
    "qid": 0,
    "idx": "0",
    "file_path": ".../tmpqt_p3dwn/0",
    "results": [
      {
        "return_code": 0,
        "runtime": 0.076369,
        "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n",
        "stderr": "",
        "timed_out": false
      }
    ],
    "failed": false,
    "timed_out": false,
    "test_cases": {
      "0": "PASSED",
      "1": "PASSED",
      "2": "PASSED",
      "3": "PASSED",
      "4": "PASSED",
      "5": "PASSED",
      "6": "PASSED"
    },
    "outcome": "PASSED"
  }
]
```
#### Fails Test Example
```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
```
Note that the outcome of each individual test case is reported in `results` under the `test_cases` field.
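This makes it straightforward to see exactly which test cases a prediction failed; a minimal sketch over the returned `results` list:

```python
# Report the failing test cases for every prediction that did not pass.
for result in results:
    if result["outcome"] == "PASSED":
        continue
    failed = [tid for tid, status in result["test_cases"].items() if status != "PASSED"]
    print(f"qid={result['qid']} idx={result['idx']} outcome={result['outcome']} failed tests: {failed}")
```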
#### Timeout Example
```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
```
#### Error Example
```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
""",
"""def add(a, b):
    return a+b"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmpjdn51aaa/0", "results": [{"return_code": 0, "runtime": 0.102855, "stdout": "TEST-0...ValueError\r\nTEST-1...ValueError\r\nTEST-2...ValueError\r\nTEST-3...ValueError\r\nTEST-4...ValueError\r\nTEST-5...ValueError\r\nTEST-6...ValueError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "ValueError", "1": "ValueError", "2": "ValueError", "3": "ValueError", "4": "ValueError", "5": "ValueError", "6": "ValueError"}, "outcome": "HAD_ERROR"},
 {"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}]
```
## Limitations and Bias
This metric can only be used with datasets that are BabelCode compatible, i.e. each question must provide the `question_info` produced by the BabelCode framework.
## Citation
```
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
```