Sanity check: How can we use the apps metric on the given reference solutions from the apps dataset?
#2 by lhk - opened
I would like to use the APPS metric but am not sure about the right format for the generated data. The APPS dataset provides a list of reference solutions per problem, so as a sanity check, is it possible to verify that these reference solutions are recognized as (mostly) correct?
One strange issue: the metric seems to expect a list(list(list(str))), i.e. three levels of nesting, while the documentation says it should be two nested lists, i.e. list(list(str)). Below is a small illustration of the two shapes, followed by the minimal example I set up.
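To make the shape question concrete, here is a purely illustrative sketch (the solution strings are made up, not taken from the dataset):

```python
# Shape described in the documentation: one list of candidate solutions per problem.
documented_format = [
    ["print(input())"],             # problem 0: one candidate solution
    ["print(42)", "print(6 * 7)"],  # problem 1: two candidate solutions
]

# Shape the metric actually accepted in my experiment: each problem's
# list of candidates is wrapped in one additional list.
accepted_format = [
    [["print(input())"]],
    [["print(42)", "print(6 * 7)"]],
]
```

The minimal example itself: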
```python
from datasets import load_dataset
from evaluate import load
import json

apps_metric = load("codeparrot/apps_metric")
apps_dataset = load_dataset("codeparrot/apps", split="test")

solutions = []
for entry in apps_dataset:
    # non-empty solutions and input_output features can be parsed from their text format this way:
    if entry["solutions"]:
        parsed = json.loads(entry["solutions"])
    else:
        parsed = ""
    # if I don't wrap the solutions in an additional list, the metric complains;
    # also, to speed up evaluation, I'm limiting the number of reference solutions per problem
    solutions.append([parsed[:10]])

results = apps_metric.compute(predictions=solutions)
print(results)
```
This leads to the output:

```
number of compile errors = 1273 avg = 0.2546
number of runtime errors = 1 avg = 0.0002
number of problems evaluated = 5000
Average Accuracy : 0.001601382419694627
Strict Accuracy : 0.0008
{'avg_accuracy': 0.001601382419694627, 'strict_accuracy': 0.0008, 'pass_at_k': None}
```
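As an aside, pass_at_k comes back as None here. My guess is that pass@k is only reported when a k_list argument is passed to compute(); that is an assumption on my part (I have not checked the metric source), but something like the following is what I would try:

```python
from evaluate import load

apps_metric = load("codeparrot/apps_metric")

# Assumption: compute() accepts a k_list argument controlling pass@k, as hinted
# at by the 'pass_at_k' field in the result dict above. `solutions` is the
# nested list built in the snippet above.
results_with_k = apps_metric.compute(predictions=solutions, k_list=[1, 5, 10])
print(results_with_k["pass_at_k"])
```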
I've now found the example_script in the files :)
Please disregard my question, I think that script explains everything.
lhk changed discussion status to closed