Sanity check: How can we use the apps metric on the given reference solutions from the apps dataset?

#2
by lhk - opened

I would like to use the APPS metric, but I'm not sure about the right format for the generated data.
The APPS dataset provides a list of reference solutions for each problem. As a sanity check, is it possible to verify that these reference solutions are recognized as (mostly) correct?

I've set up the following minimal example.
One strange issue: the metric seems to expect a list(list(list(str))), i.e. three nested lists, while the documentation says it should be list(list(str)), i.e. two nested lists (there is a shape comparison after the output below).

```python
from datasets import load_dataset
from evaluate import load
import json

apps_metric = load('codeparrot/apps_metric')
apps_dataset = load_dataset("codeparrot/apps", split="test")

solutions = []

for entry in apps_dataset:
    # non-empty solutions and input_output features are stored as JSON text and can be parsed this way:
    if entry['solutions']:
        parsed = json.loads(entry["solutions"])
    else:
        parsed = ""

    # if I don't wrap the solutions in an additional list, the metric complains
    # also, to speed up evaluation, I'm limiting the number of reference solutions per problem
    solutions.append([parsed[:10]])

results = apps_metric.compute(predictions=solutions)
print(results)
```

This leads to the output:

```
number of compile errors = 1273 avg = 0.2546
number of runtime errors = 1 avg = 0.0002
number of problems evaluated = 5000
Average Accuracy : 0.001601382419694627
Strict Accuracy : 0.0008
{'avg_accuracy': 0.001601382419694627, 'strict_accuracy': 0.0008, 'pass_at_k': None}
```
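
To make the shape question concrete, here is a small illustration of the two layouts; the solution strings are made up purely for illustration. The first is the documented format, the second is what the loop above builds:

```python
# Documented layout, list(list(str)): one inner list of candidate programs per problem.
# (solution strings invented for illustration only)
predictions = [
    ["print(int(input()) * 2)"],                          # candidates for problem 0
    ["a, b = map(int, input().split())\nprint(a + b)"],   # candidates for problem 1
]

# Layout built by the loop above, list(list(list(str))): each problem's candidate
# list is wrapped in one extra list, so a single "candidate" is itself a list of strings.
predictions_nested = [[candidates] for candidates in predictions]
```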

I've now found the example_script in the files :)
Please disregard my question; I think that script explains everything.
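
For reference, if the documented list(list(str)) layout is indeed what compute expects, the sanity check above would presumably just drop the extra wrapping, roughly like this (an untested sketch; the bundled example_script is the authoritative reference):

```python
from datasets import load_dataset
from evaluate import load
import json

apps_metric = load("codeparrot/apps_metric")
apps_dataset = load_dataset("codeparrot/apps", split="test")

# list(list(str)): one inner list of candidate solution strings per problem
solutions = []
for entry in apps_dataset:
    parsed = json.loads(entry["solutions"]) if entry["solutions"] else []
    solutions.append(parsed[:10])  # keep at most 10 reference solutions per problem

results = apps_metric.compute(predictions=solutions)
print(results)
```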

lhk changed discussion status to closed
