Sanity check: How can we use the apps metric on the given reference solutions from the apps dataset?

#2
by lhk - opened

I would like to use the APPS metric, but I'm not sure about the right format for the generated data.
The APPS dataset provides a list of reference solutions for each problem. As a sanity check, is it possible to verify that these reference solutions are recognized as (mostly) correct?

I've set up the following minimal example.
One strange issue: the metric seems to expect a list(list(list(str))), i.e. three nested lists, while the documentation says it should be list(list(str)), i.e. two nested lists (there is a shape comparison after the output below).

```python
from datasets import load_dataset
from evaluate import load
import json

apps_metric = load('codeparrot/apps_metric')
apps_dataset = load_dataset("codeparrot/apps", split="test")

solutions = []

for entry in apps_dataset:
    # non-empty solutions and input_output features are stored as JSON text and can be parsed this way:
    if entry['solutions']:
        parsed = json.loads(entry["solutions"])
    else:
        parsed = ""

    # if I don't wrap the solutions in an additional list, the metric complains
    # also, to speed up evaluation, I'm limiting the number of reference solutions per problem
    solutions.append([parsed[:10]])

results = apps_metric.compute(predictions=solutions)
print(results)
```

This leads to the output:

```
number of compile errors = 1273 avg = 0.2546
number of runtime errors = 1 avg = 0.0002
number of problems evaluated = 5000
Average Accuracy : 0.001601382419694627
Strict Accuracy : 0.0008
{'avg_accuracy': 0.001601382419694627, 'strict_accuracy': 0.0008, 'pass_at_k': None}
```
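
To make the shape question concrete, here is a small illustration of the two layouts; the solution strings are made up purely for illustration. The first is the documented format, the second is what the loop above builds:

```python
# Documented layout, list(list(str)): one inner list of candidate programs per problem.
# (solution strings invented for illustration only)
predictions = [
    ["print(int(input()) * 2)"],                          # candidates for problem 0
    ["a, b = map(int, input().split())\nprint(a + b)"],   # candidates for problem 1
]

# Layout built by the loop above, list(list(list(str))): each problem's candidate
# list is wrapped in one extra list, so a single "candidate" is itself a list of strings.
predictions_nested = [[candidates] for candidates in predictions]
```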

I've now found the example_script in the files :)
Please disregard my question; I think that script explains everything.
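
For reference, if the documented list(list(str)) layout is indeed what compute expects, the sanity check above would presumably just drop the extra wrapping, roughly like this (an untested sketch; the bundled example_script is the authoritative reference):

```python
from datasets import load_dataset
from evaluate import load
import json

apps_metric = load("codeparrot/apps_metric")
apps_dataset = load_dataset("codeparrot/apps", split="test")

# list(list(str)): one inner list of candidate solution strings per problem
solutions = []
for entry in apps_dataset:
    parsed = json.loads(entry["solutions"]) if entry["solutions"] else []
    solutions.append(parsed[:10])  # keep at most 10 reference solutions per problem

results = apps_metric.compute(predictions=solutions)
print(results)
```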

lhk changed discussion status to closed
