---
title: NL to Bash Generation Eval
emoji: 🤗
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 3.15.0
app_file: app.py
pinned: false
---
## Metric Description

Evaluation metrics for natural language to Bash command generation. The preprocessing is customized for the tldr dataset, where we first anonymize the variables in each command.
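The anonymization step can be sketched roughly as follows. This is a minimal illustration, not the metric's actual implementation: the placeholder name `{{var}}` and the exact regex are assumptions.

```python
import re

def anonymize_variables(command: str) -> str:
    """Replace every {{...}} slot with a generic placeholder so that
    differing variable names do not affect template comparison."""
    return re.sub(r"\{\{.*?\}\}", "{{var}}", command)

# Two commands that differ only in the variable name become identical
# templates after anonymization:
anonymize_variables("ipcrm --shmem-id {{segment_id}}")  # "ipcrm --shmem-id {{var}}"
anonymize_variables("ipcrm --shmem-id {{shmem_id}}")    # "ipcrm --shmem-id {{var}}"
```

This explains why, in the usage example below, the first prediction can count as a template match even though its variable name differs from the reference's.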
## How to Use
This metric takes as input a list of predicted commands and a list of reference commands:

```python
import evaluate

predictions = ["ipcrm --shmem-id {{segment_id}}",
               "trash-empty --keep-files {{path/to/file_or_directory}}"]
references = ["ipcrm --shmem-id {{shmem_id}}",
              "trash-empty {{10}}"]

tldr_metrics = evaluate.load("neulab/tldr_eval")
results = tldr_metrics.compute(predictions=predictions, references=references)
print(results)
>>> {'template_matching': 0.5, 'command_accuracy': 1.0, 'bleu_char': 65.67965919013294, 'token_recall': 0.9999999999583333, 'token_precision': 0.8333333333055555, 'token_f1': 0.8999999999183333}
```
## Inputs

- **predictions** (`list` of `str`): Predictions to score.
- **references** (`list` of `str`): References.
## Output Values

- **template_matching**: the exact match accuracy
- **command_accuracy**: accuracy of predicting the correct Bash command name (e.g., `ls`)
- **bleu_char**: character-level BLEU score
- **token_recall / token_precision / token_f1**: the recall/precision/F1 of the predicted tokens
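The token-level scores can be sketched as set overlap between predicted and reference tokens. This is a simplified sketch under assumed whitespace tokenization, not the metric's exact implementation (which, as the near-1.0 but not exactly 1.0 values above suggest, likely applies smoothing):

```python
def token_prf(prediction: str, reference: str):
    """Token-level precision/recall/F1 over whitespace-split tokens,
    using unique-token overlap (an assumption for illustration)."""
    pred_tokens = set(prediction.split())
    ref_tokens = set(reference.split())
    overlap = len(pred_tokens & ref_tokens)
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# An over-complete prediction keeps perfect recall but loses precision:
token_prf("trash-empty --keep-files {{path}}", "trash-empty {{path}}")
# -> precision 2/3, recall 1.0, F1 0.8
```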
## Citation

```bibtex
@article{zhou2022doccoder,
  title={DocCoder: Generating Code by Retrieving and Reading Docs},
  author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
  journal={arXiv preprint arXiv:2207.05987},
  year={2022}
}
```