---
title: NL to Bash Generation Eval
emoji: 🤗
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 3.15.0
app_file: app.py
pinned: false
---

## Metric Description

Evaluation metrics for natural language to Bash generation. The preprocessing is customized for the tldr dataset, where we first anonymize the variables (the `{{...}}` placeholders) so that commands can be compared independently of the placeholder names.
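For intuition, here is a minimal sketch of what such anonymization could look like; the regex and the generic `{{var}}` token are assumptions, not necessarily the metric's actual preprocessing:

```python
import re

def anonymize(command: str) -> str:
    # Hypothetical sketch: collapse every {{...}} placeholder into a single
    # generic token so differently named variables compare as equal.
    return re.sub(r"\{\{.*?\}\}", "{{var}}", command)

# Both anonymize to "ipcrm --shmem-id {{var}}" and therefore match exactly:
anonymize("ipcrm --shmem-id {{segment_id}}")
anonymize("ipcrm --shmem-id {{shmem_id}}")
```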

## How to Use

This metric takes as input a list of predicted commands and a list of reference commands:

```python
import evaluate

predictions = ["ipcrm --shmem-id {{segment_id}}",
               "trash-empty --keep-files {{path/to/file_or_directory}}"]
references = ["ipcrm --shmem-id {{shmem_id}}",
              "trash-empty {{10}}"]
tldr_metrics = evaluate.load("neulab/tldr_eval")
results = tldr_metrics.compute(predictions=predictions, references=references)
print(results)
# {'template_matching': 0.5, 'command_accuracy': 1.0, 'bleu_char': 65.67965919013294, 'token_recall': 0.9999999999583333, 'token_precision': 0.8333333333055555, 'token_f1': 0.8999999999183333}
```
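In this example, both predictions name the correct command (`ipcrm` and `trash-empty`), so `command_accuracy` is 1.0; only the first prediction matches its reference template once the variable names are anonymized, so `template_matching` is 0.5.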

## Inputs

- `predictions` (list of `str`): the predicted Bash commands to score.
- `references` (list of `str`): the reference Bash commands to score against.

## Output Values

- `template_matching`: exact-match accuracy, computed after variable anonymization.
- `command_accuracy`: accuracy of predicting the correct Bash command name (e.g., `ls`).
- `bleu_char`: character-level BLEU score.
- `token_recall` / `token_precision` / `token_f1`: the recall, precision, and F1 of the predicted tokens against the reference tokens (a sketch of these computations follows this list).
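For intuition, a minimal sketch of how the per-example token scores and the command-name check could be computed; the function names, whitespace tokenization, and lack of smoothing are assumptions, not the metric's actual implementation:

```python
def token_prf(pred: str, ref: str) -> tuple[float, float, float]:
    # Hypothetical sketch: score the multiset overlap of whitespace tokens.
    pred_tokens, ref_tokens = pred.split(), ref.split()
    overlap = sum(min(pred_tokens.count(t), ref_tokens.count(t))
                  for t in set(pred_tokens))
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def command_name_match(pred: str, ref: str) -> bool:
    # Hypothetical sketch: the command name is the first whitespace token.
    pred_tokens, ref_tokens = pred.split(), ref.split()
    return bool(pred_tokens and ref_tokens) and pred_tokens[0] == ref_tokens[0]
```

Applied to the example above (after anonymizing the `{{...}}` placeholders), this sketch gives an average precision of (1 + 2/3) / 2 ≈ 0.83 and an average recall of 1.0, consistent with the reported `token_precision` and `token_recall`.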

## Citation

```bibtex
@article{zhou2022doccoder,
  title={DocCoder: Generating Code by Retrieving and Reading Docs},
  author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
  journal={arXiv preprint arXiv:2207.05987},
  year={2022}
}
```