---
title: NL to Bash Generation Eval
emoji: 🤗
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 3.15.0
app_file: app.py
pinned: false
---

## Metric Description

Evaluation metrics for natural language to Bash generation. The preprocessing is customized for the tldr dataset, where we first anonymize the variables (the `{{...}}` placeholders) so that commands can be compared independently of the placeholder names.
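For intuition, here is a minimal sketch of what such anonymization could look like; the regex and the generic `{{var}}` token are assumptions, not necessarily the metric's actual preprocessing:

```python
import re

def anonymize(command: str) -> str:
    # Hypothetical sketch: collapse every {{...}} placeholder into a single
    # generic token so differently named variables compare as equal.
    return re.sub(r"\{\{.*?\}\}", "{{var}}", command)

# Both anonymize to "ipcrm --shmem-id {{var}}" and therefore match exactly:
anonymize("ipcrm --shmem-id {{segment_id}}")
anonymize("ipcrm --shmem-id {{shmem_id}}")
```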

## How to Use

This metric takes as input a list of predicted commands and a list of reference commands:

```python
import evaluate

predictions = ["ipcrm --shmem-id {{segment_id}}",
               "trash-empty --keep-files {{path/to/file_or_directory}}"]
references = ["ipcrm --shmem-id {{shmem_id}}",
              "trash-empty {{10}}"]
tldr_metrics = evaluate.load("neulab/tldr_eval")
results = tldr_metrics.compute(predictions=predictions, references=references)
print(results)
# {'template_matching': 0.5, 'command_accuracy': 1.0, 'bleu_char': 65.67965919013294, 'token_recall': 0.9999999999583333, 'token_precision': 0.8333333333055555, 'token_f1': 0.8999999999183333}
```
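In this example, both predictions name the correct command (`ipcrm` and `trash-empty`), so `command_accuracy` is 1.0; only the first prediction matches its reference template once the variable names are anonymized, so `template_matching` is 0.5.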

## Inputs

- `predictions` (list of `str`): the predicted Bash commands to score.
- `references` (list of `str`): the reference Bash commands to score against.

## Output Values

- `template_matching`: exact-match accuracy, computed after variable anonymization.
- `command_accuracy`: accuracy of predicting the correct Bash command name (e.g., `ls`).
- `bleu_char`: character-level BLEU score.
- `token_recall` / `token_precision` / `token_f1`: the recall, precision, and F1 of the predicted tokens against the reference tokens (a sketch of these computations follows this list).
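For intuition, a minimal sketch of how the per-example token scores and the command-name check could be computed; the function names, whitespace tokenization, and lack of smoothing are assumptions, not the metric's actual implementation:

```python
def token_prf(pred: str, ref: str) -> tuple[float, float, float]:
    # Hypothetical sketch: score the multiset overlap of whitespace tokens.
    pred_tokens, ref_tokens = pred.split(), ref.split()
    overlap = sum(min(pred_tokens.count(t), ref_tokens.count(t))
                  for t in set(pred_tokens))
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def command_name_match(pred: str, ref: str) -> bool:
    # Hypothetical sketch: the command name is the first whitespace token.
    pred_tokens, ref_tokens = pred.split(), ref.split()
    return bool(pred_tokens and ref_tokens) and pred_tokens[0] == ref_tokens[0]
```

Applied to the example above (after anonymizing the `{{...}}` placeholders), this sketch gives an average precision of (1 + 2/3) / 2 ≈ 0.83 and an average recall of 1.0, consistent with the reported `token_precision` and `token_recall`.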

## Citation

```bibtex
@article{zhou2022doccoder,
  title={DocCoder: Generating Code by Retrieving and Reading Docs},
  author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
  journal={arXiv preprint arXiv:2207.05987},
  year={2022}
}
```