add descriptions

- README.md: +32 -32
- tldr_eval.py: +6 -2
README.md CHANGED
@@ -1,48 +1,48 @@
 ---
-title:
-
-
-
-description: "TODO: add a description here"
+title: NL to Bash Generation Eval
+emoji: 🤗
+colorFrom: indigo
+colorTo: green
 sdk: gradio
-sdk_version: 3.0
+sdk_version: 3.15.0
 app_file: app.py
 pinned: false
 ---
-
-# Metric Card for bash_eval
-
-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
-
+Evaluation metrics for natural-language-to-bash generation.
+The preprocessing is customized for the [`tldr`](https://github.com/tldr-pages/tldr) dataset, where we first anonymize the variables.
 
 ## How to Use
-*Give general statement of how to use the metric*
-
+This metric takes as input a list of predicted sentences and a list of reference sentences:
+
+```python
+import evaluate
+
+predictions = ["ipcrm --shmem-id {{segment_id}}",
+               "trash-empty --keep-files {{path/to/file_or_directory}}"]
+references = ["ipcrm --shmem-id {{shmem_id}}",
+              "trash-empty {{10}}"]
+tldr_metrics = evaluate.load("neulab/tldr_eval")
+results = tldr_metrics.compute(predictions=predictions, references=references)
+print(results)
+>>> {'template_matching': 0.5, 'command_accuracy': 1.0, 'bleu_char': 65.67965919013294, 'token_recall': 0.9999999999583333, 'token_precision': 0.8333333333055555, 'token_f1': 0.8999999999183333}
+```
 
 ### Inputs
-
-- **
+- **predictions** (`list` of `str`s): Predictions to score.
+- **references** (`list` of `str`s): References to score against.
 
 ### Output Values
-*Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
-
-*State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
-
-#### Values from Popular Papers
-*Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-
-### Examples
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
-
-## Limitations and Bias
-*Note any known limitations or biases that the metric has, with links and references if possible.*
+- **template_matching**: exact-match accuracy of the full command template
+- **command_accuracy**: accuracy of predicting the correct bash command name (e.g., `ls`)
+- **bleu_char**: character-level BLEU score
+- **token recall/precision/f1**: token-level recall/precision/F1 of the predicted tokens against the reference tokens
 
 ## Citation
-
-
-
-
+```bibtex
+@article{zhou2022doccoder,
+  title={DocCoder: Generating Code by Retrieving and Reading Docs},
+  author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
+  journal={arXiv preprint arXiv:2207.05987},
+  year={2022}
+}
+```
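For intuition about the output values above, here is a minimal sketch of how the command accuracy and token-level scores could be computed. This is an illustration only, not the module's actual implementation: the whitespace tokenization, first-token command extraction, and epsilon smoothing are assumptions (the tiny offsets in the example output, e.g. `0.9999999999583333`, are consistent with such smoothing), and the function names are hypothetical.

```python
# Hypothetical sketch -- not the module's actual code.

def command_accuracy(pred: str, ref: str) -> float:
    """1.0 if the first token (the command name, e.g. `ls`) matches."""
    return float(pred.split()[0] == ref.split()[0])

def token_f1(pred: str, ref: str, eps: float = 1e-10):
    """Token-level precision/recall/F1 over the two commands' token sets."""
    pred_tokens, ref_tokens = set(pred.split()), set(ref.split())
    overlap = len(pred_tokens & ref_tokens)
    precision = overlap / (len(pred_tokens) + eps)
    recall = overlap / (len(ref_tokens) + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Both predictions name the right command, but the second adds a flag:
print(command_accuracy("trash-empty --keep-files [[VAR]]", "trash-empty [[VAR]]"))  # 1.0
print(token_f1("trash-empty --keep-files [[VAR]]", "trash-empty [[VAR]]"))
# -> (~0.667, ~1.0, ~0.8): 2 of 3 predicted tokens appear in the reference
```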
tldr_eval.py CHANGED
@@ -31,7 +31,8 @@ _CITATION = """\
 """
 
 _DESCRIPTION = """\
-
+Evaluation metrics for natural-language-to-bash generation.
+The preprocessing is customized for the [`tldr`](https://github.com/tldr-pages/tldr) dataset, where we first anonymize the variables.
 """
 
 
@@ -40,7 +41,10 @@ predictions: list of str. The predictions
 references: list of str. The references
 
 Return
-
+- **template_matching**: exact-match accuracy of the full command template
+- **command_accuracy**: accuracy of predicting the correct bash command name (e.g., `ls`)
+- **bleu_char**: character-level BLEU score
+- **token recall/precision/f1**: token-level recall/precision/F1 of the predicted tokens against the reference tokens
 """
 
 VAR_STR = "[[VAR]]"