shuyanzh committed
Commit b20012f
Parent: 8ab7e68

add descriptions

Files changed (2)
  1. README.md +32 -32
  2. tldr_eval.py +6 -2
README.md CHANGED
@@ -1,48 +1,48 @@
  ---
- title: bash_eval
- tags:
- - evaluate
- - metric
- description: "TODO: add a description here"
+ title: NL to Bash Generation Eval
+ emoji: 🤗
+ colorFrom: indigo
+ colorTo: green
  sdk: gradio
- sdk_version: 3.0.2
+ sdk_version: 3.15.0
  app_file: app.py
  pinned: false
  ---
-
- # Metric Card for bash_eval
-
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
+ Evaluation metrics for natural language to bash generation.
+ The preprocessing is customized for the [`tldr`](https://github.com/tldr-pages/tldr) dataset, where variables are first anonymized.
 
  ## How to Use
- *Give general statement of how to use the metric*
 
- *Provide simplest possible example for using the metric*
+ This metric takes as input a list of predicted commands and a list of reference commands:
+
+ ```python
+ import evaluate
+
+ predictions = ["ipcrm --shmem-id {{segment_id}}",
+                "trash-empty --keep-files {{path/to/file_or_directory}}"]
+ references = ["ipcrm --shmem-id {{shmem_id}}",
+               "trash-empty {{10}}"]
+ tldr_metrics = evaluate.load("neulab/tldr_eval")
+ results = tldr_metrics.compute(predictions=predictions, references=references)
+ print(results)
+ >>> {'template_matching': 0.5, 'command_accuracy': 1.0, 'bleu_char': 65.67965919013294, 'token_recall': 0.9999999999583333, 'token_precision': 0.8333333333055555, 'token_f1': 0.8999999999183333}
+ ```
 
  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
+ - **predictions** (`list` of `str`): Predictions to score.
+ - **references** (`list` of `str`): References to score against.
 
  ### Output Values
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
-
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
-
- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-
- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
-
- ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*
+ - **template_matching**: exact-match accuracy of the anonymized command templates
+ - **command_accuracy**: accuracy of predicting the correct bash command name (e.g., `ls`)
+ - **bleu_char**: character-level BLEU score
+ - **token recall/precision/f1**: recall/precision/F1 over the predicted tokens
 
  ## Citation
- *Cite the source where this metric was introduced.*
-
- ## Further References
- *Add any useful further references.*
+ ```
+ @article{zhou2022doccoder,
+   title={DocCoder: Generating Code by Retrieving and Reading Docs},
+   author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
+   journal={arXiv preprint arXiv:2207.05987},
+   year={2022}
+ }
+ ```
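A minimal sketch of the variable-anonymization step the new description refers to. The `anonymize` helper below is hypothetical and written only for illustration; the actual preprocessing lives in `tldr_eval.py` and may differ, but collapsing every tldr-style `{{...}}` placeholder to the module's `[[VAR]]` token is consistent with the `template_matching` value of 0.5 in the usage example above.

```python
import re

# Hypothetical sketch: collapse `{{variable}}` placeholders to a single token
# so commands are compared as templates. The token mirrors VAR_STR = "[[VAR]]"
# defined in tldr_eval.py; the real preprocessing may differ.
VAR_STR = "[[VAR]]"

def anonymize(command: str) -> str:
    """Replace every `{{...}}` placeholder with VAR_STR."""
    return re.sub(r"\{\{.*?\}\}", VAR_STR, command)

print(anonymize("ipcrm --shmem-id {{segment_id}}"))  # ipcrm --shmem-id [[VAR]]
print(anonymize("ipcrm --shmem-id {{shmem_id}}"))    # ipcrm --shmem-id [[VAR]]  -> templates match
print(anonymize("trash-empty --keep-files {{path/to/file_or_directory}}"))
print(anonymize("trash-empty {{10}}"))               # trash-empty [[VAR]]       -> templates differ
```

Under this reading, the first prediction/reference pair matches as a template while the second does not, giving a template-matching score of 0.5 for the example in the card.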
tldr_eval.py CHANGED
@@ -31,7 +31,8 @@ _CITATION = """\
  """
 
  _DESCRIPTION = """\
- This metric is used to evaluate the quality of a generated bash script.
+ Evaluation metrics for natural language to bash generation.
+ The preprocessing is customized for the [`tldr`](https://github.com/tldr-pages/tldr) dataset, where variables are first anonymized.
  """
 
 
@@ -40,7 +41,10 @@ predictions: list of str. The predictions
  references: list of str. The references
 
  Return
-
+ - **template_matching**: exact-match accuracy of the anonymized command templates
+ - **command_accuracy**: accuracy of predicting the correct bash command name (e.g., `ls`)
+ - **bleu_char**: character-level BLEU score
+ - **token recall/precision/f1**: recall/precision/F1 over the predicted tokens
  """
 
  VAR_STR = "[[VAR]]"
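As a rough, hypothetical illustration of the documented return values (not the module's actual code): command accuracy can be read as comparing the first token of the anonymized prediction and reference, and token precision/recall/F1 as a bag-of-tokens overlap. Averaging these per-pair scores over the two anonymized examples from the README reproduces the reported precision/recall/F1 of roughly 0.83/1.0/0.9, though the real implementation in `tldr_eval.py` may tokenize and aggregate differently.

```python
from collections import Counter

def command_accuracy(pred: str, ref: str) -> float:
    # "Command name" is assumed here to be the first whitespace-separated token (e.g. `ls`).
    return float(pred.split()[0] == ref.split()[0])

def token_prf(pred: str, ref: str) -> tuple[float, float, float]:
    # Bag-of-tokens overlap between prediction and reference.
    pred_tokens, ref_tokens = Counter(pred.split()), Counter(ref.split())
    overlap = sum((pred_tokens & ref_tokens).values())
    precision = overlap / max(sum(pred_tokens.values()), 1)
    recall = overlap / max(sum(ref_tokens.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Anonymized versions of the README example pairs.
pairs = [("ipcrm --shmem-id [[VAR]]", "ipcrm --shmem-id [[VAR]]"),
         ("trash-empty --keep-files [[VAR]]", "trash-empty [[VAR]]")]
print([command_accuracy(p, r) for p, r in pairs])  # [1.0, 1.0] -> command accuracy of 1.0
print([token_prf(p, r) for p, r in pairs])         # per-pair (precision, recall, f1)
```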