venkatasg committed
Commit 24fa801 · 1 Parent(s): eca243b

Doesn’t work yet but I’ve put in all the source code and comments (I think)

Files changed (2)
  1. README.md +6 -28
  2. gleu.py +156 -17
README.md CHANGED
@@ -3,7 +3,7 @@ title: gleu
  tags:
  - evaluate
  - metric
- description: "TODO: add a description here"
+ description: "Generalized Language Evaluation Understanding (GLEU) is a metric initially developed for Grammatical Error Correction (GEC) that builds upon BLEU by rewarding corrections while also correctly crediting unchanged source text."
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
@@ -12,37 +12,15 @@ pinned: false

  # Metric Card for gleu

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

+ The GLEU metric can be used for any monolingual "translation" task, that is, for Grammatical Error Correction and other text rewriting tasks. Like BLEU, it computes n-gram precisions over the reference, but it assigns extra weight to n-grams that have been correctly changed from the source. GLEU thus rewards corrections while also correctly crediting unchanged source text.
+
  ## How to Use
- *Give general statement of how to use the metric*
-
- *Provide simplest possible example for using the metric*
-
- ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
-
- ### Output Values
-
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
-
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
-
- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-
- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

- ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*
+ Follow the instructions in [the GitHub repository](https://github.com/cnap/gec-ranking) accompanying the cited papers.

  ## Citation
- *Cite the source where this metric was introduced.*

- ## Further References
- *Add any useful further references.*
+ - [Ground Truth for Grammatical Error Correction Metrics](https://aclanthology.org/P15-2097) by Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. *Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)*, 2015.
+ - [GLEU Without Tuning](https://arxiv.org/abs/1605.02592) by Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. arXiv preprint arXiv:1605.02592, 2016.
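
Since the "How to Use" section above only points at the upstream repository, here is a minimal usage sketch, assuming the finished module follows the standard `evaluate` loading pattern and the argument names documented in gleu.py (`predictions`, `references`, and optional `sources`, all space-tokenized strings). The load path and the returned key are assumptions; the commit message notes the module does not work yet.

```python
import evaluate

# Assumed load path for this Space; adjust to wherever the metric is hosted.
gleu = evaluate.load("venkatasg/gleu")

sources = ["He go to school every days ."]        # original, uncorrected sentences
references = ["He goes to school every day ."]    # corrected reference sentences
predictions = ["He goes to school every days ."]  # system output to score

# `sources` is optional; if omitted, the references are reused as sources.
results = gleu.compute(predictions=predictions, references=references, sources=sources)
print(results)  # expected to look like {"gleu_score": ...}
```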
gleu.py CHANGED
@@ -15,20 +15,29 @@

  import evaluate
  import datasets
+ from collections import Counter
+ from math import log, exp
+ from random import seed, randint


  # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
- title = {A great new module},
- authors={huggingface, Inc.},
- year={2020}
+ @InProceedings{napoles-EtAl:2015:ACL-IJCNLP,
+     author    = {Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel},
+     title     = {Ground Truth for Grammatical Error Correction Metrics},
+     booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
+     month     = {July},
+     year      = {2015},
+     address   = {Beijing, China},
+     publisher = {Association for Computational Linguistics},
+     pages     = {588--593},
+     url       = {http://www.aclweb.org/anthology/P15-2097}
  }
  """

  # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
+ The GLEU metric can be used for any monolingual "translation" task, that is, for Grammatical Error Correction and other text rewriting tasks. Like BLEU, it computes n-gram precisions over the reference, but it assigns extra weight to n-grams that have been correctly changed from the source. GLEU thus rewards corrections while also correctly crediting unchanged source text.
  """


@@ -36,13 +45,12 @@ This new module is designed to solve this great ML task and is crafted with a lo
  _KWARGS_DESCRIPTION = """
  Calculates how good are predictions given some references, using certain scores
  Args:
-     predictions: list of predictions to score. Each predictions
-         should be a string with tokens separated by spaces.
-     references: list of reference for each prediction. Each
-         reference should be a string with tokens separated by spaces.
+     sources: list of source sentences (the original, uncorrected text). Assumed to be the same as references if not provided.
+     references: list of references for each prediction. Each reference should be a string with tokens separated by spaces.
+     predictions: list of predictions to score. Each prediction should be a string with tokens separated by spaces.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
+     gleu_score: the corpus-level GLEU score for the predictions.
+
  Examples:
      Examples should be written in doctest format, and should illustrate how
      to use the function.
@@ -56,6 +64,99 @@ Examples:
  # TODO: Define external resources urls if needed
  BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"

+ class GLEU():
+
+     def __init__(self, order=4):
+         # maximum n-gram order used for the precision statistics
+         self.order = order
+
+     def load_hypothesis_sentence(self, hypothesis):
+         self.hlen = len(hypothesis)
+         self.this_h_ngrams = [self.get_ngram_counts(hypothesis, n)
+                               for n in range(1, self.order + 1)]
+
+     def load_sources(self, spath):
+         self.all_s_ngrams = [[self.get_ngram_counts(line.split(), n)
+                               for n in range(1, self.order + 1)]
+                              for line in open(spath)]
+
+     def load_references(self, rpaths):
+         self.refs = [[] for i in range(len(self.all_s_ngrams))]
+         self.rlens = [[] for i in range(len(self.all_s_ngrams))]
+         for rpath in rpaths:
+             for i, line in enumerate(open(rpath)):
+                 self.refs[i].append(line.split())
+                 self.rlens[i].append(len(line.split()))
+
+         # count the number of references each n-gram appears in
+         self.all_rngrams_freq = [Counter() for i in range(self.order)]
+
+         self.all_r_ngrams = []
+         for refset in self.refs:
+             all_ngrams = []
+             self.all_r_ngrams.append(all_ngrams)
+
+             for n in range(1, self.order + 1):
+                 ngrams = self.get_ngram_counts(refset[0], n)
+                 all_ngrams.append(ngrams)
+
+                 for k in ngrams.keys():
+                     self.all_rngrams_freq[n - 1][k] += 1
+
+                 for ref in refset[1:]:
+                     new_ngrams = self.get_ngram_counts(ref, n)
+                     for nn in new_ngrams.elements():
+                         if new_ngrams[nn] > ngrams.get(nn, 0):
+                             ngrams[nn] = new_ngrams[nn]
+
+     def get_ngram_counts(self, sentence, n):
+         return Counter([tuple(sentence[i:i + n])
+                         for i in range(len(sentence) + 1 - n)])
+
+     # returns n-grams in a but not in b
+     def get_ngram_diff(self, a, b):
+         diff = Counter(a)
+         for k in (set(a) & set(b)):
+             del diff[k]
+         return diff
+
+     def normalization(self, ngram, n):
+         return 1.0 * self.all_rngrams_freq[n - 1][ngram] / len(self.rlens[0])
+
+     # Collect BLEU-relevant statistics for a single hypothesis/reference pair.
+     # Return value is a generator yielding:
+     # (c, r, numerator1, denominator1, ... numerator4, denominator4)
+     # Summing the columns across calls to this function on an entire corpus
+     # will produce a vector of statistics that can be used to compute GLEU
+     def gleu_stats(self, i, r_ind=None):
+
+         hlen = self.hlen
+         rlen = self.rlens[i][r_ind]
+
+         yield hlen
+         yield rlen
+
+         for n in range(1, self.order + 1):
+             h_ngrams = self.this_h_ngrams[n - 1]
+             s_ngrams = self.all_s_ngrams[i][n - 1]
+             r_ngrams = self.get_ngram_counts(self.refs[i][r_ind], n)
+
+             s_ngram_diff = self.get_ngram_diff(s_ngrams, r_ngrams)
+
+             yield max([sum((h_ngrams & r_ngrams).values()) -
+                        sum((h_ngrams & s_ngram_diff).values()), 0])
+
+             yield max([hlen + 1 - n, 0])
+
+     # Compute GLEU from collected statistics obtained by call(s) to gleu_stats
+     def compute_gleu(self, stats, smooth=False):
+         # smooth 0 counts for sentence-level scores
+         if smooth:
+             stats = [s if s != 0 else 1 for s in stats]
+         if any(x == 0 for x in stats):
+             return 0
+         (c, r) = stats[:2]
+         log_gleu_prec = sum([log(float(x) / y)
+                              for x, y in zip(stats[2::2], stats[3::2])]) / 4
+         return exp(min([0, 1 - float(r) / c]) + log_gleu_prec)
+
+

  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class gleu(evaluate.Metric):
@@ -85,11 +186,49 @@ class gleu(evaluate.Metric):
          """Optional: download external resources useful to compute the scores"""
          # TODO: Download external resources if needed
          pass
+
+     def __init__(self, order=4, **kwargs):
+         super().__init__(**kwargs)
+         self.order = order

-     def _compute(self, predictions, references):
+     def _compute(self, predictions, references, sources=None):
          """Returns the scores"""
-         # TODO: Compute the different scores of the module
-         accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
-         return {
-             "accuracy": accuracy,
-         }
+
+         num_iterations = 500
+
+         if len(references) == 1:
+             num_iterations = 1
+
+         gleu_calculator = GLEU(self.order)
+
+         if sources:
+             gleu_calculator.load_sources(sources)
+         else:
+             gleu_calculator.load_sources(references)
+
+         gleu_calculator.load_references(references)
+
+         # first generate a random list of indices, using a different seed
+         # for each iteration
+         indices = []
+         for j in range(num_iterations):
+             seed(j * 101)
+             indices.append([randint(0, len(references) - 1) for i in range(len(predictions))])
+
+         # accumulated statistics for each resampling iteration
+         iter_stats = [[0 for i in range(2 * self.order + 2)] for j in range(num_iterations)]
+
+         for i, h in enumerate(predictions):
+             gleu_calculator.load_hypothesis_sentence(h)
+
+             # we are going to store the score of this sentence for each ref
+             # so we don't have to recalculate them 500 times
+             stats_by_ref = [None for r in range(len(references))]
+
+             for j in range(num_iterations):
+                 ref = indices[j][i]
+                 this_stats = stats_by_ref[ref]
+
+                 if this_stats is None:
+                     this_stats = [s for s in gleu_calculator.gleu_stats(i, r_ind=ref)]
+                     stats_by_ref[ref] = this_stats
+
+                 iter_stats[j] = [sum(scores) for scores in zip(iter_stats[j], this_stats)]
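
To make the scoring arithmetic above concrete, here is a small self-contained sketch of what `GLEU.gleu_stats` and `GLEU.compute_gleu` calculate for a single sentence, using made-up example sentences. It folds the per-order statistics and the final combination into one function and always applies the smoothing path, so it illustrates the formula rather than the module's API; the names `toy_gleu`, `ngrams`, and `ngram_diff` are illustrative only.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    # multiset of n-grams in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) + 1 - n))

def ngram_diff(a, b):
    # n-grams in a but not in b (any key also present in b is dropped)
    diff = Counter(a)
    for k in set(a) & set(b):
        del diff[k]
    return diff

def toy_gleu(source, reference, hypothesis, order=4):
    # c (hypothesis length), r (reference length), then one numerator/denominator
    # pair per n-gram order, exactly as gleu_stats yields them
    stats = [len(hypothesis), len(reference)]
    for n in range(1, order + 1):
        h, r, s = ngrams(hypothesis, n), ngrams(reference, n), ngrams(source, n)
        s_only = ngram_diff(s, r)   # source n-grams the reference corrected away
        num = max(sum((h & r).values()) - sum((h & s_only).values()), 0)
        den = max(len(hypothesis) + 1 - n, 0)
        stats += [num, den]
    # sentence-level smoothing of zero counts, as in compute_gleu(smooth=True)
    stats = [x if x != 0 else 1 for x in stats]
    c, r = stats[:2]
    log_prec = sum(log(x / y) for x, y in zip(stats[2::2], stats[3::2])) / order
    return exp(min(0, 1 - r / c) + log_prec)  # brevity penalty + mean log precision

src = "He go to school every days .".split()
ref = "He goes to school every day .".split()
hyp = "He goes to school every days .".split()
print(round(toy_gleu(src, ref, hyp), 3))
```

A hypothesis n-gram that matches the reference adds to the numerator, while one that matches an uncorrected source error (here "every days") is subtracted, which is how GLEU rewards corrections and penalizes text left unchanged when it should have been edited.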