Doesn’t work yet but I’ve put in all the source code and comments (I think)
README.md CHANGED

```diff
@@ -3,7 +3,7 @@ title: gleu
 tags:
 - evaluate
 - metric
-description: "TODO: add a description here"
+description: "Generalized Language Evaluation Understanding (GLEU) is a metric initially developed for Grammatical Error Correction (GEC) that builds upon BLEU by rewarding corrections while also correctly crediting unchanged source text."
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
@@ -12,37 +12,15 @@ pinned: false
 
 # Metric Card for gleu
 
-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
-*Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
 
+The GLEU metric can be used for any monolingual "translation" task; that is, it applies to Grammatical Error Correction and other text-rewriting tasks. Like BLEU, GLEU computes n-gram precisions over the reference, but it assigns extra weight to n-grams that have been correctly changed from the source. GLEU rewards corrections while also correctly crediting unchanged source text.
+
 ## How to Use
-*Give general statement of how to use the metric*
-
-*Provide simplest possible example for using the metric*
-
-### Inputs
-*List all input arguments in the format below*
-- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
-
-### Output Values
-
-*Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
-
-*State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
-
-#### Values from Popular Papers
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-
-### Examples
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
 
-
-*Note any known limitations or biases that the metric has, with links and references if possible.*
+Follow the instructions in [the GitHub repository](https://github.com/cnap/gec-ranking) accompanying the cited papers.
 
 ## Citation
-*Cite the source where this metric was introduced.*
 
-
-
+- [Ground Truth for Grammatical Error Correction Metrics](https://aclanthology.org/P15-2097) by Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault, and Matt Post. *Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing*.
+- [GLEU Without Tuning](https://arxiv.org/abs/1605.02592) by Courtney Napoles, Keisuke Sakaguchi, Joel Tetreault, and Matt Post. arXiv.
```
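Since the rewritten "How to Use" section only points at the upstream repository, here is a sketch of how the finished module might be invoked through `evaluate` (the Space id is hypothetical, and the argument names and space-separated tokenization follow the `_KWARGS_DESCRIPTION` added in `gleu.py` below):

```python
import evaluate

# Hypothetical module id; this Space is not functional yet.
gleu = evaluate.load("my-namespace/gleu")

# Tokens are space-separated, per the argument docs in gleu.py.
results = gleu.compute(
    sources=["this are a sentence ."],     # uncorrected source text
    predictions=["this is a sentence ."],  # system output
    references=["this is a sentence ."],   # gold correction
)
print(results)  # expected shape: {"gleu_score": ...}
```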
gleu.py CHANGED

```diff
@@ -15,20 +15,29 @@
 
 import evaluate
 import datasets
+from collections import Counter
+from math import log, exp
+from random import seed, randint
 
 
 # TODO: Add BibTeX citation
 _CITATION = """\
-@InProceedings{huggingface:module,
-title = {A great new module},
-authors={huggingface, Inc.},
-year={2020}
+@InProceedings{napoles-EtAl:2015:ACL-IJCNLP,
+  author    = {Napoles, Courtney and Sakaguchi, Keisuke and Post, Matt and Tetreault, Joel},
+  title     = {Ground Truth for Grammatical Error Correction Metrics},
+  booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
+  month     = {July},
+  year      = {2015},
+  address   = {Beijing, China},
+  publisher = {Association for Computational Linguistics},
+  pages     = {588--593},
+  url       = {http://www.aclweb.org/anthology/P15-2097}
 }
 """
 
 # TODO: Add description of the module here
 _DESCRIPTION = """\
-This new module is designed to solve this great ML task and is crafted with a lot of care.
+GLEU can be used for any monolingual "translation" task; that is, it applies to Grammatical Error Correction and other text-rewriting tasks. Like BLEU, GLEU computes n-gram precisions over the reference, but it assigns extra weight to n-grams that have been correctly changed from the source. GLEU rewards corrections while also correctly crediting unchanged source text.
 """
 
 
@@ -36,13 +45,12 @@ This new module is designed to solve this great ML task and is crafted with a lo
 _KWARGS_DESCRIPTION = """
 Calculates how good are predictions given some references, using certain scores
 Args:
-    predictions: list of predictions to score. Each predictions
-        should be a string with tokens separated by spaces.
-    references: list of reference for each prediction. Each
-        reference should be a string with tokens separated by spaces.
+    sources: source sentences. Assumed to be the same as the references if not provided.
+    references: list of references, one per prediction. Each reference should be a string with tokens separated by spaces.
+    predictions: list of predictions to score. Each prediction should be a string with tokens separated by spaces.
 Returns:
-    accuracy: description of the first score,
-    another_score: description of the second score,
+    gleu_score: the GLEU score of the predictions, averaged over the bootstrap iterations.
+
 Examples:
     Examples should be written in doctest format, and should illustrate how
    to use the function.
```
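Before the `GLEU` class arrives in the next hunk, a standalone toy rendition (not part of the commit) of the per-order statistic that `gleu_stats` yields may help: clipped hypothesis-reference n-gram matches, minus matches with n-grams that occur only in the source. The `s - r` Counter subtraction is a simplification of the class's `get_ngram_diff`.

```python
from collections import Counter

def ngrams(tokens, n):
    # multiset of n-grams, as in GLEU.get_ngram_counts
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) + 1 - n))

src = "this are a sentence".split()   # uncorrected source
hyp = "this is a sentence".split()    # system hypothesis
ref = "this is a sentence".split()    # gold reference

h, s, r = ngrams(hyp, 1), ngrams(src, 1), ngrams(ref, 1)

# Unigrams the source has but the reference does not ("are"): keeping
# these uncorrected is what GLEU penalizes.
s_only = s - r

numerator = max(sum((h & r).values()) - sum((h & s_only).values()), 0)
denominator = max(len(hyp) + 1 - 1, 0)  # hlen + 1 - n, for n = 1
print(numerator, denominator)  # -> 4 4: all four corrected unigrams credited
```

A hypothesis that kept the erroneous "are" would match only three reference unigrams and would additionally be docked one match by the source-only penalty term.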
```diff
@@ -56,6 +64,99 @@ Examples:
 # TODO: Define external resources urls if needed
 BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
 
+class GLEU():
+
+    def __init__(self, order=4):
+        # maximum n-gram order used for all statistics (GLEU uses up to 4-grams)
+        self.order = order
+
+    def load_hypothesis_sentence(self, hypothesis):
+        self.hlen = len(hypothesis)
+        self.this_h_ngrams = [self.get_ngram_counts(hypothesis, n)
+                              for n in range(1, self.order+1)]
+
+    def load_sources(self, spath):
+        self.all_s_ngrams = [[self.get_ngram_counts(line.split(), n)
+                              for n in range(1, self.order+1)]
+                             for line in open(spath)]
+
+    def load_references(self, rpaths):
+        self.refs = [[] for i in range(len(self.all_s_ngrams))]
+        self.rlens = [[] for i in range(len(self.all_s_ngrams))]
+        for rpath in rpaths:
+            for i, line in enumerate(open(rpath)):
+                self.refs[i].append(line.split())
+                self.rlens[i].append(len(line.split()))
+
+        # count the number of references each n-gram appears in
+        self.all_rngrams_freq = [Counter() for i in range(self.order)]
+
+        self.all_r_ngrams = []
+        for refset in self.refs:
+            all_ngrams = []
+            self.all_r_ngrams.append(all_ngrams)
+
+            for n in range(1, self.order+1):
+                ngrams = self.get_ngram_counts(refset[0], n)
+                all_ngrams.append(ngrams)
+
+                for k in ngrams.keys():
+                    self.all_rngrams_freq[n-1][k] += 1
+
+                for ref in refset[1:]:
+                    new_ngrams = self.get_ngram_counts(ref, n)
+                    for nn in new_ngrams.elements():
+                        if new_ngrams[nn] > ngrams.get(nn, 0):
+                            ngrams[nn] = new_ngrams[nn]
+
+    def get_ngram_counts(self, sentence, n):
+        return Counter([tuple(sentence[i:i+n])
+                        for i in range(len(sentence)+1-n)])
+
+    # returns ngrams in a but not in b
+    def get_ngram_diff(self, a, b):
+        diff = Counter(a)
+        for k in (set(a) & set(b)):
+            del diff[k]
+        return diff
+
+    def normalization(self, ngram, n):
+        return 1.0*self.all_rngrams_freq[n-1][ngram]/len(self.rlens[0])
+
+    # Collect BLEU-relevant statistics for a single hypothesis/reference pair.
+    # Return value is a generator yielding:
+    # (c, r, numerator1, denominator1, ... numerator4, denominator4)
+    # Summing the columns across calls to this function on an entire corpus
+    # will produce a vector of statistics that can be used to compute GLEU
+    def gleu_stats(self, i, r_ind=None):
+
+        hlen = self.hlen
+        rlen = self.rlens[i][r_ind]
+
+        yield hlen
+        yield rlen
+
+        for n in range(1, self.order+1):
+            h_ngrams = self.this_h_ngrams[n-1]
+            s_ngrams = self.all_s_ngrams[i][n-1]
+            r_ngrams = self.get_ngram_counts(self.refs[i][r_ind], n)
+
+            s_ngram_diff = self.get_ngram_diff(s_ngrams, r_ngrams)
+
+            # clipped hypothesis/reference matches, minus matches with
+            # source-only n-grams (the GLEU penalty term)
+            yield max([sum((h_ngrams & r_ngrams).values()) -
+                       sum((h_ngrams & s_ngram_diff).values()), 0])
+
+            yield max([hlen+1-n, 0])
+
+    # Compute GLEU from collected statistics obtained by call(s) to gleu_stats
+    def compute_gleu(self, stats, smooth=False):
+        # smooth 0 counts for sentence-level scores
+        if smooth:
+            stats = [s if s != 0 else 1 for s in stats]
+        if any(s == 0 for s in stats):
+            return 0
+        (c, r) = stats[:2]
+        log_gleu_prec = sum([log(float(x)/y)
+                             for x, y in zip(stats[2::2], stats[3::2])]) / 4
+        return exp(min([0, 1-float(r)/c]) + log_gleu_prec)
 
 
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class gleu(evaluate.Metric):
```
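One thing to note before the `_compute` hunk below: `load_sources` and `load_references` read file paths (both call `open`), while `_compute` hands them the in-memory `references`/`sources` lists, which is presumably part of why the commit message says it doesn't work yet. For orientation, here is a sketch of how the class is driven file-to-file, in the style of the upstream gec-ranking scripts (file names are hypothetical; each file holds one tokenized sentence per line):

```python
# Hypothetical file names, one space-tokenized sentence per line.
gleu_calculator = GLEU(order=4)
gleu_calculator.load_sources("source.txt")                 # uncorrected input
gleu_calculator.load_references(["ref0.txt", "ref1.txt"])  # one or more reference files

for i, line in enumerate(open("hypothesis.txt")):
    gleu_calculator.load_hypothesis_sentence(line.split())
    # statistics against reference 0; smooth zero counts for a sentence-level score
    stats = list(gleu_calculator.gleu_stats(i, r_ind=0))
    print(gleu_calculator.compute_gleu(stats, smooth=True))
```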
```diff
@@ -85,11 +186,49 @@ class gleu(evaluate.Metric):
         """Optional: download external resources useful to compute the scores"""
         # TODO: Download external resources if needed
         pass
+
+    def __init__(self, order=4, **kwargs):
+        super().__init__(**kwargs)
+        self.order = order
 
-    def _compute(self, predictions, references):
+    def _compute(self, predictions, references, sources=None):
         """Returns the scores"""
-        # TODO: Compute the actual scores of the module instead of returning these fake results.
-        accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
-        return {
-            "accuracy": accuracy,
-        }
+
+        num_iterations = 500
+
+        if len(references) == 1:
+            num_iterations = 1
+
+        gleu_calculator = GLEU(self.order)
+
+        if sources:
+            gleu_calculator.load_sources(sources)
+        else:
+            gleu_calculator.load_sources(references)
+
+        gleu_calculator.load_references(references)
+
+        # first generate a random list of indices, using a different seed
+        # for each iteration
+        indices = []
+        for j in range(num_iterations):
+            seed(j*101)
+            indices.append([randint(0, len(references)-1) for i in range(len(predictions))])
+
+        # accumulated n-gram statistics for each bootstrap iteration
+        iter_stats = [[0 for _ in range(2*self.order+2)] for _ in range(num_iterations)]
+
+        for i, h in enumerate(predictions):
+            gleu_calculator.load_hypothesis_sentence(h)
+
+            # we are going to store the stats of this sentence for each ref
+            # so we don't have to recalculate them 500 times
+            stats_by_ref = [None for r in range(len(references))]
+
+            for j in range(num_iterations):
+                ref = indices[j][i]
+                this_stats = stats_by_ref[ref]
+
+                if this_stats is None:
+                    this_stats = [s for s in gleu_calculator.gleu_stats(i, r_ind=ref)]
+                    stats_by_ref[ref] = this_stats
+
+                iter_stats[j] = [sum(scores) for scores in zip(iter_stats[j], this_stats)]
+
```
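The hunk, and with it the commit, stops before `_compute` returns anything. Given `compute_gleu` above, the missing tail would presumably turn each iteration's summed statistics into a corpus score and average over iterations, along these lines (a hedged sketch, not part of the commit):

```python
        # Hypothetical completion of _compute: average corpus-level GLEU
        # over the bootstrap iterations.
        scores = [gleu_calculator.compute_gleu(stats) for stats in iter_stats]
        return {"gleu_score": sum(scores) / len(scores)}
```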