github-actions committed on
Commit
54ad4a5
1 Parent(s): 3d10a5a

Auto files update [main]

Files changed (5)
  1. README.md +116 -6
  2. app.py +6 -0
  3. codebleu.py +124 -0
  4. requirements.txt +2 -0
  5. tests.py +17 -0
README.md CHANGED
@@ -1,12 +1,122 @@
  ---
- title: Codebleu
- emoji: 🐠
- colorFrom: purple
- colorTo: blue
  sdk: gradio
- sdk_version: 3.35.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: codebleu
+ tags:
+ - evaluate
+ - metric
+ - code
+ - codebleu
+ description: "Unofficial `CodeBLEU` implementation with Linux and macOS support, available on PyPI and the HF Hub."
  sdk: gradio
+ sdk_version: 3.19.1
  app_file: app.py
  pinned: false
  ---
 
+ # Metric Card for codebleu
+
+ ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
+
+ ## Metric Description
+ Unofficial `CodeBLEU` implementation with Linux and macOS support, available on PyPI and the HF Hub.
+
+ > An ideal evaluation metric should consider the grammatical correctness and the logic correctness.
+ > We propose weighted n-gram match and syntactic AST match to measure grammatical correctness, and introduce semantic data-flow match to calculate logic correctness.
+ > ![CodeBLEU](CodeBLEU.jpg)
+ (from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) repo)
+
+ In a nutshell, `CodeBLEU` is a weighted combination of `n-gram match (BLEU)`, `weighted n-gram match (BLEU-weighted)`, `AST match` and `data-flow match` scores.
+
+ The metric has shown higher correlation with human evaluation than the `BLEU` and `accuracy` metrics.
+
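As a hedged sketch of how the final score is assembled from the four components under the default weights (a plain weighted sum; the component values are borrowed from the example output further below):

```python
# CodeBLEU is a weighted sum of its four component scores.
weights = (0.25, 0.25, 0.25, 0.25)        # ngram, weighted ngram, AST, dataflow
components = (0.1041, 0.1109, 1.0, 1.0)   # example component scores (see Examples)

codebleu = sum(w * s for w, s in zip(weights, components))
print(round(codebleu, 3))  # ~0.554, matching the `codebleu` field in the example output
```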
+ ## How to Use
+ The metric can be used either through the `codebleu` PyPI package or loaded with the `evaluate` library; minimal examples of both are given below.
+
+ ### Inputs
+
+ - `references` (`list[str]` or `list[list[str]]`): reference code
+ - `predictions` (`list[str]`): predicted code
+ - `lang` (`str`): code language, see `codebleu.AVAILABLE_LANGS` for available languages (`python`, `c_sharp`, `java` at the moment)
+ - `weights` (`tuple[float, float, float, float]`): weights of the `ngram_match`, `weighted_ngram_match`, `syntax_match`, and `dataflow_match` scores respectively; defaults to `(0.25, 0.25, 0.25, 0.25)`
+ - `tokenizer` (`callable`): used to split a code string into tokens; defaults to `s.split()` (see the sketch after this list)
+
+
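A hedged illustration of the `weights` and `tokenizer` parameters listed above (the non-default weights here are arbitrary, chosen only to show the mechanics):

```python
from codebleu import calc_codebleu

prediction = "def add ( a , b ) :\n return a + b"
reference = "def sum ( first , second ) :\n return second + first"

result = calc_codebleu(
    [reference],
    [prediction],
    lang="python",
    weights=(0.1, 0.1, 0.4, 0.4),   # emphasize AST and data-flow over raw n-grams
    tokenizer=lambda s: s.split(),  # equivalent to the default whitespace tokenizer
)
print(result["codebleu"])
```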
+ ### Output Values
+
+ [//]: # (*Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*)
+
+ [//]: # (*State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*)
+
+ The metric outputs a `dict[str, float]` with the following fields:
+ - `codebleu`: the final `CodeBLEU` score
+ - `ngram_match_score`: `ngram_match` score (BLEU)
+ - `weighted_ngram_match_score`: `weighted_ngram_match` score (BLEU-weighted)
+ - `syntax_match_score`: `syntax_match` score (AST match)
+ - `dataflow_match_score`: `dataflow_match` score
+
+ Each of the scores is in the range `[0, 1]`, where `1` is the best score.
+
+
+ ### Examples
+
+ [//]: # (*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*)
+
+ Using the pip package (`pip install codebleu`):
+ ```python
+ from codebleu import calc_codebleu
+
+ prediction = "def add ( a , b ) :\n return a + b"
+ reference = "def sum ( first , second ) :\n return second + first"
+
+ result = calc_codebleu([reference], [prediction], lang="python", weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None)
+ print(result)
+ # {
+ #   'codebleu': 0.5537,
+ #   'ngram_match_score': 0.1041,
+ #   'weighted_ngram_match_score': 0.1109,
+ #   'syntax_match_score': 1.0,
+ #   'dataflow_match_score': 1.0
+ # }
+ ```
+
+ Or using the `evaluate` library (the `codebleu` package is still required):
+ ```python
+ import evaluate
+ metric = evaluate.load("k4black/codebleu")
+
+ prediction = "def add ( a , b ) :\n return a + b"
+ reference = "def sum ( first , second ) :\n return second + first"
+
+ result = metric.compute(references=[reference], predictions=[prediction], lang="python", weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None)
+ ```
+
+ Note: the `lang` argument is required.
+
+
+ ## Limitations and Bias
+
+ [//]: # (*Note any known limitations or biases that the metric has, with links and references if possible.*)
+
+ As this library requires compiling `.so` files, it is platform dependent.
+
+ Currently it is available for Linux (manylinux) and macOS on Python 3.8+.
+
+
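A minimal, hedged guard for unsupported platforms (assuming you prefer a clear failure message over an import-time traceback when no prebuilt wheel exists):

```python
import sys

try:
    from codebleu import calc_codebleu
except ImportError as err:  # e.g. no prebuilt wheel for this platform or Python version
    sys.exit(f"codebleu is unavailable on this platform: {err}")
```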
+ ## Citation
+ ```bibtex
+ @misc{ren2020codebleu,
+   title={CodeBLEU: a Method for Automatic Evaluation of Code Synthesis},
+   author={Shuo Ren and Daya Guo and Shuai Lu and Long Zhou and Shujie Liu and Duyu Tang and Neel Sundaresan and Ming Zhou and Ambrosio Blanco and Shuai Ma},
+   year={2020},
+   eprint={2009.10297},
+   archivePrefix={arXiv},
+   primaryClass={cs.SE}
+ }
+ ```
+
+ ## Further References
+
+ This implementation is based on the original [CodeXGLUE/CodeBLEU](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) code: refactored, built for macOS, tested, and with several workarounds fixed to make it more usable.
+
+ The source code is available in the GitHub repository [k4black/codebleu](https://github.com/k4black/codebleu).
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("k4black/codebleu")
+ launch_gradio_widget(module)
codebleu.py ADDED
@@ -0,0 +1,124 @@
+ # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """CodeBLEU metric wrapper for the `evaluate` library."""
+
+ import evaluate
+ import datasets
+ from codebleu import calc_codebleu
+
+
+ _CITATION = """\
+ @misc{ren2020codebleu,
+   title={CodeBLEU: a Method for Automatic Evaluation of Code Synthesis},
+   author={Shuo Ren and Daya Guo and Shuai Lu and Long Zhou and Shujie Liu and Duyu Tang and Neel Sundaresan and Ming Zhou and Ambrosio Blanco and Shuai Ma},
+   year={2020},
+   eprint={2009.10297},
+   archivePrefix={arXiv},
+   primaryClass={cs.SE}
+ }
+ """
+
+ _DESCRIPTION = """\
+ Unofficial `CodeBLEU` implementation with Linux and macOS support, available on PyPI and the HF Hub.
+
+ Based on the original [CodeXGLUE/CodeBLEU](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) code: refactored, built for macOS, tested, and with several workarounds fixed to make it more usable.
+ """
+
+
+ _KWARGS_DESCRIPTION = """
+ Calculate a weighted combination of `n-gram match (BLEU)`, `weighted n-gram match (BLEU-weighted)`, `AST match` and `data-flow match` scores.
+
+ Args:
+     predictions: list of predictions to score. Each prediction
+         should be a string with tokens separated by spaces.
+     references: list of references for each prediction. Each
+         reference should be a string with tokens separated by spaces.
+     lang: programming language in ['java','js','c_sharp','php','go','python','ruby'].
+     weights: tuple of 4 floats to use as weights for scores. Defaults to (0.25, 0.25, 0.25, 0.25).
+ Returns:
+     codebleu: resulting `CodeBLEU` score,
+     ngram_match_score: resulting `n-gram match (BLEU)` score,
+     weighted_ngram_match_score: resulting `weighted n-gram match (BLEU-weighted)` score,
+     syntax_match_score: resulting `AST match` score,
+     dataflow_match_score: resulting `data-flow match` score,
+ Examples:
+     >>> metric = evaluate.load("k4black/codebleu")
+     >>> ref = "def sum ( first , second ) :\n return second + first"
+     >>> pred = "def add ( a , b ) :\n return a + b"
+     >>> results = metric.compute(references=[ref], predictions=[pred], lang="python")
+     >>> print(results)
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class codebleu(evaluate.Metric):
+     """CodeBLEU metric from CodexGLUE"""
+
+     def _info(self):
+         # Specifies the evaluate.EvaluationModuleInfo object
+         return evaluate.MetricInfo(
+             # This is the description that will appear on the modules page.
+             module_type="metric",
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             # This defines the format of each prediction and reference
+             features=[
+                 datasets.Features(
+                     {
+                         "predictions": datasets.Value("string", id="sequence"),
+                         "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
+                         # "lang": datasets.Value("string"),
+                         # "weights": datasets.Value("string"),
+                         # "tokenizer": datasets.Value("string"),
+                     }
+                 ),
+                 datasets.Features(
+                     {
+                         "predictions": datasets.Value("string", id="sequence"),
+                         "references": datasets.Value("string", id="sequence"),
+                         # "lang": datasets.Value("string"),
+                         # "weights": datasets.Value("string"),
+                         # "tokenizer": datasets.Value("string"),
+                     }
+                 ),
+             ],
+             # Homepage of the module for documentation
+             homepage="https://github.com/k4black/codebleu",
+             # Additional links to the codebase or references
+             codebase_urls=["https://github.com/k4black/codebleu"],
+             reference_urls=[
+                 "https://github.com/k4black/codebleu",
+                 "https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator",
+                 "https://arxiv.org/abs/2009.10297",
+             ],
+         )
+
+     def _download_and_prepare(self, dl_manager):
+         """Optional: download external resources useful to compute the scores"""
+         # TODO: Download external resources if needed
+         pass
+
+     def _compute(self, predictions, references, lang, weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None):
+         """Returns the scores"""
+         return calc_codebleu(
+             references=references,
+             predictions=predictions,
+             lang=lang,
+             weights=weights,
+             tokenizer=tokenizer,
+         )
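The two `datasets.Features` entries in `_info` above let `references` be given either as one string per prediction or as a list of alternative reference strings per prediction. A hedged sketch of both calling conventions, assuming the module is loaded from the Hub:

```python
import evaluate

metric = evaluate.load("k4black/codebleu")

pred = "def add ( a , b ) :\n return a + b"
ref = "def sum ( first , second ) :\n return second + first"

# One reference string per prediction.
single = metric.compute(references=[ref], predictions=[pred], lang="python")

# A list of alternative references per prediction.
multi = metric.compute(references=[[ref]], predictions=[pred], lang="python")

print(single["codebleu"], multi["codebleu"])
```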
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ git+https://github.com/huggingface/evaluate@main
+ codebleu
tests.py ADDED
@@ -0,0 +1,17 @@
+ test_cases = [
+     {
+         "predictions": [0, 0],
+         "references": [1, 1],
+         "result": {"metric_score": 0}
+     },
+     {
+         "predictions": [1, 1],
+         "references": [1, 1],
+         "result": {"metric_score": 1}
+     },
+     {
+         "predictions": [1, 0],
+         "references": [1, 1],
+         "result": {"metric_score": 0.5}
+     }
+ ]
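These committed `test_cases` are still the module template's dummy values and do not match this metric's inputs or output keys. A hedged sketch of a test that actually exercises the metric (assuming `pytest` and the `codebleu` package are installed; the test name is illustrative):

```python
from codebleu import calc_codebleu


def test_calc_codebleu_returns_all_component_scores():
    prediction = "def add ( a , b ) :\n return a + b"
    reference = "def sum ( first , second ) :\n return second + first"

    result = calc_codebleu([reference], [prediction], lang="python")

    # All five fields documented in the README should be present and lie in [0, 1].
    expected_keys = {
        "codebleu",
        "ngram_match_score",
        "weighted_ngram_match_score",
        "syntax_match_score",
        "dataflow_match_score",
    }
    assert expected_keys <= set(result.keys())
    assert all(0.0 <= result[key] <= 1.0 for key in expected_keys)
```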