---
title: codebleu
tags:
  - evaluate
  - metric
description: CodeBLEU
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

# Metric Card for CodeBLEU

## Metric Description

CodeBLEU as implemented in [CodeXGLUE](https://github.com/microsoft/CodeXGLUE), from the paper [CodeBLEU: a Method for Automatic Evaluation of Code Synthesis](https://arxiv.org/abs/2009.10297).

NOTE: currently works on Linux machines only, because it depends on prebuilt language parser libraries (`.so` files).

## How to Use

```python
import evaluate

module = evaluate.load("dvitel/codebleu")

src = 'class AcidicSwampOoze(MinionCard):§    def __init__(self):§        super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§    def create_minion(self, player):§        return Minion(3, 2)§'
tgt = 'class AcidSwampOoze(MinionCard):§    def __init__(self):§        super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§    def create_minion(self, player):§        return Minion(3, 2)§'

# The example strings encode newlines as '§'; restore them before scoring.
src = src.replace("§", "\n")
tgt = tgt.replace("§", "\n")

res = module.compute(predictions=[tgt], references=[[src]])
print(res)
# {'CodeBLEU': 0.9473264567644872, 'ngram_match_score': 0.8915993127600096, 'weighted_ngram_match_score': 0.8977065142979394, 'syntax_match_score': 1.0, 'dataflow_match_score': 1.0}
```

### Inputs

- `predictions` (list of str): translations to score.
- `references` (list of lists of str): references for each translation.
- `lang` (str): programming language of the code, one of `'java'`, `'js'`, `'c_sharp'`, `'php'`, `'go'`, `'python'`, `'ruby'`.
- `tokenizer`: approach used for standardizing `predictions` and `references`. The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is nonetheless equivalent to `mteval-v13a`, used by WMT. It can be replaced by another tokenizer from a source such as SacreBLEU.
- `params` (str): comma-separated weights for averaging the four component scores (see the CodeBLEU paper). Defaults to equal weights, `"0.25,0.25,0.25,0.25"`. See the example after this list.

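For instance, a call that sets these inputs explicitly might look like the sketch below (the code strings are illustrative; `lang` and `params` are assumed to be accepted as keyword arguments to `compute`, per the list above):

```python
import evaluate

module = evaluate.load("dvitel/codebleu")

# Illustrative call with the language and the component weights made explicit.
res = module.compute(
    predictions=['def add(a, b):\n    return a + b\n'],
    references=[['def add(x, y):\n    return x + y\n']],
    lang="python",                  # one of: java, js, c_sharp, php, go, python, ruby
    params="0.25,0.25,0.25,0.25",   # weights for the four component scores (the default)
)
print(res["CodeBLEU"])
```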
### Output Values

- `CodeBLEU`: the resulting overall score.
- `ngram_match_score`: see the CodeBLEU paper.
- `weighted_ngram_match_score`: see the CodeBLEU paper.
- `syntax_match_score`: see the CodeBLEU paper.
- `dataflow_match_score`: see the CodeBLEU paper.

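As described in the CodeBLEU paper, the overall `CodeBLEU` value is the weighted sum of the four component scores, with the weights taken from `params`. With the default equal weights this can be checked against the output of the "How to Use" snippet above:

```python
# Component scores printed by the "How to Use" example above; with equal
# weights, CodeBLEU is their weighted sum (here simply their average).
components = [
    0.8915993127600096,  # ngram_match_score
    0.8977065142979394,  # weighted_ngram_match_score
    1.0,                 # syntax_match_score
    1.0,                 # dataflow_match_score
]
weights = [0.25, 0.25, 0.25, 0.25]
codebleu = sum(w * c for w, c in zip(weights, components))
print(codebleu)  # ≈ 0.9473264567644872, matching the reported CodeBLEU score
```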
#### Values from Popular Papers

Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.

### Examples

Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.
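As a minimal sketch (the code snippets below are illustrative and not taken from any dataset), two candidate translations can be scored against the same reference to see how the metric separates a near-miss from a structurally different solution:

```python
import evaluate

module = evaluate.load("dvitel/codebleu")

reference = 'def add(x, y):\n    return x + y\n'
near_miss = 'def add(a, b):\n    return a + b\n'            # same structure, renamed variables
different = 'def add(x, y):\n    print(x)\n    return 0\n'  # different body

for candidate in (near_miss, different):
    res = module.compute(
        predictions=[candidate],
        references=[[reference]],
        lang="python",
    )
    print(res)
# The near-miss candidate is expected to receive the higher CodeBLEU score.
```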

## Limitations and Bias

Runs on Linux only (see the note above). Only the programming languages listed under Inputs are supported.

## Citation

```bibtex
@InProceedings{huggingface:module,
  title = {CodeBLEU: A Metric for Evaluating Code Generation},
  author = {Sedykh, Ivan},
  year = {2022}
}
```

## Further References

- [CodeBLEU: a Method for Automatic Evaluation of Code Synthesis](https://arxiv.org/abs/2009.10297)
- [CodeXGLUE](https://github.com/microsoft/CodeXGLUE)