---
title: codebleu
tags:
- evaluate
- metric
description: "CodeBLEU"
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

# Metric Card for CodeBLEU

## Metric Description

CodeBLEU is an automatic evaluation metric for code synthesis. This implementation is adapted from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator) 
and follows the paper [CodeBLEU: a Method for Automatic Evaluation of Code Synthesis](https://arxiv.org/abs/2009.10297).

NOTE: currently works on Linux machines only, due to a dependency on compiled language parser libraries (`.so` files).

## How to Use

```python
import evaluate

module = evaluate.load("dvitel/codebleu")
src = 'class AcidicSwampOoze(MinionCard):§    def __init__(self):§        super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§    def create_minion(self, player):§        return Minion(3, 2)§'
tgt = 'class AcidSwampOoze(MinionCard):§    def __init__(self):§        super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§    def create_minion(self, player):§        return Minion(3, 2)§'
src = src.replace("§", "\n")  # '§' is a newline placeholder in the one-line snippets above
tgt = tgt.replace("§", "\n")
res = module.compute(predictions = [tgt], references = [[src]])
print(res)
#{'CodeBLEU': 0.9473264567644872, 'ngram_match_score': 0.8915993127600096, 'weighted_ngram_match_score': 0.8977065142979394, 'syntax_match_score': 1.0, 'dataflow_match_score': 1.0}
```

### Inputs
- **predictions** (`list` of `str`): predicted code snippets to score.
- **references** (`list` of `list` of `str`): reference code snippets for each prediction.
- **lang** (`str`): programming language of the code, one of `['java', 'js', 'c_sharp', 'php', 'go', 'python', 'ruby']`.
- **tokenizer**: approach used for standardizing `predictions` and `references`.
    The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT.
    It can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).
- **params** (`str`): comma-separated weights for averaging the four component scores (see the CodeBLEU paper).
    Defaults to equal weights `"0.25,0.25,0.25,0.25"`.
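
A minimal sketch of passing the optional parameters, using the `lang` and `params` arguments exactly as documented above (the snippets and weight values are only illustrative):

```python
import evaluate

module = evaluate.load("dvitel/codebleu")

prediction = "def add(a, b):\n    return a + b\n"
reference = "def add(x, y):\n    return x + y\n"

# Weight the syntax and data-flow components more heavily than the n-gram ones
# (alpha,beta,gamma,delta in the CodeBLEU paper's notation).
res = module.compute(
    predictions=[prediction],
    references=[[reference]],
    lang="python",
    params="0.1,0.1,0.4,0.4",
)
print(res["CodeBLEU"])
```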

### Output Values

- **CodeBLEU**: the final weighted score.
- **ngram_match_score**: the n-gram (BLEU) match component; see the CodeBLEU paper.
- **weighted_ngram_match_score**: the keyword-weighted n-gram match component; see the CodeBLEU paper.
- **syntax_match_score**: the AST (syntax) match component; see the CodeBLEU paper.
- **dataflow_match_score**: the data-flow match component; see the CodeBLEU paper.
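
With the `params` weights written as (α, β, γ, δ), the final score is the weighted average of the four components, as defined in the CodeBLEU paper:

```latex
\mathrm{CodeBLEU} = \alpha \cdot \mathrm{BLEU}
                  + \beta  \cdot \mathrm{BLEU}_{\mathrm{weight}}
                  + \gamma \cdot \mathrm{Match}_{\mathrm{ast}}
                  + \delta \cdot \mathrm{Match}_{\mathrm{df}}
```

so the default `"0.25,0.25,0.25,0.25"` weights all four components equally.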

#### Values from Popular Papers
*Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

### Examples
*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
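
As a starting point, here is a small sketch (mirroring the call pattern from How to Use above; the snippets are purely illustrative) that contrasts an exact match with a prediction whose identifiers have been renamed, so the individual component scores can be compared:

```python
import evaluate

module = evaluate.load("dvitel/codebleu")

reference = "def greet(name):\n    return 'Hello, ' + name\n"
exact = "def greet(name):\n    return 'Hello, ' + name\n"
renamed = "def greet(person):\n    return 'Hello, ' + person\n"

# An exact match: all component scores should be at or near their maximum.
print(module.compute(predictions=[exact], references=[[reference]]))

# Renamed identifiers change the surface tokens; comparing the two results
# shows how each component reacts to the change.
print(module.compute(predictions=[renamed], references=[[reference]]))
```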

## Limitations and Bias
Runs on Linux only (see the note in the Metric Description above). Only the programming languages listed under Inputs are supported.

## Citation
```bibtex
@article{ren2020codebleu,
  title={CodeBLEU: a Method for Automatic Evaluation of Code Synthesis},
  author={Ren, Shuo and Guo, Daya and Lu, Shuai and Zhou, Long and Liu, Shujie and Tang, Duyu and Sundaresan, Neel and Zhou, Ming and Blanco, Ambrosio and Ma, Shuai},
  journal={arXiv preprint arXiv:2009.10297},
  year={2020}
}
```

## Further References
- [CodeXGLUE code-to-code translation evaluator](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator)
- [CodeBLEU: a Method for Automatic Evaluation of Code Synthesis](https://arxiv.org/abs/2009.10297)