### Motivation
Without post-processing, the English-to-Romanian `mbart-large-en-ro` model gets a BLEU score of 26.8 on the WMT data.
With post-processing, it can score 37.
Here is the post-processing code, borrowed from @mjpost in this [issue](https://github.com/pytorch/fairseq/issues/1758).



### Instructions
Note: You need to have your `test_generations.txt` before you start this process.
(1) Set up `mosesdecoder` and `wmt16-scripts`
```bash
cd $HOME
git clone git@github.com:moses-smt/mosesdecoder.git
cd mosesdecoder  
git clone git@github.com:rsennrich/wmt16-scripts.git
```
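
Before moving on, it can help to confirm that the clones landed where the post-processing step expects them. The following is a minimal sketch: `check_ro_scripts` is a hypothetical helper (not part of Moses or `wmt16-scripts`), and it only checks that the six script paths used in step (2) exist under the given directory.

```bash
# Hypothetical helper: list any script that step (2) relies on but is missing.
# Usage: check_ro_scripts [moses_path]   (defaults to $HOME/mosesdecoder)
check_ro_scripts () {
  moses_path=${1:-$HOME/mosesdecoder}
  missing=0
  for rel in \
    scripts/tokenizer/replace-unicode-punctuation.perl \
    scripts/tokenizer/normalize-punctuation.perl \
    scripts/tokenizer/remove-non-printing-char.perl \
    scripts/tokenizer/tokenizer.perl \
    wmt16-scripts/preprocess/remove-diacritics.py \
    wmt16-scripts/preprocess/normalise-romanian.py
  do
    if [ ! -f "$moses_path/$rel" ]; then
      echo "missing: $moses_path/$rel"
      missing=1
    fi
  done
  # Exit status 0 if everything was found, 1 otherwise
  return $missing
}
```

Calling `check_ro_scripts` with no arguments checks `$HOME/mosesdecoder`; it prints nothing when all six files are present.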

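(2) Define a function for post-processing.
It normalizes punctuation, strips non-printing characters, normalizes Romanian characters, removes diacritics, and tokenizes the text, applying the same steps to both the generations and the references.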
```bash
ro_post_process () {
  sys=$1  # system output, e.g. test_generations.txt
  ref=$2  # reference translations, e.g. test.target
  export MOSES_PATH=$HOME/mosesdecoder
  REPLACE_UNICODE_PUNCT=$MOSES_PATH/scripts/tokenizer/replace-unicode-punctuation.perl
  NORM_PUNC=$MOSES_PATH/scripts/tokenizer/normalize-punctuation.perl
  REM_NON_PRINT_CHAR=$MOSES_PATH/scripts/tokenizer/remove-non-printing-char.perl
  REMOVE_DIACRITICS=$MOSES_PATH/wmt16-scripts/preprocess/remove-diacritics.py
  NORMALIZE_ROMANIAN=$MOSES_PATH/wmt16-scripts/preprocess/normalise-romanian.py
  TOKENIZER=$MOSES_PATH/scripts/tokenizer/tokenizer.perl

  lang=ro
  # Apply the same normalization and tokenization to both files;
  # each result is written to <basename>.tok in the current directory
  for file in "$sys" "$ref"; do
    cat "$file" \
    | "$REPLACE_UNICODE_PUNCT" \
    | "$NORM_PUNC" -l "$lang" \
    | "$REM_NON_PRINT_CHAR" \
    | "$NORMALIZE_ROMANIAN" \
    | "$REMOVE_DIACRITICS" \
    | "$TOKENIZER" -no-escape -l "$lang" \
    > "$(basename "$file").tok"
  done
  # Compute BLEU on the already-tokenized files (no extra tokenization or smoothing; -b prints only the score)
  cat "$(basename "$sys").tok" | sacrebleu -tok none -s none -b "$(basename "$ref").tok"
}
```

(3) Call the function on test_generations.txt and test.target
For example,
```bash
ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target
```
This will print a new BLEU score and write two new files to the current directory: `test_generations.txt.tok` with the post-processed outputs and `test.target.tok` with the post-processed references.
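
Because the function names its outputs with `$(basename $file).tok`, the results land in the current working directory and keep the input's original extension. The snippet below is only an illustration of that naming rule; `tok_name` is a hypothetical helper and is not used by `ro_post_process` itself.

```bash
# Hypothetical helper mirroring the output naming inside ro_post_process
tok_name () {
  echo "$(basename "$1").tok"
}

tok_name enro_finetune/test_generations.txt  # prints test_generations.txt.tok
tok_name wmt_en_ro/test.target               # prints test.target.tok
```

Running the function from two different directories therefore produces two separate sets of `.tok` files, and a second run in the same directory overwrites the first.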








