	### Motivation
	Without post-processing, English-to-Romanian `mbart-large-en-ro` gets a BLEU score of 26.8 on the WMT data.
	With post-processing, it can score 37..
	Here is the post-processing code, stolen from @mjpost in this [issue](https://github.com/pytorch/fairseq/issues/1758)



	### Instructions
	Note: You need to have your test_generations.txt before you start this process.
	(1) Set up `mosesdecoder` and `wmt16-scripts`
	```bash
	cd $HOME
	# HTTPS URLs work without GitHub SSH keys
	git clone https://github.com/moses-smt/mosesdecoder.git
	cd mosesdecoder
	git clone https://github.com/rsennrich/wmt16-scripts.git
	```
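Because `wmt16-scripts` has to sit inside `mosesdecoder` for the paths used in step (2) to resolve, it can be worth checking the layout before going further. The helper below is a minimal sketch: `check_moses_setup` is a hypothetical name, not part of either repository, and the script paths are simply the ones referenced in step (2).

```bash
# Hypothetical helper: report which of the scripts used in this guide are missing.
# Takes the mosesdecoder directory as an optional argument (default: $HOME/mosesdecoder).
check_moses_setup () {
    moses_path=${1:-$HOME/mosesdecoder}
    missing=0
    for script in \
        scripts/tokenizer/replace-unicode-punctuation.perl \
        scripts/tokenizer/normalize-punctuation.perl \
        scripts/tokenizer/remove-non-printing-char.perl \
        scripts/tokenizer/tokenizer.perl \
        wmt16-scripts/preprocess/remove-diacritics.py \
        wmt16-scripts/preprocess/normalise-romanian.py
    do
        if [ ! -f "$moses_path/$script" ]; then
            echo "missing: $moses_path/$script" >&2
            missing=1
        fi
    done
    # Exit status 0 means every script was found.
    return $missing
}
```

Running `check_moses_setup` with no arguments checks `$HOME/mosesdecoder`; any missing path is printed to stderr.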

	(2) Define a function for post-processing.
	It normalizes punctuation, removes non-printing characters and diacritics, and tokenizes both the system output and the reference before computing BLEU.
	```bash
	ro_post_process () {
	    sys=$1   # system output, e.g. test_generations.txt
	    ref=$2   # reference translations, e.g. test.target
	    export MOSES_PATH=$HOME/mosesdecoder
	    REPLACE_UNICODE_PUNCT=$MOSES_PATH/scripts/tokenizer/replace-unicode-punctuation.perl
	    NORM_PUNC=$MOSES_PATH/scripts/tokenizer/normalize-punctuation.perl
	    REM_NON_PRINT_CHAR=$MOSES_PATH/scripts/tokenizer/remove-non-printing-char.perl
	    REMOVE_DIACRITICS=$MOSES_PATH/wmt16-scripts/preprocess/remove-diacritics.py
	    NORMALIZE_ROMANIAN=$MOSES_PATH/wmt16-scripts/preprocess/normalise-romanian.py
	    TOKENIZER=$MOSES_PATH/scripts/tokenizer/tokenizer.perl

	    lang=ro
	    # apply the same normalization and tokenization to both files
	    for file in "$sys" "$ref"; do
	        cat "$file" \
	        | $REPLACE_UNICODE_PUNCT \
	        | $NORM_PUNC -l $lang \
	        | $REM_NON_PRINT_CHAR \
	        | $NORMALIZE_ROMANIAN \
	        | $REMOVE_DIACRITICS \
	        | $TOKENIZER -no-escape -l $lang \
	        > "$(basename "$file").tok"
	    done
	    # compute BLEU on the .tok files, which are written to the current directory
	    cat "$(basename "$sys").tok" | sacrebleu -tok none -s none -b "$(basename "$ref").tok"
	}
	```
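To give a feel for what the diacritics step in the pipeline does, here is a rough stand-in written with `sed`. It is only an illustration under stated assumptions: `strip_ro_diacritics` is a hypothetical name, it maps just the five Romanian letters (comma-below forms of ș and ț) and their capitals to plain ASCII, and it is not a substitute for `remove-diacritics.py` from `wmt16-scripts`.

```bash
# Hypothetical illustration only: map Romanian diacritics to plain ASCII letters.
# Reads text on stdin and writes the stripped text to stdout.
strip_ro_diacritics () {
    sed -e 's/ă/a/g' -e 's/â/a/g' -e 's/î/i/g' \
        -e 's/ș/s/g' -e 's/ț/t/g' \
        -e 's/Ă/A/g' -e 's/Â/A/g' -e 's/Î/I/g' \
        -e 's/Ș/S/g' -e 's/Ț/T/g'
}
```

For example, `printf 'română\n' | strip_ro_diacritics` prints `romana`. Since the function above runs its pipeline on both the system output and the reference, a diacritic mismatch between the two no longer counts against the score.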

	(3) Call the function on test_generations.txt and test.target
	For example,
	```bash
	ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target
	```
	This will print a new BLEU score and write a new file called `test_generations.txt.tok` in the current directory with the post-processed outputs (along with `test.target.tok` for the references).
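BLEU is computed line by line against the reference, so the generations file and the target file need the same number of lines. A quick check before calling `ro_post_process` might look like the sketch below; `check_same_length` is a hypothetical helper name, not something provided by sacrebleu or Moses.

```bash
# Hypothetical helper: succeed only if both files have the same number of lines.
check_same_length () {
    # $(( ... )) strips the padding some wc implementations add around the count
    sys_lines=$(( $(wc -l < "$1") ))
    ref_lines=$(( $(wc -l < "$2") ))
    if [ "$sys_lines" -ne "$ref_lines" ]; then
        echo "line count mismatch: $1 has $sys_lines, $2 has $ref_lines" >&2
        return 1
    fi
}
```

It could be chained with the scoring call, e.g. `check_same_length enro_finetune/test_generations.txt wmt_en_ro/test.target && ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target`.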








