---
language:
- multilingual
- en
license: mit
tags:
- mbart-50
---

# mBART-50

mBART-50 is a multilingual Sequence-to-Sequence model pre-trained using the "Multilingual Denoising Pretraining" objective. It was introduced in the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401).

## Model description

mBART-50 is a multilingual Sequence-to-Sequence model. It was introduced to show that multilingual translation models can be created through multilingual fine-tuning: instead of fine-tuning on a single direction, a pre-trained model is fine-tuned on many directions simultaneously. mBART-50 extends the original mBART model with 25 additional languages, supporting multilingual machine translation across 50 languages. The pre-training objective is explained below.

**Multilingual Denoising Pretraining**: The model incorporates N languages by concatenating data: `D = {D1, ..., DN}`, where each `Di` is a collection of monolingual documents in language `i`. The source documents are noised using two schemes: first, randomly shuffling the order of the original sentences, and second, a novel in-filling scheme in which spans of text are replaced with a single mask token. The model is then tasked with reconstructing the original text. 35% of each instance's words are masked by randomly sampling span lengths from a Poisson distribution `(λ = 3.5)`. The decoder input is the original text offset by one position. A language id symbol `LID` is used as the initial token to predict the sentence.
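
For intuition, here is a minimal sketch of the denoising scheme described above: shuffle the sentence order, then replace word spans (span length drawn from `Poisson(λ = 3.5)`) with a single `<mask>` token until roughly 35% of the words are masked. This is only an illustration of the idea, not the actual pre-training implementation; the `noise` helper and its defaults simply mirror the numbers quoted above.

```python
import numpy as np

MASK = "<mask>"

def noise(sentences, mask_ratio=0.35, poisson_lambda=3.5, seed=0):
    """Toy version of mBART's denoising: sentence shuffling + text in-filling."""
    rng = np.random.default_rng(seed)
    # Scheme 1: randomly shuffle the original sentences' order.
    order = rng.permutation(len(sentences))
    words = " ".join(sentences[i] for i in order).split()
    # Scheme 2: replace each sampled span with a single mask token.
    budget = int(mask_ratio * len(words))
    while budget > 0 and len(words) > 1:
        span = min(max(1, int(rng.poisson(poisson_lambda))), budget, len(words) - 1)
        start = int(rng.integers(0, len(words) - span + 1))
        words[start:start + span] = [MASK]
        budget -= span
    return " ".join(words)

print(noise(["The cat sat on the mat.", "It purred softly.", "Then it slept."]))
```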

## Checking

The snippet below verifies that this checkpoint produces the same generated text, and the same final encoder and decoder hidden states, as the original `facebook/mbart-large-50` checkpoint on an English input.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Reference: the original 50-language checkpoint.
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/mbart-large-50')
tokenizer = AutoTokenizer.from_pretrained('facebook/mbart-large-50')

src_text = "UN Chief Says There Is <mask> Military Solution <mask> Syria"
encoded_en = tokenizer(src_text, return_tensors="pt")
generated_output = model.generate(**encoded_en,
                                  forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
                                  return_dict_in_generate=True, output_hidden_states=True)
text_output = tokenizer.batch_decode(generated_output.sequences, skip_special_tokens=True)

# This checkpoint, which should behave identically on the input above.
new_model = AutoModelForSeq2SeqLM.from_pretrained('nguyenvulebinh/mbart-large-50-latin-only')
new_tokenizer = AutoTokenizer.from_pretrained('nguyenvulebinh/mbart-large-50-latin-only')
new_encoded_en = new_tokenizer(src_text, return_tensors="pt")
new_generated_output = new_model.generate(**new_encoded_en,
                                          forced_bos_token_id=new_tokenizer.lang_code_to_id["en_XX"],
                                          return_dict_in_generate=True, output_hidden_states=True)
new_text_output = new_tokenizer.batch_decode(new_generated_output.sequences, skip_special_tokens=True)

# Both models should agree on the decoded text and on the final hidden states.
assert text_output == new_text_output
assert torch.equal(generated_output.encoder_hidden_states[-1], new_generated_output.encoder_hidden_states[-1])
assert torch.equal(generated_output.decoder_hidden_states[-1][-1], new_generated_output.decoder_hidden_states[-1][-1])
```
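
For plain generation, without the equivalence checks, the checkpoint is used like any other mBART-50 model. A minimal sketch (the `max_length` value here is an arbitrary choice for illustration, not a recommended setting):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('nguyenvulebinh/mbart-large-50-latin-only')
tokenizer = AutoTokenizer.from_pretrained('nguyenvulebinh/mbart-large-50-latin-only')

tokenizer.src_lang = "en_XX"  # source language code (see "Languages covered")
inputs = tokenizer("UN Chief Says There Is <mask> Military Solution <mask> Syria",
                   return_tensors="pt")
# Force the target language id as the first generated token.
output_ids = model.generate(**inputs,
                            forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
                            max_length=64)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```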

## Languages covered

English (en_XX)

## BibTeX entry and citation info

```bibtex
@article{tang2020multilingual,
  title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
  author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
  year={2020},
  eprint={2008.00401},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```