---
language: 
- multilingual
- en

license: mit
tags:
- mbart-50
---

# mBART-50

mBART-50 is a multilingual sequence-to-sequence model pre-trained with the "Multilingual Denoising Pretraining" objective. It was introduced in the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401).

## Model description

mBART-50 is a multilingual sequence-to-sequence model. It was introduced to show that multilingual translation models can be created through multilingual fine-tuning: instead of fine-tuning on one translation direction, a pre-trained model is fine-tuned on many directions simultaneously. mBART-50 is built from the original mBART model and extended with 25 additional languages, so that it supports multilingual machine translation across 50 languages. The pre-training objective is explained below.

**Multilingual Denoising Pretraining**: The model incorporates N languages by concatenating data: 
`D = {D_1, ..., D_N}`, where each `D_i` is a collection of monolingual documents in language `i`. The source documents are noised using two schemes: 
first, the order of the original sentences is randomly shuffled; second, a novel in-filling scheme replaces spans of text with a single mask token. The model is then tasked with reconstructing the original text. 
35% of each instance's words are masked, with each span length randomly sampled from a Poisson distribution `(λ = 3.5)`.
The decoder input is the original text offset by one position. A language id symbol `LID` is used as the initial token to predict the sentence.
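
The noising steps described above can be illustrated with a small word-level sketch. This is not the authors' implementation: the real model operates on subword tokens, and the function names (`sample_poisson`, `noise_document`, `decoder_io`), the `</s>` end marker, the sample document, and the choice to skip zero-length spans are all assumptions made for illustration.

```python
import math
import random

MASK = "<mask>"


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def noise_document(sentences, mask_ratio=0.35, lam=3.5, seed=0):
    """Shuffle sentence order, then replace word spans with a single mask token."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)  # scheme 1: permute the sentences
    words = " ".join(shuffled).split()

    budget = round(mask_ratio * len(words))  # how many words to hide in total
    masked, attempts = 0, 0
    while masked < budget and attempts < 1000:
        attempts += 1
        # Span length ~ Poisson(lam), clipped so the budget is not exceeded.
        span = min(sample_poisson(lam, rng), budget - masked)
        if span == 0:
            continue  # assumption: zero-length spans are skipped in this sketch
        start = rng.randrange(0, len(words) - span + 1)
        if MASK in words[start:start + span]:
            continue  # do not overlap a span that is already masked
        words[start:start + span] = [MASK]  # scheme 2: whole span -> one mask
        masked += span
    return words


def decoder_io(target_words, lang_id="en_XX"):
    """Build decoder input and labels: the input is the target shifted by one,
    with the language id as its first token."""
    labels = target_words + ["</s>"]  # hypothetical end-of-sentence marker
    decoder_input = [lang_id] + target_words
    return decoder_input, labels


doc = [
    "The UN chief spoke today .",
    "He said there is no military solution in Syria .",
    "Talks resume next week .",
]
noisy = noise_document(doc)
original = " ".join(doc).split()
dec_in, labels = decoder_io(original)
print(" ".join(noisy))
print(dec_in[:4], labels[:4])
```

The encoder would see the noised words, while the decoder is trained to emit `labels` given `dec_in`, so at every position it predicts the next token of the original text.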


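## Checking

The snippet below checks that `nguyenvulebinh/mbart-large-50-latin-only` produces the same generated text and the same final encoder and decoder hidden states as the original `facebook/mbart-large-50` on a masked English sentence.
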
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Reference: the original mBART-50 checkpoint and its tokenizer.
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/mbart-large-50')
tokenizer = AutoTokenizer.from_pretrained('facebook/mbart-large-50')

src_text = "UN Chief Says There Is <mask> Military Solution <mask> Syria"
encoded = tokenizer(src_text, return_tensors="pt")
# Force English as the first generated token and keep hidden states for comparison.
generated_output = model.generate(**encoded, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
                                  return_dict_in_generate=True, return_dict=True, output_hidden_states=True)
text_output = tokenizer.batch_decode(generated_output.sequences, skip_special_tokens=True)

# Candidate: the Latin-only checkpoint, run on the same input with the same settings.
new_model = AutoModelForSeq2SeqLM.from_pretrained('nguyenvulebinh/mbart-large-50-latin-only')
new_tokenizer = AutoTokenizer.from_pretrained('nguyenvulebinh/mbart-large-50-latin-only')
new_encoded = new_tokenizer(src_text, return_tensors="pt")
new_generated_output = new_model.generate(**new_encoded, forced_bos_token_id=new_tokenizer.lang_code_to_id["en_XX"],
                                          return_dict_in_generate=True, return_dict=True, output_hidden_states=True)
new_text_output = new_tokenizer.batch_decode(new_generated_output.sequences, skip_special_tokens=True)

# Both models must agree on the decoded text and on the last-layer hidden states.
assert text_output == new_text_output
assert torch.equal(generated_output.encoder_hidden_states[-1], new_generated_output.encoder_hidden_states[-1])
assert torch.equal(generated_output.decoder_hidden_states[-1][-1], new_generated_output.decoder_hidden_states[-1][-1])

print(new_text_output)
# ['UN Chief Says There Is  No Military Solution  to the War in Syria']

```



## Languages covered
English (en_XX)


## BibTeX entry and citation info
```
@article{tang2020multilingual,
    title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
    author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
    year={2020},
    eprint={2008.00401},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```