# T5-EN-VI-BASE: Pretraining Text-To-Text Transfer Transformer for English-Vietnamese Translation

## Dataset

The *IWSLT'15 English-Vietnamese* dataset from the [Stanford NLP group](https://nlp.stanford.edu/projects/nmt/) is used.

For all experiments the corpus was split into training, development and test sets:

| Data set    | Sentences | Download |
| ----------- | --------- | -------- |
| Training    | 133,317   | via [GitHub](https://github.com/stefan-it/nmt-en-vi/raw/master/data/train-en-vi.tgz) or located in `data/train-en-vi.tgz` |
| Development | 1,553     | via [GitHub](https://github.com/stefan-it/nmt-en-vi/raw/master/data/dev-2012-en-vi.tgz) or located in `data/dev-2012-en-vi.tgz` |
| Test        | 1,268     | via [GitHub](https://github.com/stefan-it/nmt-en-vi/raw/master/data/test-2013-en-vi.tgz) or located in `data/test-2013-en-vi.tgz` |
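
Once the archives are extracted, the parallel data is typically a pair of line-aligned text files, one sentence per line in each language. The sketch below shows one way to read such files into sentence pairs; the file names `train.en`/`train.vi` are an assumption about the archive layout, so adjust them to the actual contents of the `.tgz` files.

```python
import pathlib
import tempfile

def load_pairs(en_path, vi_path):
    """Read two line-aligned files into a list of (en, vi) sentence pairs."""
    with open(en_path, encoding="utf-8") as f_en, open(vi_path, encoding="utf-8") as f_vi:
        return [(en.strip(), vi.strip()) for en, vi in zip(f_en, f_vi)]

# Tiny self-contained demo with two made-up sentence pairs.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "train.en").write_text("Hello .\nThank you .\n", encoding="utf-8")
(tmp / "train.vi").write_text("Xin chào .\nCảm ơn .\n", encoding="utf-8")

pairs = load_pairs(tmp / "train.en", tmp / "train.vi")
print(len(pairs))  # 2 aligned pairs
```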

## Results

Results on the test set:

| Model | BLEU (beam search) |
| ----- | ------------------ |
| [Luong & Manning (2015)](https://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf) | 23.30 |
| Sequence-to-sequence model with attention | 26.10 |
| Neural Phrase-based Machine Translation [Huang et al. (2017)](https://arxiv.org/abs/1706.05565) | 27.69 |
| Neural Phrase-based Machine Translation + LM [Huang et al. (2017)](https://arxiv.org/abs/1706.05565) | 28.07 |
| t5-en-vi-small (pretraining, without training data) | **28.46** (cased) / **29.23** (uncased) |
| t5-en-vi-small (fine-tuning with training data) | **32.12** (cased) / **32.92** (uncased) |
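
As an illustration of what the BLEU column measures, here is a minimal single-reference BLEU-4 sketch: modified n-gram precision combined with a brevity penalty. The "uncased" numbers simply lowercase both hypothesis and reference before scoring. This is not the exact scorer used for the table, just an illustrative toy implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU-4 with brevity penalty (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(cand, n) & ngrams(ref, n)).values())  # clipped counts
        total = max(sum(ngrams(cand, n).values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("the cat sat on the mat", "the cat sat on the mat")
print(round(100 * score, 2))  # identical sentences score 100.0
```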

#### Example usage

```python
import torch
from transformers import MT5ForConditionalGeneration, T5Tokenizer

# Use a GPU if one is available, otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

# Load the pretrained model and tokenizer from the Hugging Face Hub.
model = MT5ForConditionalGeneration.from_pretrained("NlpHUST/t5-en-vi-small")
tokenizer = T5Tokenizer.from_pretrained("NlpHUST/t5-en-vi-small")
model.to(device)
model.eval()

src = "In school , we spent a lot of time studying the history of Kim Il-Sung , but we never learned much about the outside world , except that America , South Korea , Japan are the enemies ."
tokenized_text = tokenizer.encode(src, return_tensors="pt").to(device)

# Translate with beam search.
summary_ids = model.generate(
    tokenized_text,
    max_length=128,
    num_beams=5,
    repetition_penalty=2.5,
    length_penalty=1.0,
    early_stopping=True,
)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(output)
```

#### Output

```
Ở trường, chúng tôi dành nhiều thời gian để nghiên cứu về lịch sử Kim Il-Sung, nhưng chúng tôi chưa bao giờ học được nhiều về thế giới bên ngoài, ngoại trừ Mỹ, Hàn Quốc, Nhật Bản là kẻ thù.
```