Papers
arxiv:2210.05610

MTet: Multi-domain Translation for English and Vietnamese

Published on Oct 11, 2022
Authors:
,
,
,
,
,
,

Abstract

We introduce MTet, the largest publicly available parallel corpus for English-Vietnamese translation. MTet consists of 4.2M high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Combining with previous works on English-Vietnamese translation, we grow the existing parallel dataset to 6.2M sentence pairs. We also release the first pretrained model EnViT5 for English and Vietnamese languages. Combining both resources, our model significantly outperforms previous state-of-the-art results by up to 2 points in translation BLEU score, while being 1.6 times smaller.

Community

Sign up or log in to comment

Models citing this paper 1

Datasets citing this paper 2

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2210.05610 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.