James-WYang committed on
Commit
83dd4c3
1 Parent(s): f0dd625

Update README.md

Files changed (1)
  1. README.md +3 -13
README.md CHANGED
@@ -1,18 +1,8 @@
  ---
  license: lgpl-3.0
  ---
- **BigTrans: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages**
- ### Large-scale Parallel Dataset Construction
- To enhance the language capabilities of the Chinese LLaMA model to support 102 languages, we constructed a comprehensive parallel corpus covering 102 languages and employed it to continue training the foundation model. The dataset was compiled from multiple sources, including widely available public parallel corpora and in-house datasets. The public datasets used in our study include IWSLT, WMT, CCMT, and OPUS-100, which form the initial corpus of our dataset.
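As a rough sketch of how such sources could be merged into a single language-pair-keyed corpus, assuming a hypothetical `<src>-<tgt>.tsv` file layout (none of these names come from the BigTrans release):

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

def load_tsv(path: Path) -> List[Tuple[str, str]]:
    """Read one parallel file: each line is 'source<TAB>target'."""
    pairs = []
    for line in path.read_text(encoding="utf-8").splitlines():
        src, tgt = line.split("\t", maxsplit=1)
        pairs.append((src, tgt))
    return pairs

def merge_corpora(roots: List[Path]) -> Dict[str, List[Tuple[str, str]]]:
    """Merge dumps such as IWSLT/WMT/CCMT/OPUS-100 into one corpus keyed by
    language pair (assumed layout: <root>/<src>-<tgt>.tsv), de-duplicating
    exact sentence pairs along the way."""
    corpus: Dict[str, set] = defaultdict(set)
    for root in roots:
        for path in root.glob("*.tsv"):
            corpus[path.stem].update(load_tsv(path))
    return {pair: sorted(sents) for pair, sents in corpus.items()}
```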
-
- To illustrate the distribution of the corpus, we present a visual representation of the language-pair distribution within the multilingual datasets. The imbalance between high-resource and low-resource language pairs remains a prominent concern in the current corpus.
-
- ### Incremental Multilingual Pre-training
- In this incremental pre-training method, we gradually expose the model to language pairs in a curriculum-like manner. Initially, the model is exposed to high-resource language pairs, allowing it to establish a solid foundation in those languages. Subsequently, we progressively introduce low-resource language pairs, enabling the model to gradually expand its knowledge and proficiency in these languages.
-
- Specifically, our incremental pre-training method follows three steps. First, we set the sample-interval size and divide the language pairs into distinct intervals based on the number of instances per language pair. Second, we calculate the sample mean over the language pairs in each interval. Third, we dynamically determine when to add the language-pair samples of the next interval according to the sample mean of the previous interval, as sketched below.
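A minimal sketch of this interval-based curriculum, assuming hypothetical names and a simple trigger rule (the released implementation may differ):

```python
from typing import Dict, Iterator, List, Tuple

def build_intervals(pair_counts: Dict[str, int], interval_size: int) -> List[List[str]]:
    """Step 1: sort language pairs by instance count (high-resource first)
    and split them into intervals of `interval_size` pairs."""
    ordered = sorted(pair_counts, key=pair_counts.get, reverse=True)
    return [ordered[i:i + interval_size] for i in range(0, len(ordered), interval_size)]

def sample_mean(interval: List[str], pair_counts: Dict[str, int]) -> float:
    """Step 2: mean number of instances over the pairs in one interval."""
    return sum(pair_counts[p] for p in interval) / len(interval)

def curriculum(pair_counts: Dict[str, int], interval_size: int) -> Iterator[Tuple[int, List[str]]]:
    """Step 3 (assumed trigger): the next interval joins the training mix
    once the step count passes the previous interval's sample mean."""
    intervals = build_intervals(pair_counts, interval_size)
    active: List[str] = []
    step = 0
    for interval in intervals:
        active.extend(interval)
        yield step, list(active)  # train on `active` pairs from `step` onward
        step += int(sample_mean(interval, pair_counts))
```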
-
- ### Experiments
- To verify the effectiveness of our BigTrans model, we conduct preliminary multilingual translation experiments on all 102 languages, comparing BigTrans with both Google Translate and ChatGPT. Since the automatic metric BLEU is often criticized for its poor correlation with human judgments of machine translation quality, we further employ GPT-4, which shows a high correlation with human judgments, as an evaluator, designing well-defined prompts that make GPT-4 act like a human evaluator. The experiments show that BigTrans performs comparably with Google Translate and ChatGPT in many languages, and even outperforms ChatGPT in 8 language pairs.
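The evaluation prompts themselves are not included here; the following is a minimal sketch of what such a GPT-4 judging prompt could look like. The wording, field names, and 0-100 scale are illustrative assumptions, not the authors' actual prompts:

```python
# Hypothetical GPT-4-as-judge template; the rubric and wording are
# assumptions for illustration, not the prompts used for BigTrans.
JUDGE_TEMPLATE = """You are a professional translator acting as an evaluator.
Source sentence ({src_lang}): {source}
Candidate translation ({tgt_lang}): {candidate}

Rate the candidate's adequacy and fluency on a scale of 0 to 100,
as a careful human evaluator would. Reply with the number only."""

def build_judge_prompt(src_lang: str, tgt_lang: str,
                       source: str, candidate: str) -> str:
    """Fill the template for one source/translation pair."""
    return JUDGE_TEMPLATE.format(src_lang=src_lang, tgt_lang=tgt_lang,
                                 source=source, candidate=candidate)
```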
 
  **More details can be found at https://github.com/ZNLP/BigTrans and https://arxiv.org/abs/2305.18098**
 
  ---
  license: lgpl-3.0
  ---
+ # BigTrans: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages
+ Large language models (LLMs) demonstrate promising translation performance across various natural languages. However, many LLMs, especially open-source ones such as BLOOM and LLaMA, are English-dominant and support only dozens of natural languages, so the potential of LLMs for language translation remains underexplored. In this work, we present BigTrans, which adapts LLaMA, a model covering only 20 languages, and enhances it with multilingual translation capability for more than 100 languages. BigTrans is built upon LLaMA-13B and optimized in three steps. First, we continue training LLaMA with massive Chinese monolingual data. Second, we continue training the model with a large-scale parallel dataset covering 102 natural languages. Third, we instruction-tune the foundation model with multilingual translation instructions, yielding our BigTrans model. Preliminary experiments on multilingual translation show that BigTrans performs comparably with ChatGPT and Google Translate in many languages and even outperforms ChatGPT in 8 language pairs. We release the BigTrans model and hope it can advance research progress.
 
  **More details can be found at https://github.com/ZNLP/BigTrans and https://arxiv.org/abs/2305.18098**