---
tags:
- Multilingual
---
### Model Sources

- **Paper**: "LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages"

- **Link**: https://arxiv.org/pdf/2407.05975

- **Repository**: https://github.com/CONE-MT/LLaMAX/

### Model Description

LLaMAX is a language model with powerful multilingual capabilities, achieved without loss of instruction-following capability.

We collected extensive training sets in 102 languages for continued pre-training of Llama2 and leveraged the English instruction fine-tuning dataset, Alpaca, to fine-tune its instruction-following capabilities.

### 🔥 Effortless Multilingual Translation with a Simple Prompt

LLaMAX supports translation between more than 100 languages, surpassing the performance of similarly scaled LLMs.

```python
def Prompt_template(query, src_language, trg_language):
    instruction = f'Translate the following sentences from {src_language} to {trg_language}.'
    prompt = (
        'Below is an instruction that describes a task, paired with an input that provides further context. '
        'Write a response that appropriately completes the request.\n'
        f'### Instruction:\n{instruction}\n'
        f'### Input:\n{query}\n### Response:'
    )
    return prompt
```
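
For example, rendering the template for the Chinese-to-English request used below produces the following prompt string:

```python
# Render the prompt for a Chinese -> English translation request.
print(Prompt_template("你好,今天是个好日子", "Chinese", "English"))
# Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
# ### Instruction:
# Translate the following sentences from Chinese to English.
# ### Input:
# 你好,今天是个好日子
# ### Response:
```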

Then run the following code to perform the translation:

```python
from transformers import AutoTokenizer, LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

query = "你好,今天是个好日子"
prompt = Prompt_template(query, 'Chinese', 'English')
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 30 new tokens for the translation.
generate_ids = model.generate(inputs.input_ids, max_new_tokens=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# => "Hello, today is a good day"
```
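
On a single GPU it usually helps to load the 7B model in half precision and keep the inputs on the same device as the model. The snippet below is a minimal sketch of that variant of the loading step; it assumes a CUDA-capable GPU and, for `device_map="auto"`, the `accelerate` package.

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Sketch: half-precision loading on a GPU (assumes CUDA and the accelerate package).
model = LlamaForCausalLM.from_pretrained(
    PATH_TO_CONVERTED_WEIGHTS, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

# Move the tokenized prompt to the model's device before generating.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generate_ids = model.generate(inputs.input_ids, max_new_tokens=30)
```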

### 🔥 Excellent Translation Performance

LLaMAX achieves an average spBLEU improvement of more than **10 points** over the LLaMA2-Alpaca model on the Flores-101 dataset.

| System | Size | en-X (COMET) | en-X (BLEU) | zh-X (COMET) | zh-X (BLEU) | de-X (COMET) | de-X (BLEU) | ne-X (COMET) | ne-X (BLEU) | ar-X (COMET) | ar-X (BLEU) | az-X (COMET) | az-X (BLEU) | ceb-X (COMET) | ceb-X (BLEU) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B-Alpaca | 7B | 52.83 | 9.44 | 51.29 | 3.80 | 51.47 | 6.82 | 46.59 | 1.31 | 46.76 | 2.84 | 48.63 | 1.36 | 41.02 | 2.69 |
| LLaMA2-13B-Alpaca | 13B | 57.16 | 11.85 | 53.93 | 6.25 | 54.70 | 9.42 | 51.47 | 3.11 | 50.73 | 5.23 | 50.68 | 2.74 | 47.86 | 4.96 |
| LLaMAX2-7B-Alpaca | 7B | 76.66 | 23.17 | 73.54 | 14.17 | 73.82 | 18.96 | 74.64 | 14.49 | 72.00 | 15.82 | 70.91 | 11.34 | 68.67 | 15.53 |

| System | Size | X-en (COMET) | X-en (BLEU) | X-zh (COMET) | X-zh (BLEU) | X-de (COMET) | X-de (BLEU) | X-ne (COMET) | X-ne (BLEU) | X-ar (COMET) | X-ar (BLEU) | X-az (COMET) | X-az (BLEU) | X-ceb (COMET) | X-ceb (BLEU) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B-Alpaca | 7B | 65.85 | 16.44 | 56.53 | 4.46 | 56.76 | 9.01 | 34.96 | 1.03 | 44.10 | 2.18 | 40.67 | 0.63 | 45.69 | 1.73 |
| LLaMA2-13B-Alpaca | 13B | 68.72 | 19.69 | 64.46 | 8.80 | 62.86 | 12.57 | 38.88 | 2.16 | 52.08 | 4.48 | 41.18 | 0.87 | 48.47 | 2.51 |
| LLaMAX2-7B-Alpaca | 7B | 80.55 | 30.63 | 75.52 | 13.53 | 74.47 | 19.26 | 67.36 | 15.47 | 75.40 | 15.32 | 72.03 | 10.27 | 65.05 | 16.11 |
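
spBLEU here is BLEU computed on SentencePiece-tokenized text, as is standard for Flores-101. Purely for reference (this is not the evaluation pipeline used in the paper), a minimal sketch of scoring translations with `sacrebleu` might look like this, assuming a sacrebleu release that ships the `flores101` SPM tokenizer:

```python
# Illustrative only: spBLEU is corpus BLEU computed after SentencePiece tokenization.
# Assumes a sacrebleu version that provides the "flores101" SPM tokenizer.
import sacrebleu

hypotheses = ["Hello, today is a good day"]      # model outputs (placeholder)
references = [["Hello, today is a good day"]]    # reference translations (placeholder)

spbleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")
print(f"spBLEU: {spbleu.score:.2f}")
```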

### 🔥 Effective Base Model for Multilingual Tasks

LLaMAX preserves its efficacy on general tasks and improves performance on multilingual tasks.
We fine-tuned LLaMAX using only the English training set of each downstream task, and it still shows significant improvements on non-English test sets. We provide fine-tuned LLaMAX models for the following three tasks (a minimal loading sketch follows the list):

- **Math Reasoning**: https://huggingface.co/LLaMAX/LLaMAX2-7B-MetaMath

- **Commonsense Reasoning**: https://huggingface.co/LLaMAX/LLaMAX2-7B-X-CSQA

- **Natural Language Inference**: https://huggingface.co/LLaMAX/LLaMAX2-7B-XNLI
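
These task-specific checkpoints can be loaded with the standard `transformers` API. The snippet below is a minimal sketch using the math-reasoning checkpoint; the Alpaca-style prompt format is an assumption carried over from the template above, so check each model card for the exact format it expects.

```python
from transformers import AutoTokenizer, LlamaForCausalLM

# Minimal sketch: load a task-specific LLaMAX checkpoint from the Hugging Face Hub.
# The Alpaca-style prompt format is an assumption; see the individual model cards.
model_id = "LLaMAX/LLaMAX2-7B-MetaMath"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id)
```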

### Supported Languages

Afrikaans (af), Amharic (am), Arabic (ar), Armenian (hy), Assamese (as), Asturian (ast), Azerbaijani (az), Belarusian (be), Bengali (bn), Bosnian (bs), Bulgarian (bg), Burmese (my), Catalan (ca), Cebuano (ceb), Chinese Simpl (zho), Chinese Trad (zho), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Filipino (tl), Finnish (fi), French (fr), Fulah (ff), Galician (gl), Ganda (lg), Georgian (ka), German (de), Greek (el), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Igbo (ig), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Javanese (jv), Kabuverdianu (kea), Kamba (kam), Kannada (kn), Kazakh (kk), Khmer (km), Korean (ko), Kyrgyz (ky), Lao (lo), Latvian (lv), Lingala (ln), Lithuanian (lt), Luo (luo), Luxembourgish (lb), Macedonian (mk), Malay (ms), Malayalam (ml), Maltese (mt), Maori (mi), Marathi (mr), Mongolian (mn), Nepali (ne), Northern Sotho (ns), Norwegian (no), Nyanja (ny), Occitan (oc), Oriya (or), Oromo (om), Pashto (ps), Persian (fa), Polish (pl), Portuguese (pt), Punjabi (pa), Romanian (ro), Russian (ru), Serbian (sr), Shona (sn), Sindhi (sd), Slovak (sk), Slovenian (sl), Somali (so), Sorani Kurdish (ku), Spanish (es), Swahili (sw), Swedish (sv), Tajik (tg), Tamil (ta), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Umbundu (umb), Urdu (ur), Uzbek (uz), Vietnamese (vi), Welsh (cy), Wolof (wo), Xhosa (xh), Yoruba (yo), Zulu (zu)

### Model Index

We provide multiple versions of the LLaMAX model; the links are as follows:

| Model | LLaMAX | LLaMAX-Alpaca |
|---------|----------------------------------------------------------|-----------------------------------------------------------------|
| Llama-2 | [Link](https://huggingface.co/LLaMAX/LLaMAX2-7B) | [Link](https://huggingface.co/LLaMAX/LLaMAX2-7B-Alpaca) |
| Llama-3 | [Link](https://huggingface.co/LLaMAX/LLaMAX3-8B) | [Link](https://huggingface.co/LLaMAX/LLaMAX3-8B-Alpaca) |

### Citation

If our model helps your work, please cite this paper:

```bibtex
@misc{lu2024llamaxscalinglinguistichorizons,
      title={LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages},
      author={Yinquan Lu and Wenhao Zhu and Lei Li and Yu Qiao and Fei Yuan},
      year={2024},
      eprint={2407.05975},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.05975},
}
```