---
license: other
license_name: seallm
license_link: https://huggingface.co/SeaLLMs/SeaLLM-13B-Chat/blob/main/LICENSE
language:
- en
- zh
- hi
- es
- fr
- ar
- bn
- ru
- pt
- id
- ur
- de
- ja
- sw
- ta
- tr
- ko
- vi
- jv
- it
- ha
- th
- fa
- tl
- my
tags:
- multilingual
- babel
---

# *Babel*: Open Multilingual Large Language Models Serving Over 90% of Global Speakers

<p align="center">
  <a href="https://babel-llm.github.io/babel-llm/" target="_blank" rel="noopener">Website</a>
  &nbsp;&nbsp;
  <a href="https://huggingface.co/Tower-Babel/Babel-9B/" target="_blank" rel="noopener">Model</a>
  &nbsp;&nbsp;
  <a href="https://github.com/babel-llm/babel-llm" target="_blank" rel="noopener">Github</a>
  &nbsp;&nbsp;
  <a href="https://github.com/babel-llm/babel-llm/blob/main/paper/babel.pdf" target="_blank" rel="noopener">Technical Report</a>
</p>

## Introduction

We introduce **Babel**, a multilingual LLM that covers the top 25 languages by number of speakers: English, Chinese, Hindi, Spanish, Arabic, French, Bengali, Portuguese, Russian, Urdu, Indonesian, German, Japanese, Swahili, Filipino, Tamil, Vietnamese, Turkish, Italian, Javanese, Korean, Hausa, Persian, Thai, and Burmese. Together, these 25 languages are spoken by over 90% of the global population and include many languages neglected by other open multilingual LLMs. Unlike traditional continued-pretraining approaches, Babel expands its parameter count through a layer extension technique that raises its performance ceiling.

We introduce two variants:
- **Babel-9B**, designed for efficient single-GPU inference and fine-tuning
- **Babel-83B**, which sets a new standard for open multilingual LLMs

Extensive evaluations on multilingual tasks demonstrate Babel's superior performance compared to open LLMs of comparable size. In addition, using existing supervised fine-tuning datasets, Babel achieves remarkable performance, with **Babel-9B-Chat** leading among 10B-sized LLMs and **Babel-83B-Chat** setting a new standard for open LLMs, performing comparably to GPT-4o on certain tasks.

This page introduces the **Babel-9B-Base** model.

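For readers curious about the layer extension idea mentioned above, the snippet below is a minimal sketch of one generic depth up-scaling recipe (duplicating a block of decoder layers in a LLaMA/Qwen-style model). The base checkpoint, layer indices, and copy count are illustrative assumptions, not Babel's actual configuration; the exact recipe is described in the technical report.

```python
# Generic depth up-scaling sketch (NOT Babel's exact recipe): duplicate a block of
# decoder layers to grow the parameter count, then continue pretraining the
# extended model on multilingual data.
import copy

import torch
from transformers import AutoModelForCausalLM

# Illustrative base checkpoint; Babel's actual base model may differ.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B", torch_dtype=torch.bfloat16)

layers = model.model.layers               # decoder stack of a LLaMA/Qwen-style model
start, num_copied = len(layers) // 2, 4   # illustrative choice of block to duplicate

extended = (
    list(layers[: start + num_copied])
    + [copy.deepcopy(layer) for layer in layers[start : start + num_copied]]
    + list(layers[start + num_copied :])
)
model.model.layers = torch.nn.ModuleList(extended)
model.config.num_hidden_layers = len(model.model.layers)

# Re-index attention layers so KV-cache bookkeeping stays consistent at inference.
for i, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = i

print(f"Extended model: {model.config.num_hidden_layers} decoder layers")
```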

## Supervised Fine-Tuning (SFT) Data

We primarily leverage open-source multilingual SFT training corpora and translated SFT training data. Specifically, we utilize WildChat, a dataset comprising 1 million user-ChatGPT conversations with over 2.5 million interaction turns. Additionally, we employ Everything Instruct Multilingual, an extensive Alpaca-instruct-formatted dataset covering a diverse range of topics.

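As a rough sketch of how such a mixture can be assembled with the `datasets` library (the repository IDs, column names, and formatting below are assumptions for illustration, not the exact preprocessing used for Babel):

```python
# Hedged sketch: assemble a multilingual SFT mixture from public corpora.
# Repo IDs and column names are assumptions, not Babel's exact pipeline.
from datasets import load_dataset, concatenate_datasets

# WildChat user-ChatGPT conversations (repo ID assumed: allenai/WildChat-1M).
wildchat = load_dataset("allenai/WildChat-1M", split="train")

def wildchat_to_messages(example):
    # Keep only the role/content fields of each conversation turn.
    return {"messages": [{"role": t["role"], "content": t["content"]}
                         for t in example["conversation"]]}

# Alpaca-style corpus (repo ID assumed: rombodawg/Everything_Instruct_Multilingual).
everything = load_dataset("rombodawg/Everything_Instruct_Multilingual", split="train")

def alpaca_to_messages(example):
    # Merge instruction and optional input into a single user turn.
    user = example["instruction"]
    if example.get("input"):
        user += "\n\n" + example["input"]
    return {"messages": [{"role": "user", "content": user},
                         {"role": "assistant", "content": example["output"]}]}

sft_mix = concatenate_datasets([
    wildchat.map(wildchat_to_messages, remove_columns=wildchat.column_names),
    everything.map(alpaca_to_messages, remove_columns=everything.column_names),
]).shuffle(seed=42)
print(sft_mix[0]["messages"][0])
```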

## Evaluation

We employ multilingual tasks across several categories:

1. **World Knowledge:**
   MMMLU ([OpenAI 2024](https://huggingface.co/datasets/openai/MMMLU)), a human-translated version of MMLU ([Hendrycks et al. 2021](https://arxiv.org/abs/2009.03300)) available in 14 languages. For languages not covered, we use Google Translate ([Google Translate API](https://translate.google.com/)) to generate translations. Additionally, we include M3Exam ([Zhang et al. 2023](https://arxiv.org/abs/2306.05179)), which consists of authentic human exam questions collected from various countries, covering multiple subjects and educational levels.

2. **Reasoning:**
   MGSM ([Shi et al. 2022](https://arxiv.org/abs/2210.03057)) and XCOPA ([Ponti et al. 2020](https://aclanthology.org/2020.emnlp-main.185/)).

3. **Understanding:**
   XNLI ([Conneau et al. 2018](https://arxiv.org/abs/1809.05053)).

4. **Translation:**
   Flores-200 ([NLLB Team 2022](https://arxiv.org/abs/2207.04672)).

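As a quick reference for the benchmarks above, here is a minimal sketch that loads one MMMLU language split and formats a zero-shot multiple-choice prompt. The config name, split, column names, and prompt template are assumptions based on the public dataset card rather than the exact evaluation setup used for Babel.

```python
# Hedged sketch: load one MMMLU language split and build a zero-shot prompt.
# Config name ("FR_FR"), split, and column names are assumptions from the dataset card.
from datasets import load_dataset

mmmlu_fr = load_dataset("openai/MMMLU", "FR_FR", split="test")

def to_prompt(row):
    # Simple multiple-choice template; the actual evaluation prompt may differ.
    return (
        f"{row['Question']}\n"
        f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n"
        "Answer with A, B, C, or D."
    )

row = mmmlu_fr[0]
print(to_prompt(row))
print("Gold answer:", row["Answer"])

# Accuracy is exact match between the predicted letter and the gold letter,
# typically macro-averaged over languages to obtain a single MMMLU score.
```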

### Get started with `Transformers`

To quickly try the model, we show below how to run inference with `transformers`. Make sure you have installed a recent `transformers` version (>= 4.40).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Tower-Babel/Babel-9B-Chat",
    torch_dtype=torch.bfloat16,
    device_map=device
)
tokenizer = AutoTokenizer.from_pretrained("Tower-Babel/Babel-9B-Chat")

# prepare messages for the model
prompt = "Hiii How are you?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# apply the chat template and tokenize
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
print(f"Formatted text:\n {text}")
print(f"Model input:\n {model_inputs}")

# generate, then strip the prompt tokens from the output before decoding
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True, eos_token_id=tokenizer.eos_token_id)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(f"Response:\n {response[0]}")
```
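
Alternatively, for quick experiments, the high-level `pipeline` API can be used. This is an optional sketch rather than the recommended path, and it assumes a recent `transformers` release that accepts chat-style message lists directly:

```python
import torch
from transformers import pipeline

# Chat-style message lists are accepted by the text-generation pipeline
# in recent transformers releases.
pipe = pipeline(
    "text-generation",
    model="Tower-Babel/Babel-9B-Chat",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Apa kabar hari ini?"},  # Indonesian: "How are you today?"
]
out = pipe(messages, max_new_tokens=256, do_sample=True)
print(out[0]["generated_text"][-1]["content"])  # the final turn is the assistant reply
```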

### Performance of 10B-Size Instruct Models vs. Babel-9B-Chat

| **Dataset** | **GLM4-9B** | **Gemma2-9B** | **Mistral-12B** | **Llama3.1-8B** | **Qwen2.5-7B** | **Babel-9B** |
|-------------|-------------|---------------|-----------------|-----------------|----------------|--------------|
| MMMLU       | 53.9        | 59.6          | 52.0            | 50.6            | 56.0           | **59.8**     |
| M3Exam      | 55.0        | **63.2**      | 54.1            | 54.2            | 58.0           | 62.9         |
| XCOPA       | 86.2        | 87.4          | 83.5            | 82.1            | 80.4           | **88.9**     |
| MGSM        | 52.2        | 62.4          | 41.4            | 37.2            | 59.1           | **64.3**     |
| XNLI        | 66.2        | 66.7          | 56.1            | 55.8            | 68.3           | **72.4**     |
| Flores-200  | 50.8        | 54.8          | 48.9            | 47.3            | 45.8           | **56.7**     |
| *Average*   | 60.7        | 65.7          | 56.0            | 54.5            | 61.3           | **67.5**     |

**Note that results are achieved purely by leveraging publicly available datasets, showcasing the robust foundational performance of Babel base models. We believe that incorporating more SFT data across diverse types, domains, and formats, along with additional alignment data and preference tuning, will further enhance the chat version beyond its current capabilities.**

## Acknowledgement
We would like to thank Guanzheng Chen for assisting with the implementation of the training codebase. Our special thanks go to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi, and Tara Devina Putri, who contributed to building, evaluating, and fact-checking our sampled pretraining dataset. We also appreciate Fan Wang, Jiasheng Tang, Xin Li, and Hao Zhang for their efforts in coordinating computing resources.

## Citation

If you find our project useful, we hope you will kindly star our repo and cite our work as follows:
```bibtex
@article{babel2025,
  author = {Yiran Zhao and Chaoqun Liu and Yue Deng and Jiahao Ying and Mahani Aljunied and Zhaodonghui Li and Lidong Bing and Hou Pong Chan and Yu Rong and Deli Zhao and Wenxuan Zhang},
  title  = {Babel: Open Multilingual Large Language Models Serving Over 90\% of Global Speakers},
  year   = {2025},
  note   = {Available online: \url{https://github.com/babel-llm/babel-llm/blob/main/paper/babel.pdf}}
}
```
Corresponding Author: wxzhang@sutd.edu.sg