samuelcahyawijaya committed on
Commit
a8f2ab8
1 Parent(s): ccd30de

Update README.md

  1. README.md +304 -0
README.md CHANGED
@@ -1,3 +1,307 @@
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - id
5
+ - su
6
+ - jv
7
  ---
8
+ # **Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages**
9
+ Cendol is an open-source collection of fine-tuned generative large language models for Indonesian languages, covering decoder-only and encoder-decoder transformer architectures and ranging in scale from 300 million to 13 billion parameters.
10
+
11
+ This is the repository for the **580M Cendol mT5-base Instruct model**. Links to other models can be found below.
12
+
13
+ ## Model Details
14
+ *Note*: Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/)
15
+
16
+ **Overview**
17
+
18
+ IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 300 million to 13 billion parameters.
19
+
20
+ Cendol models cover two instruction-tuned versions:
21
+ 1. Cendol-Instruct, which is instruction-tuned on task-specific NLP data such as sentiment analysis, topic modeling, machine translation, summarization, question answering, and paraphrasing.
22
+ 2. Cendol-Chat, which is further instruction-tuned from **Cendol-Instruct** on general knowledge and human-centric prompts.
23
+
24
+ Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversation. Cendol outperforms open-source multilingual and region-specific LLMs by a large margin on most of the benchmarks we tested, and the smaller Cendol versions (<1B parameters) are highly competitive with other LLMs that have 7B parameters.
25
+
26
+ **Model Developers**: IndoNLP
27
+
28
+ **Variations**
29
+
30
+ Cendol is built on two base models (mT5 and LLaMA-2), each available in a range of parameter sizes. mT5-based Cendol comes in 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) sizes, while LLaMA-2-based Cendol comes in 7B (LLaMA2-7B) and 13B (LLaMA2-13B) sizes. Both come in Cendol-Instruct and Cendol-Chat variations. All 13B-parameter models are tuned with LoRA, while the others are fully fine-tuned.
31
+
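To give a rough sense of why LoRA is attractive for the 13B models, the sketch below compares the trainable parameter count of a single weight matrix under full fine-tuning and under LoRA. The matrix shape and rank are hypothetical values chosen for illustration only; they are not the settings used to train Cendol.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix W (d_out x d_in) is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters when W is frozen and only the low-rank factors
    B (d_out x rank) and A (rank x d_in) are learned."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and LoRA rank, for illustration only.
d_out, d_in, rank = 4096, 4096, 8

full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"LoRA / full: {lora / full:.4%}")
```

Fewer trainable parameters reduce optimizer memory, but they do not by themselves make the forward pass of a 13B model cheaper.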
32
+ In our paper, we show that adapting region-specific LLMs with LoRA is both ineffective and inefficient: the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models while being 3x slower to train and 4x slower at inference. As an alternative to LoRA, we show that vocabulary substitution is an effective and efficient strategy for region-specific adaptation, improving training and inference efficiency by **11.50%** and **18.71%**, respectively.
33
+ In terms of evaluation performance, we also show that the vocabulary-adapted model performs on par with the Cendol model trained with the original vocabulary. We also release the Indonesian vocabulary-adapted model, denoted `Indonesian-Vocab Instruct`.
34
+
35
+ **Input-Output**: Model inputs and outputs are text only.
36
+
37
+ **Model Architecture**
38
+
39
+ |Model|Training Data|Params|Tuning Strategy|LR|
40
+ |---|---|---|---|---|
41
+ |[Cendol mT5-small Instruct](https://huggingface.co/indonlp/cendol-mt5-small-inst)|[NusaT2T v1]()|300M|Fully-Finetuned|3.0 x 10<sup>-4</sup>|
42
+ |[Cendol mT5-base Instruct](https://huggingface.co/indonlp/cendol-mt5-base-inst)|[NusaT2T v1]()|580M|Fully-Finetuned|3.0 x 10<sup>-4</sup>|
43
+ |[Cendol mT5-large Instruct](https://huggingface.co/indonlp/cendol-mt5-large-inst)|[NusaT2T v1]()|1.2B|Fully-Finetuned|3.0 x 10<sup>-4</sup>|
44
+ |[Cendol mT5-xl Instruct](https://huggingface.co/indonlp/cendol-mt5-xl-inst)|[NusaT2T v1]()|3.7B|Fully-Finetuned|3.0 x 10<sup>-4</sup>|
45
+ |[Cendol mT5-xxl Instruct](https://huggingface.co/indonlp/cendol-mt5-xxl-merged-inst)|[NusaT2T v1]()|13B|LoRA|2.0 x 10<sup>-4</sup>|
46
+ |[Cendol LLaMA-2 (7B) Instruct](https://huggingface.co/indonlp/cendol-llama2-7b-inst)|[NusaT2T v1]()|7B|Fully-Finetuned|2.0 x 10<sup>-5</sup>|
47
+ |[Cendol LLaMA-2 (7B) Indonesian-Vocab Instruct](https://huggingface.co/indonlp/cendol-llama2-ind-vocab-inst)|[NusaT2T v1]()|7B|Fully-Finetuned|2.0 x 10<sup>-5</sup>|
48
+ |[Cendol LLaMA-2 (13B) Instruct](https://huggingface.co/indonlp/cendol-llama2-13b-merged-inst)|[NusaT2T v1]()|13B|LoRA|2.0 x 10<sup>-5</sup>|
49
+ |[Cendol mT5-small Chat](https://huggingface.co/indonlp/cendol-mt5-small-chat)|[NusaT2T v2]()|300M|Fully-Finetuned|3.0 x 10<sup>-5</sup>|
50
+ |[Cendol mT5-base Chat](https://huggingface.co/indonlp/cendol-mt5-base-chat)|[NusaT2T v2]()|580M|Fully-Finetuned|3.0 x 10<sup>-5</sup>|
51
+ |[Cendol mT5-large Chat](https://huggingface.co/indonlp/cendol-mt5-large-chat)|[NusaT2T v2]()|1.2B|Fully-Finetuned|3.0 x 10<sup>-5</sup>|
52
+ |[Cendol mT5-xl Chat](https://huggingface.co/indonlp/cendol-mt5-xl-chat)|[NusaT2T v2]()|3.7B|Fully-Finetuned|3.0 x 10<sup>-5</sup>|
53
+ |[Cendol mT5-xxl Chat](https://huggingface.co/indonlp/cendol-mt5-xxl-merged-chat)|[NusaT2T v2]()|13B|LoRA|2.0 x 10<sup>-4</sup>|
54
+ |[Cendol LLaMA-2 (7B) Chat](https://huggingface.co/indonlp/cendol-llama2-7b-chat)|[NusaT2T v2]()|7B|Fully-Finetuned|1.0 x 10<sup>-5</sup>|
55
+ |[Cendol LLaMA-2 (13B) Chat](https://huggingface.co/indonlp/cendol-llama2-13b-merged-chat)|[NusaT2T v2]()|13B|LoRA|2.0 x 10<sup>-4</sup>|
56
+
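The table above mixes encoder-decoder (mT5) and decoder-only (LLaMA-2) checkpoints, which are loaded through different Hugging Face `transformers` Auto classes. The following is a minimal sketch of that choice, assuming the `transformers` and `torch` packages are installed. The helper names `auto_class_name` and `generate` are hypothetical, and since this model card does not document a prompt template, the prompt is passed through unchanged.

```python
def auto_class_name(repo_id: str) -> str:
    """Pick the transformers Auto class name from a Cendol repository ID.

    Hypothetical helper: it relies only on the naming pattern of the
    repository IDs listed in the table above.
    """
    name = repo_id.lower()
    if "mt5" in name:
        return "AutoModelForSeq2SeqLM"  # encoder-decoder checkpoints
    if "llama2" in name:
        return "AutoModelForCausalLM"   # decoder-only checkpoints
    raise ValueError(f"Unrecognized Cendol repository ID: {repo_id}")


def generate(repo_id: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Run one single-turn generation with the checkpoint at repo_id."""
    import transformers  # imported lazily so the helper above has no dependencies

    tokenizer = transformers.AutoTokenizer.from_pretrained(repo_id)
    model_cls = getattr(transformers, auto_class_name(repo_id))
    model = model_cls.from_pretrained(repo_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # For decoder-only models the decoded text also contains the prompt.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Downloads the 580M checkpoint described in this repository on first use.
    print(generate("indonlp/cendol-mt5-base-inst", "Terjemahkan ke bahasa Inggris: Selamat pagi."))
```

The example prompt is an arbitrary Indonesian instruction chosen for illustration; other instructions and the remaining repository IDs in the table can be substituted the same way.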
57
+ **Model Dates** Cendol was trained between October 2023 and January 2024.
58
+
59
+ **License** Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/)
60
+
61
+ **Research Paper** ["Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"](https://arxiv.org/abs/2404.06138)
62
+
63
+ ## Intended Use
64
+ **Intended Use Cases** Cendol is intended for research use, especially on Indonesian languages. Cendol models are intended for single-turn instructions: Cendol-Instruct models can be used for task-specific instructions, while Cendol-Chat models can be used for general-knowledge instructions.
65
+
66
+ **Out-of-scope Uses** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English and Indonesian languages. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.
67
+
68
+ ## Evaluation Results
69
+
70
+ In this section, we report the results for the Cendol models on large-scale NLU and NLG benchmarks. For all evaluations, we use our internal evaluation library.
71
+
72
+ #### NLU Performance
73
+ <img width="938" alt="NLU Performance" src="https://github.com/IndoNLP/indo-t0/assets/2826602/7656f005-f261-4982-ad06-f18dc57d5e3b">
74
+
75
+ #### NLG Performance
76
+ <img width="940" alt="NLG Performance" src="https://github.com/IndoNLP/indo-t0/assets/2826602/4942caea-35df-44e1-a95b-53a027c6115f">
77
+
78
+ #### Human evaluation
79
+ <img width="456" alt="Human Evaluation" src="https://github.com/IndoNLP/indo-t0/assets/2826602/6128257f-d36c-4dbb-8f6c-4b936bc2ea66">
80
+
81
+
82
+ ## Ethical Considerations and Limitations
83
+ Cendol is a new technology that carries risks with its use. Testing conducted to date has been in Indonesian and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Cendol’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of Cendol, developers should perform safety testing and tuning tailored to their specific applications of the model.
84
+
85
+ ## Citation
86
+ If you are using any resources including Cendol models, code, or data, please cite the following articles:
87
+ ```
88
+ @misc{cahyawijaya-etal-2024-cendol,
89
+ title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
90
+ author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
91
+ year={2024},
92
+ eprint={2404.06138},
93
+ archivePrefix={arXiv},
94
+ primaryClass={cs.CL}
95
+ }
96
+
97
+ @inproceedings{cahyawijaya-etal-2023-nusacrowd,
98
+ title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
99
+ author = "Cahyawijaya, Samuel and
100
+ Lovenia, Holy and
101
+ Aji, Alham Fikri and
102
+ Winata, Genta and
103
+ Wilie, Bryan and
104
+ Koto, Fajri and
105
+ Mahendra, Rahmad and
106
+ Wibisono, Christian and
107
+ Romadhony, Ade and
108
+ Vincentio, Karissa and
109
+ Santoso, Jennifer and
110
+ Moeljadi, David and
111
+ Wirawan, Cahya and
112
+ Hudi, Frederikus and
113
+ Wicaksono, Muhammad Satrio and
114
+ Parmonangan, Ivan and
115
+ Alfina, Ika and
116
+ Putra, Ilham Firdausi and
117
+ Rahmadani, Samsul and
118
+ Oenang, Yulianti and
119
+ Septiandri, Ali and
120
+ Jaya, James and
121
+ Dhole, Kaustubh and
122
+ Suryani, Arie and
123
+ Putri, Rifki Afina and
124
+ Su, Dan and
125
+ Stevens, Keith and
126
+ Nityasya, Made Nindyatama and
127
+ Adilazuarda, Muhammad and
128
+ Hadiwijaya, Ryan and
129
+ Diandaru, Ryandito and
130
+ Yu, Tiezheng and
131
+ Ghifari, Vito and
132
+ Dai, Wenliang and
133
+ Xu, Yan and
134
+ Damapuspita, Dyah and
135
+ Wibowo, Haryo and
136
+ Tho, Cuk and
137
+ Karo Karo, Ichwanul and
138
+ Fatyanosa, Tirana and
139
+ Ji, Ziwei and
140
+ Neubig, Graham and
141
+ Baldwin, Timothy and
142
+ Ruder, Sebastian and
143
+ Fung, Pascale and
144
+ Sujaini, Herry and
145
+ Sakti, Sakriani and
146
+ Purwarianti, Ayu",
147
+ editor = "Rogers, Anna and
148
+ Boyd-Graber, Jordan and
149
+ Okazaki, Naoaki",
150
+ booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
151
+ month = jul,
152
+ year = "2023",
153
+ address = "Toronto, Canada",
154
+ publisher = "Association for Computational Linguistics",
155
+ url = "https://aclanthology.org/2023.findings-acl.868",
156
+ doi = "10.18653/v1/2023.findings-acl.868",
157
+ pages = "13745--13818"
158
+ }
159
+ ```
160
+
161
+ Additionally, if you are inspired by our work on region-specific language models especially for Indonesian and its local languages, please also consider citing the following articles:
162
+ ```
163
+ @inproceedings{cahyawijaya-etal-2023-nusawrites,
164
+ title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",
165
+ author = "Cahyawijaya, Samuel and
166
+ Lovenia, Holy and
167
+ Koto, Fajri and
168
+ Adhista, Dea and
169
+ Dave, Emmanuel and
170
+ Oktavianti, Sarah and
171
+ Akbar, Salsabil and
172
+ Lee, Jhonson and
173
+ Shadieq, Nuur and
174
+ Cenggoro, Tjeng Wawan and
175
+ Linuwih, Hanung and
176
+ Wilie, Bryan and
177
+ Muridan, Galih and
178
+ Winata, Genta and
179
+ Moeljadi, David and
180
+ Aji, Alham Fikri and
181
+ Purwarianti, Ayu and
182
+ Fung, Pascale",
183
+ editor = "Park, Jong C. and
184
+ Arase, Yuki and
185
+ Hu, Baotian and
186
+ Lu, Wei and
187
+ Wijaya, Derry and
188
+ Purwarianti, Ayu and
189
+ Krisnadhi, Adila Alfa",
190
+ booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
191
+ month = nov,
192
+ year = "2023",
193
+ address = "Nusa Dua, Bali",
194
+ publisher = "Association for Computational Linguistics",
195
+ url = "https://aclanthology.org/2023.ijcnlp-main.60",
196
+ doi = "10.18653/v1/2023.ijcnlp-main.60",
197
+ pages = "921--945"
198
+ }
199
+
200
+ @inproceedings{winata-etal-2023-nusax,
201
+ title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
202
+ author = "Winata, Genta Indra and
203
+ Aji, Alham Fikri and
204
+ Cahyawijaya, Samuel and
205
+ Mahendra, Rahmad and
206
+ Koto, Fajri and
207
+ Romadhony, Ade and
208
+ Kurniawan, Kemal and
209
+ Moeljadi, David and
210
+ Prasojo, Radityo Eko and
211
+ Fung, Pascale and
212
+ Baldwin, Timothy and
213
+ Lau, Jey Han and
214
+ Sennrich, Rico and
215
+ Ruder, Sebastian",
216
+ editor = "Vlachos, Andreas and
217
+ Augenstein, Isabelle",
218
+ booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
219
+ month = may,
220
+ year = "2023",
221
+ address = "Dubrovnik, Croatia",
222
+ publisher = "Association for Computational Linguistics",
223
+ url = "https://aclanthology.org/2023.eacl-main.57",
224
+ doi = "10.18653/v1/2023.eacl-main.57",
225
+ pages = "815--834"
226
+ }
227
+
228
+ @inproceedings{aji-etal-2022-one,
229
+ title = "One Country, 700+ Languages: {NLP} Challenges for Underrepresented Languages and Dialects in {I}ndonesia",
230
+ author = "Aji, Alham Fikri and
231
+ Winata, Genta Indra and
232
+ Koto, Fajri and
233
+ Cahyawijaya, Samuel and
234
+ Romadhony, Ade and
235
+ Mahendra, Rahmad and
236
+ Kurniawan, Kemal and
237
+ Moeljadi, David and
238
+ Prasojo, Radityo Eko and
239
+ Baldwin, Timothy and
240
+ Lau, Jey Han and
241
+ Ruder, Sebastian",
242
+ editor = "Muresan, Smaranda and
243
+ Nakov, Preslav and
244
+ Villavicencio, Aline",
245
+ booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
246
+ month = may,
247
+ year = "2022",
248
+ address = "Dublin, Ireland",
249
+ publisher = "Association for Computational Linguistics",
250
+ url = "https://aclanthology.org/2022.acl-long.500",
251
+ doi = "10.18653/v1/2022.acl-long.500",
252
+ pages = "7226--7249"
253
+ }
254
+
255
+ @inproceedings{cahyawijaya-etal-2021-indonlg,
256
+ title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
257
+ author = "Cahyawijaya, Samuel and
258
+ Winata, Genta Indra and
259
+ Wilie, Bryan and
260
+ Vincentio, Karissa and
261
+ Li, Xiaohong and
262
+ Kuncoro, Adhiguna and
263
+ Ruder, Sebastian and
264
+ Lim, Zhi Yuan and
265
+ Bahar, Syafri and
266
+ Khodra, Masayu and
267
+ Purwarianti, Ayu and
268
+ Fung, Pascale",
269
+ editor = "Moens, Marie-Francine and
270
+ Huang, Xuanjing and
271
+ Specia, Lucia and
272
+ Yih, Scott Wen-tau",
273
+ booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
274
+ month = nov,
275
+ year = "2021",
276
+ address = "Online and Punta Cana, Dominican Republic",
277
+ publisher = "Association for Computational Linguistics",
278
+ url = "https://aclanthology.org/2021.emnlp-main.699",
279
+ doi = "10.18653/v1/2021.emnlp-main.699",
280
+ pages = "8875--8898"
281
+ }
282
+
283
+ @inproceedings{wilie-etal-2020-indonlu,
284
+ title = "{I}ndo{NLU}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Understanding",
285
+ author = "Wilie, Bryan and
286
+ Vincentio, Karissa and
287
+ Winata, Genta Indra and
288
+ Cahyawijaya, Samuel and
289
+ Li, Xiaohong and
290
+ Lim, Zhi Yuan and
291
+ Soleman, Sidik and
292
+ Mahendra, Rahmad and
293
+ Fung, Pascale and
294
+ Bahar, Syafri and
295
+ Purwarianti, Ayu",
296
+ editor = "Wong, Kam-Fai and
297
+ Knight, Kevin and
298
+ Wu, Hua",
299
+ booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
300
+ month = dec,
301
+ year = "2020",
302
+ address = "Suzhou, China",
303
+ publisher = "Association for Computational Linguistics",
304
+ url = "https://aclanthology.org/2020.aacl-main.85",
305
+ pages = "843--857"
306
+ }
307
+ ```