
Model Description

This HF repository hosts an instruction fine-tuned multilingual BLOOM model trained on Bactrian-X, a parallel instruction dataset covering 52 languages. We add one language at a time during instruction fine-tuning, training 52 models in total, and then evaluate those models on three multilingual benchmarks.

Please refer to our paper (arXiv:2404.04850, cited below) for more details.

  • Base model: BLOOM 7B1
  • Instruction languages: English, Chinese, Afrikaans, Arabic, Azerbaijani, Bengali, Czech, German, Spanish, Estonian, Farsi, Finnish, French, Galician, Gujarati, Hebrew, Hindi, Croatian, Indonesian, Italian, Japanese, Georgian, Kazakh, Khmer, Korean, Lithuanian, Latvian, Macedonian, Malayalam, Mongolian, Marathi, Burmese, Nepali, Dutch, Polish, Pashto, Portuguese, Romanian, Russian, Sinhala, Slovenian, Swedish, Swahili, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Xhosa
  • Instruction language codes: en, zh, af, ar, az, bn, cs, de, es, et, fa, fi, fr, gl, gu, he, hi, hr, id, it, ja, ka, kk, km, ko, lt, lv, mk, ml, mn, mr, my, ne, nl, pl, ps, pt, ro, ru, si, sl, sv, sw, ta, te, th, tl, tr, uk, ur, vi, xh
  • Training method: full-parameter fine-tuning.
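The progressive setup described above can be sketched as follows. This is an illustration only: it assumes that languages are added in the order they are listed above, and that each checkpoint is published as `lucky52-bloom-7b1-no-<k>`, a pattern inferred from this repository's name rather than stated on this card. The helper names `languages_for_model` and `repo_id` are hypothetical.

```python
# Language codes copied from the "Instruction language codes" list above.
LANG_CODES = (
    "en zh af ar az bn cs de es et fa fi fr gl gu he hi hr id it ja ka kk "
    "km ko lt lv mk ml mn mr my ne nl pl ps pt ro ru si sl sv sw ta te th "
    "tl tr uk ur vi xh"
).split()


def languages_for_model(k):
    """Return the instruction languages of the k-th model (1-based).

    Assumption: languages are added one at a time in the listed order,
    so the k-th model is tuned on the first k languages.
    """
    if not 1 <= k <= len(LANG_CODES):
        raise ValueError(f"k must be between 1 and {len(LANG_CODES)}")
    return LANG_CODES[:k]


def repo_id(k):
    """Hypothetical repository name for the k-th model (inferred pattern)."""
    return f"MaLA-LM/lucky52-bloom-7b1-no-{k}"


print(repo_id(52), len(languages_for_model(52)))
```

Under these assumptions, the checkpoint in this repository (`no-52`) corresponds to the model tuned on all 52 languages.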

Usage

The model checkpoint should be loaded using the transformers library.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MaLA-LM/lucky52-bloom-7b1-no-52")
model = AutoModelForCausalLM.from_pretrained("MaLA-LM/lucky52-bloom-7b1-no-52")
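Once loaded, the model can be prompted with an instruction. The sketch below is a minimal example, not the authors' inference script. The prompt layout in `build_prompt` is an assumption, since this card does not specify the template used during fine-tuning, so it should be checked against the paper or the training data. The function names and the `max_new_tokens` default are likewise illustrative.

```python
MODEL_ID = "MaLA-LM/lucky52-bloom-7b1-no-52"


def build_prompt(instruction, input_text=""):
    """Format an instruction as a prompt.

    Assumption: a simple instruction/input/response layout. The template
    actually used for fine-tuning is not given on this card.
    """
    parts = [f"### Instruction:\n{instruction}"]
    if input_text:
        parts.append(f"### Input:\n{input_text}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


def generate(instruction, input_text="", max_new_tokens=128):
    """Load the checkpoint and return the generated continuation."""
    # Imported here so build_prompt can be used without loading the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Half precision on GPU to reduce memory; full precision on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype)
    model.to(device).eval()

    inputs = tokenizer(build_prompt(instruction, input_text), return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Translate to French: Good morning.")` downloads the 7B checkpoint on first use, so it needs enough disk space and memory to hold the weights.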

Citation

@misc{lucky52,
  title         = "Lucky 52: How Many Languages Are Needed to Instruction Fine-Tune Large Language Models?",
  author        = "Shaoxiong Ji and Pinzhen Chen",
  year          = "2024",
  eprint        = "2404.04850",
  archiveprefix = "arXiv",
  primaryclass  = "cs.CL"
}
