---
license: cc-by-nc-4.0
datasets:
- slone/nllb-200-10M-sample
pipeline_tag: translation
language:
- ak # aka_Latn Akan
- am # amh_Ethi Amharic
- ar # arb_Arab Modern Standard Arabic
- awa # awa_Deva Awadhi
- azj # azj_Latn North Azerbaijani
- bm # bam_Latn Bambara
- ban # ban_Latn Balinese
- be # bel_Cyrl Belarusian
- bem # bem_Latn Bemba
- bn # ben_Beng Bengali
- bho # bho_Deva Bhojpuri
- bjn # bjn_Latn Banjar (Latin script)
- bug # bug_Latn Buginese
- bg # bul_Cyrl Bulgarian
- ca # cat_Latn Catalan
- ceb # ceb_Latn Cebuano
- cs # ces_Latn Czech
- cjk # cjk_Latn Chokwe
- ckb # ckb_Arab Central Kurdish
- crh # crh_Latn Crimean Tatar
- da # dan_Latn Danish
- de # deu_Latn German
- dik # dik_Latn Southwestern Dinka
- dyu # dyu_Latn Dyula
- el # ell_Grek Greek
- en # eng_Latn English
- eo # epo_Latn Esperanto
- et # est_Latn Estonian
- ee # ewe_Latn Ewe
- fo # fao_Latn Faroese
- fj # fij_Latn Fijian
- fi # fin_Latn Finnish
- fon # fon_Latn Fon
- fr # fra_Latn French
- fur # fur_Latn Friulian
- ff # fuv_Latn Nigerian Fulfulde
- gaz # gaz_Latn West Central Oromo
- gd # gla_Latn Scottish Gaelic
- ga # gle_Latn Irish
- gl # glg_Latn Galician
- gn # grn_Latn Guarani
- gu # guj_Gujr Gujarati
- ht # hat_Latn Haitian Creole
- ha # hau_Latn Hausa
- he # heb_Hebr Hebrew
- hi # hin_Deva Hindi
- hne # hne_Deva Chhattisgarhi
- hr # hrv_Latn Croatian
- hu # hun_Latn Hungarian
- hy # hye_Armn Armenian
- ig # ibo_Latn Igbo
- ilo # ilo_Latn Ilocano
- id # ind_Latn Indonesian
- is # isl_Latn Icelandic
- it # ita_Latn Italian
- jv # jav_Latn Javanese
- ja # jpn_Jpan Japanese
- kab # kab_Latn Kabyle
- kac # kac_Latn Jingpho
- kam # kam_Latn Kamba
- kn # kan_Knda Kannada
- ks # kas_Arab Kashmiri (Arabic script)
- ks # kas_Deva Kashmiri (Devanagari script)
- ka # kat_Geor Georgian
- kk # kaz_Cyrl Kazakh
- kbp # kbp_Latn Kabiyè
- kea # kea_Latn Kabuverdianu
- mn # khk_Cyrl Halh Mongolian
- km # khm_Khmr Khmer
- ki # kik_Latn Kikuyu
- rw # kin_Latn Kinyarwanda
- ky # kir_Cyrl Kyrgyz
- kmb # kmb_Latn Kimbundu
- kmr # kmr_Latn Northern Kurdish
- kr # knc_Arab Central Kanuri (Arabic script)
- kr # knc_Latn Central Kanuri (Latin script)
- kg # kon_Latn Kikongo
- ko # kor_Hang Korean
- lo # lao_Laoo Lao
- lij # lij_Latn Ligurian
- li # lim_Latn Limburgish
- ln # lin_Latn Lingala
- lt # lit_Latn Lithuanian
- lmo # lmo_Latn Lombard
- ltg # ltg_Latn Latgalian
- lb # ltz_Latn Luxembourgish
- lua # lua_Latn Luba-Kasai
- lg # lug_Latn Ganda
- luo # luo_Latn Luo
- lus # lus_Latn Mizo
- lv # lvs_Latn Standard Latvian
- mag # mag_Deva Magahi
- mai # mai_Deva Maithili
- ml # mal_Mlym Malayalam
- mr # mar_Deva Marathi
- min # min_Latn Minangkabau (Latin script)
- mk # mkd_Cyrl Macedonian
- mt # mlt_Latn Maltese
- mni # mni_Beng Meitei (Bengali script)
- mos # mos_Latn Mossi
- mi # mri_Latn Maori
- my # mya_Mymr Burmese
- nl # nld_Latn Dutch
- nb # nob_Latn Norwegian Bokmål
- ne # npi_Deva Nepali
- nso # nso_Latn Northern Sotho
- nus # nus_Latn Nuer
- ny # nya_Latn Nyanja
- oc # oci_Latn Occitan
- ory # ory_Orya Odia
- pag # pag_Latn Pangasinan
- pa # pan_Guru Eastern Panjabi
- pap # pap_Latn Papiamento
- pbt # pbt_Arab Southern Pashto
- fa # pes_Arab Western Persian
- plt # plt_Latn Plateau Malagasy
- pl # pol_Latn Polish
- pt # por_Latn Portuguese
- prs # prs_Arab Dari
- qu # quy_Latn Ayacucho Quechua
- ro # ron_Latn Romanian
- rn # run_Latn Rundi
- ru # rus_Cyrl Russian
- sg # sag_Latn Sango
- sa # san_Deva Sanskrit
- sat # sat_Beng ?
- scn # scn_Latn Sicilian
- shn # shn_Mymr Shan
- si # sin_Sinh Sinhala
- sk # slk_Latn Slovak
- sl # slv_Latn Slovenian
- sm # smo_Latn Samoan
- sn # sna_Latn Shona
- sd # snd_Arab Sindhi
- so # som_Latn Somali
- st # sot_Latn Southern Sotho
- es # spa_Latn Spanish
- sc # srd_Latn Sardinian
- sr # srp_Cyrl Serbian
- ss # ssw_Latn Swati
- su # sun_Latn Sundanese
- sv # swe_Latn Swedish
- sw # swh_Latn Swahili
- szl # szl_Latn Silesian
- ta # tam_Taml Tamil
- taq # taq_Latn Tamasheq (Latin script)
- tt # tat_Cyrl Tatar
- te # tel_Telu Telugu
- tg # tgk_Cyrl Tajik
- tl # tgl_Latn Tagalog
- ti # tir_Ethi Tigrinya
- tpi # tpi_Latn Tok Pisin
- tn # tsn_Latn Tswana
- ts # tso_Latn Tsonga
- tk # tuk_Latn Turkmen
- tum # tum_Latn Tumbuka
- tr # tur_Latn Turkish
- tw # twi_Latn Twi
- tzm # tzm_Tfng Central Atlas Tamazight
- ug # uig_Arab Uyghur
- uk # ukr_Cyrl Ukrainian
- umb # umb_Latn Umbundu
- ur # urd_Arab Urdu
- uz # uzn_Latn Northern Uzbek
- vec # vec_Latn Venetian
- vi # vie_Latn Vietnamese
- war # war_Latn Waray
- wo # wol_Latn Wolof
- xh # xho_Latn Xhosa
- yi # ydd_Hebr Eastern Yiddish
- yo # yor_Latn Yoruba
- zh # zho_Hans Chinese (Simplified)
- zh # zho_Hant Chinese (Traditional)
- ms # zsm_Latn Standard Malay
- zu # zul_Latn Zulu
---

This is a truncated version of the [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model (6 layers instead of 12, a hidden dimension of 512 instead of 1024) with 175M parameters (131M of which are token embeddings).

This model was fine-tuned on the [slone/nllb-200-10M-sample](https://huggingface.co/datasets/slone/nllb-200-10M-sample) subset of the [NLLB dataset](https://huggingface.co/datasets/allenai/nllb) with 175 languages, using only the samples with a BLASER score above 3.5.

Because of its small size, it is really bad at translation out of the box, but it can serve as a base model for further fine-tuning on a small number of languages.
It is recommended to [prune the vocabulary of this model](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90) before fine-tuning, to preserve only the tokens used with the intended languages.
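
To make the pruning idea concrete, here is a minimal sketch in plain Python of the bookkeeping involved: count which token ids occur in a tokenized sample of the target languages, always keep special tokens and language codes, and build an old-to-new id mapping that can then be used to select rows of the embedding matrix. It is not the procedure from the linked article; the function names (`select_kept_ids`, `build_id_mapping`, `prune_rows`, `embedding_params`) are hypothetical helpers, the toy ids are made up, and the vocabulary size of 256,206 is an assumption taken from the parent NLLB-200 configuration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def select_kept_ids(
    tokenized_corpus: Iterable[Sequence[int]],
    always_keep: Iterable[int] = (),
    min_count: int = 1,
) -> List[int]:
    """Return the sorted old token ids that should survive pruning."""
    # Count how often each token id occurs in the sample corpus.
    counts = Counter(tok for seq in tokenized_corpus for tok in seq)
    kept = {tok for tok, n in counts.items() if n >= min_count}
    # Special tokens and language codes are kept even if they never occur.
    kept.update(always_keep)
    return sorted(kept)


def build_id_mapping(kept_ids: Sequence[int]) -> Dict[int, int]:
    """Map each kept old id to its new, contiguous id."""
    return {old: new for new, old in enumerate(kept_ids)}


def prune_rows(matrix: Sequence[Sequence[float]], kept_ids: Sequence[int]) -> List[List[float]]:
    """Keep only the embedding rows of the kept ids, in their new order."""
    return [list(matrix[i]) for i in kept_ids]


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


if __name__ == "__main__":
    # Toy corpus of token ids; a real one would come from the model's tokenizer.
    corpus = [[5, 9, 9, 2], [5, 7, 2]]
    kept = select_kept_ids(corpus, always_keep=[0, 1, 2, 3])
    mapping = build_id_mapping(kept)
    print("kept ids:", kept)
    print("old -> new:", mapping)

    # Toy 10 x 2 embedding matrix: row i is [i, i].
    toy_matrix = [[float(i), float(i)] for i in range(10)]
    print("pruned rows:", prune_rows(toy_matrix, kept))

    # Assumed vocabulary size (256,206) times the hidden dimension (512).
    print("full embedding params:", embedding_params(256_206, 512))
    print("pruned embedding params:", embedding_params(len(kept), 512))
```

Under the assumed vocabulary size, the embedding matrix accounts for roughly 131M parameters, which matches the figure quoted above, so keeping only the tokens of a few languages removes most of the model's size. In a real setup the same kept-id list would be applied to the model's embedding weights and to the tokenizer vocabulary together, so that both stay in sync.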