---
license: cc-by-nc-4.0
datasets:
- slone/nllb-200-10M-sample
pipeline_tag: translation
language:
- ak # aka_Latn Akan
- am # amh_Ethi Amharic
- ar # arb_Arab Modern Standard Arabic
- awa # awa_Deva Awadhi
- azj # azj_Latn North Azerbaijani
- bm # bam_Latn Bambara
- ban # ban_Latn Balinese
- be # bel_Cyrl Belarusian
- bem # bem_Latn Bemba
- bn # ben_Beng Bengali
- bho # bho_Deva Bhojpuri
- bjn # bjn_Latn Banjar (Latin script)
- bug # bug_Latn Buginese
- bg # bul_Cyrl Bulgarian
- ca # cat_Latn Catalan
- ceb # ceb_Latn Cebuano
- cs # ces_Latn Czech
- cjk # cjk_Latn Chokwe
- ckb # ckb_Arab Central Kurdish
- crh # crh_Latn Crimean Tatar
- da # dan_Latn Danish
- de # deu_Latn German
- dik # dik_Latn Southwestern Dinka
- dyu # dyu_Latn Dyula
- el # ell_Grek Greek
- en # eng_Latn English
- eo # epo_Latn Esperanto
- et # est_Latn Estonian
- ee # ewe_Latn Ewe
- fo # fao_Latn Faroese
- fj # fij_Latn Fijian
- fi # fin_Latn Finnish
- fon # fon_Latn Fon
- fr # fra_Latn French
- fur # fur_Latn Friulian
- ff # fuv_Latn Nigerian Fulfulde
- gaz # gaz_Latn West Central Oromo
- gd # gla_Latn Scottish Gaelic
- ga # gle_Latn Irish
- gl # glg_Latn Galician
- gn # grn_Latn Guarani
- gu # guj_Gujr Gujarati
- ht # hat_Latn Haitian Creole
- ha # hau_Latn Hausa
- he # heb_Hebr Hebrew
- hi # hin_Deva Hindi
- hne # hne_Deva Chhattisgarhi
- hr # hrv_Latn Croatian
- hu # hun_Latn Hungarian
- hy # hye_Armn Armenian
- ig # ibo_Latn Igbo
- ilo # ilo_Latn Ilocano
- id # ind_Latn Indonesian
- is # isl_Latn Icelandic
- it # ita_Latn Italian
- jv # jav_Latn Javanese
- ja # jpn_Jpan Japanese
- kab # kab_Latn Kabyle
- kac # kac_Latn Jingpho
- kam # kam_Latn Kamba
- kn # kan_Knda Kannada
- ks # kas_Arab Kashmiri (Arabic script)
- ks # kas_Deva Kashmiri (Devanagari script)
- ka # kat_Geor Georgian
- kk # kaz_Cyrl Kazakh
- kbp # kbp_Latn Kabiyè
- kea # kea_Latn Kabuverdianu
- mn # khk_Cyrl Halh Mongolian
- km # khm_Khmr Khmer
- ki # kik_Latn Kikuyu
- rw # kin_Latn Kinyarwanda
- ky # kir_Cyrl Kyrgyz
- kmb # kmb_Latn Kimbundu
- kmr # kmr_Latn Northern Kurdish
- kr # knc_Arab Central Kanuri (Arabic script)
- kr # knc_Latn Central Kanuri (Latin script)
- kg # kon_Latn Kikongo
- ko # kor_Hang Korean
- lo # lao_Laoo Lao
- lij # lij_Latn Ligurian
- li # lim_Latn Limburgish
- ln # lin_Latn Lingala
- lt # lit_Latn Lithuanian
- lmo # lmo_Latn Lombard
- ltg # ltg_Latn Latgalian
- lb # ltz_Latn Luxembourgish
- lua # lua_Latn Luba-Kasai
- lg # lug_Latn Ganda
- luo # luo_Latn Luo
- lus # lus_Latn Mizo
- lv # lvs_Latn Standard Latvian
- mag # mag_Deva Magahi
- mai # mai_Deva Maithili
- ml # mal_Mlym Malayalam
- mr # mar_Deva Marathi
- min # min_Latn Minangkabau (Latin script)
- mk # mkd_Cyrl Macedonian
- mt # mlt_Latn Maltese
- mni # mni_Beng Meitei (Bengali script)
- mos # mos_Latn Mossi
- mi # mri_Latn Maori
- my # mya_Mymr Burmese
- nl # nld_Latn Dutch
- nb # nob_Latn Norwegian Bokmål
- ne # npi_Deva Nepali
- nso # nso_Latn Northern Sotho
- nus # nus_Latn Nuer
- ny # nya_Latn Nyanja
- oc # oci_Latn Occitan
- ory # ory_Orya Odia
- pag # pag_Latn Pangasinan
- pa # pan_Guru Eastern Panjabi
- pap # pap_Latn Papiamento
- pbt # pbt_Arab Southern Pashto
- fa # pes_Arab Western Persian
- plt # plt_Latn Plateau Malagasy
- pl # pol_Latn Polish
- pt # por_Latn Portuguese
- prs # prs_Arab Dari
- qu # quy_Latn Ayacucho Quechua
- ro # ron_Latn Romanian
- rn # run_Latn Rundi
- ru # rus_Cyrl Russian
- sg # sag_Latn Sango
- sa # san_Deva Sanskrit
- sat # sat_Beng Santali
- scn # scn_Latn Sicilian
- shn # shn_Mymr Shan
- si # sin_Sinh Sinhala
- sk # slk_Latn Slovak
- sl # slv_Latn Slovenian
- sm # smo_Latn Samoan
- sn # sna_Latn Shona
- sd # snd_Arab Sindhi
- so # som_Latn Somali
- st # sot_Latn Southern Sotho
- es # spa_Latn Spanish
- sc # srd_Latn Sardinian
- sr # srp_Cyrl Serbian
- ss # ssw_Latn Swati
- su # sun_Latn Sundanese
- sv # swe_Latn Swedish
- sw # swh_Latn Swahili
- szl # szl_Latn Silesian
- ta # tam_Taml Tamil
- taq # taq_Latn Tamasheq (Latin script)
- tt # tat_Cyrl Tatar
- te # tel_Telu Telugu
- tg # tgk_Cyrl Tajik
- tl # tgl_Latn Tagalog
- ti # tir_Ethi Tigrinya
- tpi # tpi_Latn Tok Pisin
- tn # tsn_Latn Tswana
- ts # tso_Latn Tsonga
- tk # tuk_Latn Turkmen
- tum # tum_Latn Tumbuka
- tr # tur_Latn Turkish
- tw # twi_Latn Twi
- tzm # tzm_Tfng Central Atlas Tamazight
- ug # uig_Arab Uyghur
- uk # ukr_Cyrl Ukrainian
- umb # umb_Latn Umbundu
- ur # urd_Arab Urdu
- uz # uzn_Latn Northern Uzbek
- vec # vec_Latn Venetian
- vi # vie_Latn Vietnamese
- war # war_Latn Waray
- wo # wol_Latn Wolof
- xh # xho_Latn Xhosa
- yi # ydd_Hebr Eastern Yiddish
- yo # yor_Latn Yoruba
- zh # zho_Hans Chinese (Simplified)
- zh # zho_Hant Chinese (Traditional)
- ms # zsm_Latn Standard Malay
- zu # zul_Latn Zulu
---
This is a truncated version of the [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model
(6 layers instead of 12, a hidden size of 512 instead of 1024) with 175M parameters (131M of which are token embeddings).
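
The embedding share can be checked with a rough back-of-the-envelope calculation. The sketch below assumes a vocabulary of 256,206 tokens (the size believed to be used by the parent NLLB-200 checkpoint; verify it against the actual config) and a single embedding matrix shared between input and output; both are assumptions, not facts stated in this card.

```python
# Rough parameter arithmetic for the token embeddings.
# ASSUMPTION: vocabulary size of the parent NLLB-200 checkpoint; check the real config.
VOCAB_SIZE = 256_206
# Hidden size of this truncated model, as stated in the card.
HIDDEN_SIZE = 512
# Approximate total parameter count, as stated in the card.
TOTAL_PARAMS = 175_000_000


def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """One embedding vector of length hidden_size per vocabulary entry."""
    return vocab_size * hidden_size


embedding_params = embedding_parameters(VOCAB_SIZE, HIDDEN_SIZE)
embedding_share = embedding_params / TOTAL_PARAMS

print(f"token embeddings: {embedding_params / 1e6:.1f}M parameters")
print(f"share of the model: {embedding_share:.0%}")
```

Under these assumptions, roughly three quarters of the parameters sit in the embedding table, which is why trimming the vocabulary shrinks the model so much.
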
This model was fine-tuned on the [slone/nllb-200-10M-sample](https://huggingface.co/datasets/slone/nllb-200-10M-sample) subset of
the [NLLB dataset](https://huggingface.co/datasets/allenai/nllb) with 175 languages, using only the samples with a BLASER score above 3.5.
Because of its small size, it translates poorly on its own, but it can serve as a base model for further fine-tuning on a small number of languages.
It is recommended to [prune the vocabulary of this model](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90)
before fine-tuning, so that only the tokens used by the intended languages are kept.
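
The token-selection step of vocabulary pruning can be illustrated with a minimal, dependency-free sketch. It is not the exact procedure from the linked article: the helper names (`select_token_ids`, `build_id_mapping`, `prune_embedding_rows`), the `min_count` threshold, and the toy data are all hypothetical. In practice the corpus would be tokenized with the model's own tokenizer, and the tokenizer and the model's embedding matrices would have to be rebuilt consistently with the new IDs.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def select_token_ids(
    tokenized_corpus: Iterable[Sequence[int]],
    always_keep: Iterable[int],
    min_count: int = 1,
) -> List[int]:
    """Return the sorted token IDs to keep: special tokens plus frequent tokens."""
    counts = Counter(tid for sentence in tokenized_corpus for tid in sentence)
    kept = set(always_keep)
    kept.update(tid for tid, count in counts.items() if count >= min_count)
    return sorted(kept)


def build_id_mapping(kept_ids: Sequence[int]) -> Dict[int, int]:
    """Map each old token ID to its position in the pruned vocabulary."""
    return {old_id: new_id for new_id, old_id in enumerate(kept_ids)}


def prune_embedding_rows(
    matrix: Sequence[Sequence[float]], kept_ids: Sequence[int]
) -> List[List[float]]:
    """Keep only the embedding rows that belong to the selected token IDs."""
    return [list(matrix[old_id]) for old_id in kept_ids]


# Toy data (hypothetical): a 10-token vocabulary with 2-dimensional embeddings.
toy_corpus = [[5, 9, 9, 2], [5, 7, 2]]
special_ids = {0, 1, 2, 3}  # e.g. padding, unknown, end-of-sentence, language tags
toy_embeddings = [[float(i), float(i)] for i in range(10)]

kept_ids = select_token_ids(toy_corpus, special_ids, min_count=2)
id_mapping = build_id_mapping(kept_ids)
pruned_embeddings = prune_embedding_rows(toy_embeddings, kept_ids)

print(kept_ids)
print(len(pruned_embeddings), "rows kept out of", len(toy_embeddings))
```

The language-code tokens of the intended languages belong in `always_keep`, since they may never appear in the raw text yet are needed to control the translation direction.
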