File size: 6,151 Bytes
5e2f5dd 9b9680e 5e2f5dd d65d330 5e2f5dd bf0e818 5e2f5dd |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 |
---
license: cc-by-nc-4.0
datasets:
- slone/nllb-200-10M-sample
pipeline_tag: translation
language:
- ak # aka_Latn Akan
- am # amh_Ethi Amharic
- ar # arb_Arab Modern Standard Arabic
- awa # awa_Deva Awadhi
- azj # azj_Latn North Azerbaijani
- bm # bam_Latn Bambara
- ban # ban_Latn Balinese
- be # bel_Cyrl Belarusian
- bem # bem_Latn Bemba
- bn # ben_Beng Bengali
- bho # bho_Deva Bhojpuri
- bjn # bjn_Latn Banjar (Latin script)
- bug # bug_Latn Buginese
- bg # bul_Cyrl Bulgarian
- ca # cat_Latn Catalan
- ceb # ceb_Latn Cebuano
- cs # ces_Latn Czech
- cjk # cjk_Latn Chokwe
- ckb # ckb_Arab Central Kurdish
- crh # crh_Latn Crimean Tatar
- da # dan_Latn Danish
- de # deu_Latn German
- dik # dik_Latn Southwestern Dinka
- dyu # dyu_Latn Dyula
- el # ell_Grek Greek
- en # eng_Latn English
- eo # epo_Latn Esperanto
- et # est_Latn Estonian
- ee # ewe_Latn Ewe
- fo # fao_Latn Faroese
- fj # fij_Latn Fijian
- fi # fin_Latn Finnish
- fon # fon_Latn Fon
- fr # fra_Latn French
- fur # fur_Latn Friulian
- ff # fuv_Latn Nigerian Fulfulde
- gaz # gaz_Latn West Central Oromo
- gd # gla_Latn Scottish Gaelic
- ga # gle_Latn Irish
- gl # glg_Latn Galician
- gn # grn_Latn Guarani
- gu # guj_Gujr Gujarati
- ht # hat_Latn Haitian Creole
- ha # hau_Latn Hausa
- he # heb_Hebr Hebrew
- hi # hin_Deva Hindi
- hne # hne_Deva Chhattisgarhi
- hr # hrv_Latn Croatian
- hu # hun_Latn Hungarian
- hy # hye_Armn Armenian
- ig # ibo_Latn Igbo
- ilo # ilo_Latn Ilocano
- id # ind_Latn Indonesian
- is # isl_Latn Icelandic
- it # ita_Latn Italian
- jv # jav_Latn Javanese
- ja # jpn_Jpan Japanese
- kab # kab_Latn Kabyle
- kac # kac_Latn Jingpho
- kam # kam_Latn Kamba
- kn # kan_Knda Kannada
- ks # kas_Arab Kashmiri (Arabic script)
- ks # kas_Deva Kashmiri (Devanagari script)
- ka # kat_Geor Georgian
- kk # kaz_Cyrl Kazakh
- kbp # kbp_Latn Kabiyè
- kea # kea_Latn Kabuverdianu
- mn # khk_Cyrl Halh Mongolian
- km # khm_Khmr Khmer
- ki # kik_Latn Kikuyu
- rw # kin_Latn Kinyarwanda
- ky # kir_Cyrl Kyrgyz
- kmb # kmb_Latn Kimbundu
- kmr # kmr_Latn Northern Kurdish
- kr # knc_Arab Central Kanuri (Arabic script)
- kr # knc_Latn Central Kanuri (Latin script)
- kg # kon_Latn Kikongo
- ko # kor_Hang Korean
- lo # lao_Laoo Lao
- lij # lij_Latn Ligurian
- li # lim_Latn Limburgish
- ln # lin_Latn Lingala
- lt # lit_Latn Lithuanian
- lmo # lmo_Latn Lombard
- ltg # ltg_Latn Latgalian
- lb # ltz_Latn Luxembourgish
- lua # lua_Latn Luba-Kasai
- lg # lug_Latn Ganda
- luo # luo_Latn Luo
- lus # lus_Latn Mizo
- lv # lvs_Latn Standard Latvian
- mag # mag_Deva Magahi
- mai # mai_Deva Maithili
- ml # mal_Mlym Malayalam
- mr # mar_Deva Marathi
- min # min_Latn Minangkabau (Latin script)
- mk # mkd_Cyrl Macedonian
- mt # mlt_Latn Maltese
- mni # mni_Beng Meitei (Bengali script)
- mos # mos_Latn Mossi
- mi # mri_Latn Maori
- my # mya_Mymr Burmese
- nl # nld_Latn Dutch
- nb # nob_Latn Norwegian Bokmål
- ne # npi_Deva Nepali
- nso # nso_Latn Northern Sotho
- nus # nus_Latn Nuer
- ny # nya_Latn Nyanja
- oc # oci_Latn Occitan
- ory # ory_Orya Odia
- pag # pag_Latn Pangasinan
- pa # pan_Guru Eastern Panjabi
- pap # pap_Latn Papiamento
- pbt # pbt_Arab Southern Pashto
- fa # pes_Arab Western Persian
- plt # plt_Latn Plateau Malagasy
- pl # pol_Latn Polish
- pt # por_Latn Portuguese
- prs # prs_Arab Dari
- qu # quy_Latn Ayacucho Quechua
- ro # ron_Latn Romanian
- rn # run_Latn Rundi
- ru # rus_Cyrl Russian
- sg # sag_Latn Sango
- sa # san_Deva Sanskrit
- sat # sat_Beng ?
- scn # scn_Latn Sicilian
- shn # shn_Mymr Shan
- si # sin_Sinh Sinhala
- sk # slk_Latn Slovak
- sl # slv_Latn Slovenian
- sm # smo_Latn Samoan
- sn # sna_Latn Shona
- sd # snd_Arab Sindhi
- so # som_Latn Somali
- st # sot_Latn Southern Sotho
- es # spa_Latn Spanish
- sc # srd_Latn Sardinian
- sr # srp_Cyrl Serbian
- ss # ssw_Latn Swati
- su # sun_Latn Sundanese
- sv # swe_Latn Swedish
- sw # swh_Latn Swahili
- szl # szl_Latn Silesian
- ta # tam_Taml Tamil
- taq # taq_Latn Tamasheq (Latin script)
- tt # tat_Cyrl Tatar
- te # tel_Telu Telugu
- tg # tgk_Cyrl Tajik
- tl # tgl_Latn Tagalog
- ti # tir_Ethi Tigrinya
- tpi # tpi_Latn Tok Pisin
- tn # tsn_Latn Tswana
- ts # tso_Latn Tsonga
- tk # tuk_Latn Turkmen
- tum # tum_Latn Tumbuka
- tr # tur_Latn Turkish
- tw # twi_Latn Twi
- tzm # tzm_Tfng Central Atlas Tamazight
- ug # uig_Arab Uyghur
- uk # ukr_Cyrl Ukrainian
- umb # umb_Latn Umbundu
- ur # urd_Arab Urdu
- uz # uzn_Latn Northern Uzbek
- vec # vec_Latn Venetian
- vi # vie_Latn Vietnamese
- war # war_Latn Waray
- wo # wol_Latn Wolof
- xh # xho_Latn Xhosa
- yi # ydd_Hebr Eastern Yiddish
- yo # yor_Latn Yoruba
- zh # zho_Hans Chinese (Simplified)
- zh # zho_Hant Chinese (Traditional)
- ms # zsm_Latn Standard Malay
- zu # zul_Latn Zulu
---
It is a truncated version of [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model
(6 layers instead of 12, 512 hidden dimensions instead of 1024) with 175M parameters (131M of which are token embeddings).
This model was fine-tuned on the [slone/nllb-200-10M-sample](https://huggingface.co/datasets/slone/nllb-200-10M-sample) subset of
the [NLLB dataset](https://huggingface.co/datasets/allenai/nllb) with 175 languages, using only the samples with BLASER score above 3.5.
Because of its small size, it is really bad at translation, but can serve as a base model for further fine-tuning for a small number of languages.
It is recommended to prune the vocabulary of this model before fine-tuning, to preserve only the tokens used with the intended languages. |