---
license: cc-by-nc-4.0
datasets:
- slone/nllb-200-10M-sample
pipeline_tag: translation
language:
  - ak   # aka_Latn Akan
  - am   # amh_Ethi Amharic
  - ar   # arb_Arab Modern Standard Arabic
  - awa  # awa_Deva Awadhi
  - azj  # azj_Latn North Azerbaijani
  - bm   # bam_Latn Bambara
  - ban  # ban_Latn Balinese
  - be   # bel_Cyrl Belarusian
  - bem  # bem_Latn Bemba
  - bn   # ben_Beng Bengali
  - bho  # bho_Deva Bhojpuri
  - bjn  # bjn_Latn Banjar (Latin script)
  - bug  # bug_Latn Buginese
  - bg   # bul_Cyrl Bulgarian
  - ca   # cat_Latn Catalan
  - ceb  # ceb_Latn Cebuano
  - cs   # ces_Latn Czech
  - cjk  # cjk_Latn Chokwe
  - ckb  # ckb_Arab Central Kurdish
  - crh  # crh_Latn Crimean Tatar
  - da   # dan_Latn Danish
  - de   # deu_Latn German
  - dik  # dik_Latn Southwestern Dinka
  - dyu  # dyu_Latn Dyula
  - el   # ell_Grek Greek
  - en   # eng_Latn English
  - eo   # epo_Latn Esperanto
  - et   # est_Latn Estonian
  - ee   # ewe_Latn Ewe
  - fo   # fao_Latn Faroese
  - fj   # fij_Latn Fijian
  - fi   # fin_Latn Finnish
  - fon  # fon_Latn Fon
  - fr   # fra_Latn French
  - fur  # fur_Latn Friulian
  - ff   # fuv_Latn Nigerian Fulfulde
  - gaz  # gaz_Latn West Central Oromo
  - gd   # gla_Latn Scottish Gaelic
  - ga   # gle_Latn Irish
  - gl   # glg_Latn Galician
  - gn   # grn_Latn Guarani
  - gu   # guj_Gujr Gujarati
  - ht   # hat_Latn Haitian Creole
  - ha   # hau_Latn Hausa
  - he   # heb_Hebr Hebrew
  - hi   # hin_Deva Hindi
  - hne  # hne_Deva Chhattisgarhi
  - hr   # hrv_Latn Croatian
  - hu   # hun_Latn Hungarian
  - hy   # hye_Armn Armenian
  - ig   # ibo_Latn Igbo
  - ilo  # ilo_Latn Ilocano
  - id   # ind_Latn Indonesian
  - is   # isl_Latn Icelandic
  - it   # ita_Latn Italian
  - jv   # jav_Latn Javanese
  - ja   # jpn_Jpan Japanese
  - kab  # kab_Latn Kabyle
  - kac  # kac_Latn Jingpho
  - kam  # kam_Latn Kamba
  - kn   # kan_Knda Kannada
  - ks   # kas_Arab Kashmiri (Arabic script)
  - ks   # kas_Deva Kashmiri (Devanagari script)
  - ka   # kat_Geor Georgian
  - kk   # kaz_Cyrl Kazakh
  - kbp  # kbp_Latn Kabiyè
  - kea  # kea_Latn Kabuverdianu
  - mn   # khk_Cyrl Halh Mongolian
  - km   # khm_Khmr Khmer
  - ki   # kik_Latn Kikuyu
  - rw   # kin_Latn Kinyarwanda
  - ky   # kir_Cyrl Kyrgyz
  - kmb  # kmb_Latn Kimbundu
  - kmr  # kmr_Latn Northern Kurdish
  - kr   # knc_Arab Central Kanuri (Arabic script)
  - kr   # knc_Latn Central Kanuri (Latin script)
  - kg   # kon_Latn Kikongo
  - ko   # kor_Hang Korean
  - lo   # lao_Laoo Lao
  - lij  # lij_Latn Ligurian
  - li   # lim_Latn Limburgish
  - ln   # lin_Latn Lingala
  - lt   # lit_Latn Lithuanian
  - lmo  # lmo_Latn Lombard
  - ltg  # ltg_Latn Latgalian
  - lb   # ltz_Latn Luxembourgish
  - lua  # lua_Latn Luba-Kasai
  - lg   # lug_Latn Ganda
  - luo  # luo_Latn Luo
  - lus  # lus_Latn Mizo
  - lv   # lvs_Latn Standard Latvian
  - mag  # mag_Deva Magahi
  - mai  # mai_Deva Maithili
  - ml   # mal_Mlym Malayalam
  - mr   # mar_Deva Marathi
  - min  # min_Latn Minangkabau (Latin script)
  - mk   # mkd_Cyrl Macedonian
  - mt   # mlt_Latn Maltese
  - mni  # mni_Beng Meitei (Bengali script)
  - mos  # mos_Latn Mossi
  - mi   # mri_Latn Maori
  - my   # mya_Mymr Burmese
  - nl   # nld_Latn Dutch
  - nb   # nob_Latn Norwegian Bokmål
  - ne   # npi_Deva Nepali
  - nso  # nso_Latn Northern Sotho
  - nus  # nus_Latn Nuer
  - ny   # nya_Latn Nyanja
  - oc   # oci_Latn Occitan
  - ory  # ory_Orya Odia
  - pag  # pag_Latn Pangasinan
  - pa   # pan_Guru Eastern Panjabi
  - pap  # pap_Latn Papiamento
  - pbt  # pbt_Arab Southern Pashto
  - fa   # pes_Arab Western Persian
  - plt  # plt_Latn Plateau Malagasy
  - pl   # pol_Latn Polish
  - pt   # por_Latn Portuguese
  - prs  # prs_Arab Dari
  - qu   # quy_Latn Ayacucho Quechua
  - ro   # ron_Latn Romanian
  - rn   # run_Latn Rundi
  - ru   # rus_Cyrl Russian
  - sg   # sag_Latn Sango
  - sa   # san_Deva Sanskrit
  - sat  # sat_Beng Santali
  - scn  # scn_Latn Sicilian
  - shn  # shn_Mymr Shan
  - si   # sin_Sinh Sinhala
  - sk   # slk_Latn Slovak
  - sl   # slv_Latn Slovenian
  - sm   # smo_Latn Samoan
  - sn   # sna_Latn Shona
  - sd   # snd_Arab Sindhi
  - so   # som_Latn Somali
  - st   # sot_Latn Southern Sotho
  - es   # spa_Latn Spanish
  - sc   # srd_Latn Sardinian
  - sr   # srp_Cyrl Serbian
  - ss   # ssw_Latn Swati
  - su   # sun_Latn Sundanese
  - sv   # swe_Latn Swedish
  - sw   # swh_Latn Swahili
  - szl  # szl_Latn Silesian
  - ta   # tam_Taml Tamil
  - taq  # taq_Latn Tamasheq (Latin script)
  - tt   # tat_Cyrl Tatar
  - te   # tel_Telu Telugu
  - tg   # tgk_Cyrl Tajik
  - tl   # tgl_Latn Tagalog
  - ti   # tir_Ethi Tigrinya
  - tpi  # tpi_Latn Tok Pisin
  - tn   # tsn_Latn Tswana
  - ts   # tso_Latn Tsonga
  - tk   # tuk_Latn Turkmen
  - tum  # tum_Latn Tumbuka
  - tr   # tur_Latn Turkish
  - tw   # twi_Latn Twi
  - tzm  # tzm_Tfng Central Atlas Tamazight
  - ug   # uig_Arab Uyghur
  - uk   # ukr_Cyrl Ukrainian
  - umb  # umb_Latn Umbundu
  - ur   # urd_Arab Urdu
  - uz   # uzn_Latn Northern Uzbek
  - vec  # vec_Latn Venetian
  - vi   # vie_Latn Vietnamese
  - war  # war_Latn Waray
  - wo   # wol_Latn Wolof
  - xh   # xho_Latn Xhosa
  - yi   # ydd_Hebr Eastern Yiddish
  - yo   # yor_Latn Yoruba
  - zh   # zho_Hans Chinese (Simplified)
  - zh   # zho_Hant Chinese (Traditional)
  - ms   # zsm_Latn Standard Malay
  - zu   # zul_Latn Zulu
---

This is a truncated version of the [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model
(6 layers instead of 12, a hidden size of 512 instead of 1024) with 175M parameters (131M of which are token embeddings).
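
The embedding share quoted above can be sanity-checked with simple arithmetic. The sketch below assumes that the vocabulary size is the 256,206 tokens listed in the upstream `facebook/nllb-200-distilled-600M` config and that a single embedding matrix is shared across the model; neither figure is stated in this card, so treat both as assumptions.

```python
# Assumption: the vocabulary size comes from the upstream
# facebook/nllb-200-distilled-600M config and is unchanged by the truncation.
VOCAB_SIZE = 256_206
# Stated in this card: the hidden size after truncation.
HIDDEN_DIM = 512


def embedding_parameters(vocab_size: int, hidden_dim: int) -> int:
    """Number of weights in one vocab_size x hidden_dim embedding matrix."""
    return vocab_size * hidden_dim


embedding_params = embedding_parameters(VOCAB_SIZE, HIDDEN_DIM)
print(f"{embedding_params / 1e6:.1f}M embedding parameters")
```

Under these assumptions the result is about 131M, which matches the figure above and shows why the vocabulary dominates the parameter count of a model this small.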

This model was fine-tuned on the [slone/nllb-200-10M-sample](https://huggingface.co/datasets/slone/nllb-200-10M-sample) subset of
the [NLLB dataset](https://huggingface.co/datasets/allenai/nllb) with 175 languages, using only the samples with a BLASER score above 3.5.

Because of its small size, the model translates poorly on its own, but it can serve as a base model for further fine-tuning on a small number of languages.
Before fine-tuning, it is recommended to [prune the vocabulary of this model](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90)
so that only the tokens used by the intended languages are kept.
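
The core of vocabulary pruning can be sketched without any model at all. The example below is a minimal illustration on toy data, not the procedure from the linked article: `prune_embeddings` is a hypothetical helper, the embedding matrix is a plain list of rows, and the "corpus" is a list of already tokenized sentences. With the real model, the same row selection would have to be applied to the model's embedding weights, and the tokenizer vocabulary would need to be rebuilt to match the new ids.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def prune_embeddings(
    embeddings: Sequence[Sequence[float]],
    corpus_token_ids: Iterable[Sequence[int]],
    always_keep: Iterable[int] = (),
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Keep only the embedding rows whose token ids occur in the corpus.

    Returns the pruned matrix and a mapping from old token ids to new ones.
    """
    # Tokens seen in the target-language corpus, plus ids that must survive
    # regardless (for example special tokens and language-code tokens).
    kept = sorted(set(always_keep) | {t for sent in corpus_token_ids for t in sent})
    old_to_new = {old: new for new, old in enumerate(kept)}
    pruned = [list(embeddings[old]) for old in kept]
    return pruned, old_to_new


# Toy data: a 10-token vocabulary with 2-dimensional embeddings.
embeddings = [[float(i), float(i) + 0.5] for i in range(10)]
corpus = [[4, 7, 4], [9, 7]]   # hypothetical tokenized sentences
special_ids = [0, 1, 2]        # hypothetical special-token ids

pruned, old_to_new = prune_embeddings(embeddings, corpus, special_ids)
remapped = [[old_to_new[t] for t in sent] for sent in corpus]
print(len(embeddings), "->", len(pruned), "rows")
```

Sorting the kept ids preserves the relative order of the original vocabulary, which makes the old-to-new mapping easy to inspect and to reuse when rewriting the tokenizer.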