---
tags:
- Transformers
- text-classification
- multi-class-classification
languages:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
multilinguality:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
datasets:
- qanastek/MASSIVE
widget:
- text: "wake me up at five am this week"
- text: "je veux écouter la chanson de jacques brel encore une fois"
- text: "quiero escuchar la canción de arijit singh una vez más"
- text: "olly onde é que á um parque por perto onde eu possa correr"
- text: "פרק הבא בפודקאסט בבקשה"
- text: "亚马逊股价"
- text: "найди билет на поезд в санкт-петербург"
license: cc-by-4.0
---

**People Involved**

* [LABRAK Yanis](https://www.linkedin.com/in/yanis-labrak-8a7412145/) (1)

**Affiliations**

1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.

## Model

XLM-Roberta : [https://huggingface.co/xlm-roberta-base](https://huggingface.co/xlm-roberta-base)

Paper : [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf)

## Demo: How to use in HuggingFace Transformers Pipeline

Requires [transformers](https://pypi.org/project/transformers/): ```pip install transformers```

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

# Load the tokenizer and the classification model from the Hugging Face Hub.
model_name = 'qanastek/51-languages-classifier'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Wrap both in a pipeline that returns the predicted locale label and its score.
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)
res = classifier("פרק הבא בפודקאסט בבקשה")
print(res)
```

Outputs:

```python
[{'label': 'he-IL', 'score': 0.9998375177383423}]
```
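
The pipeline returns a list of `{'label', 'score'}` dictionaries, so a caller may want to reject low-confidence predictions. The following is a minimal sketch of one way to do that; the helper name `pick_language` and the `min_score` threshold of 0.5 are illustrative assumptions, not part of the model or of Transformers.

```python
from typing import Dict, List, Optional


def pick_language(predictions: List[Dict], min_score: float = 0.5) -> Optional[str]:
    """Return the best label if its score reaches min_score, else None.

    `predictions` is expected to look like the pipeline output shown above.
    Both the helper and the default threshold are hypothetical.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


# Output copied from the demo above.
res = [{'label': 'he-IL', 'score': 0.9998375177383423}]
print(pick_language(res))  # he-IL
```

A suitable threshold depends on the application; short or code-switched inputs may deserve a stricter value.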

## Training data

[MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of more than 1M utterances across 51 languages, annotated for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, which is composed of general Intelligent Voice Assistant single-shot interactions.

### Languages

The model can distinguish 51 languages:

- `Afrikaans - South Africa (af-ZA)`
- `Amharic - Ethiopia (am-ET)`
- `Arabic - Saudi Arabia (ar-SA)`
- `Azeri - Azerbaijan (az-AZ)`
- `Bengali - Bangladesh (bn-BD)`
- `Chinese - China (zh-CN)`
- `Chinese - Taiwan (zh-TW)`
- `Danish - Denmark (da-DK)`
- `German - Germany (de-DE)`
- `Greek - Greece (el-GR)`
- `English - United States (en-US)`
- `Spanish - Spain (es-ES)`
- `Farsi - Iran (fa-IR)`
- `Finnish - Finland (fi-FI)`
- `French - France (fr-FR)`
- `Hebrew - Israel (he-IL)`
- `Hungarian - Hungary (hu-HU)`
- `Armenian - Armenia (hy-AM)`
- `Indonesian - Indonesia (id-ID)`
- `Icelandic - Iceland (is-IS)`
- `Italian - Italy (it-IT)`
- `Japanese - Japan (ja-JP)`
- `Javanese - Indonesia (jv-ID)`
- `Georgian - Georgia (ka-GE)`
- `Khmer - Cambodia (km-KH)`
- `Korean - Korea (ko-KR)`
- `Latvian - Latvia (lv-LV)`
- `Mongolian - Mongolia (mn-MN)`
- `Malay - Malaysia (ms-MY)`
- `Burmese - Myanmar (my-MM)`
- `Norwegian - Norway (nb-NO)`
- `Dutch - Netherlands (nl-NL)`
- `Polish - Poland (pl-PL)`
- `Portuguese - Portugal (pt-PT)`
- `Romanian - Romania (ro-RO)`
- `Russian - Russia (ru-RU)`
- `Slovenian - Slovenia (sl-SL)`
- `Albanian - Albania (sq-AL)`
- `Swedish - Sweden (sv-SE)`
- `Swahili - Kenya (sw-KE)`
- `Hindi - India (hi-IN)`
- `Kannada - India (kn-IN)`
- `Malayalam - India (ml-IN)`
- `Tamil - India (ta-IN)`
- `Telugu - India (te-IN)`
- `Thai - Thailand (th-TH)`
- `Tagalog - Philippines (tl-PH)`
- `Turkish - Turkey (tr-TR)`
- `Urdu - Pakistan (ur-PK)`
- `Vietnamese - Vietnam (vi-VN)`
- `Welsh - United Kingdom (cy-GB)`
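
Each label is a locale tag made of a language code and a region code, and some languages appear with more than one region (for example `zh-CN` and `zh-TW`). If only the language part is needed, the labels can be split on the hyphen. This is a small sketch; `split_label` is a hypothetical helper, and the sample labels are taken from the list above.

```python
from collections import defaultdict
from typing import Tuple


def split_label(label: str) -> Tuple[str, str]:
    """Split a locale label such as 'he-IL' into (language, region)."""
    language, _, region = label.partition("-")
    return language, region


# A few labels from the list above.
labels = ["zh-CN", "zh-TW", "he-IL", "id-ID", "jv-ID"]

# Group region codes under their language code.
by_language = defaultdict(list)
for label in labels:
    language, region = split_label(label)
    by_language[language].append(region)

print(dict(by_language))
# {'zh': ['CN', 'TW'], 'he': ['IL'], 'id': ['ID'], 'jv': ['ID']}
```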

## Evaluation results

```plain
              precision    recall  f1-score   support

       af-ZA     0.9821    0.9805    0.9813      2974
       am-ET     1.0000    1.0000    1.0000      2974
       ar-SA     0.9809    0.9822    0.9815      2974
       az-AZ     0.9946    0.9845    0.9895      2974
       bn-BD     0.9997    0.9990    0.9993      2974
       cy-GB     0.9970    0.9929    0.9949      2974
       da-DK     0.9575    0.9617    0.9596      2974
       de-DE     0.9906    0.9909    0.9908      2974
       el-GR     0.9997    0.9973    0.9985      2974
       en-US     0.9712    0.9866    0.9788      2974
       es-ES     0.9825    0.9842    0.9834      2974
       fa-IR     0.9940    0.9973    0.9956      2974
       fi-FI     0.9943    0.9946    0.9945      2974
       fr-FR     0.9963    0.9923    0.9943      2974
       he-IL     1.0000    0.9997    0.9998      2974
       hi-IN     1.0000    0.9980    0.9990      2974
       hu-HU     0.9983    0.9950    0.9966      2974
       hy-AM     1.0000    0.9993    0.9997      2974
       id-ID     0.9319    0.9291    0.9305      2974
       is-IS     0.9966    0.9943    0.9955      2974
       it-IT     0.9698    0.9926    0.9811      2974
       ja-JP     0.9987    0.9963    0.9975      2974
       jv-ID     0.9628    0.9744    0.9686      2974
       ka-GE     0.9993    0.9997    0.9995      2974
       km-KH     0.9867    0.9963    0.9915      2974
       kn-IN     1.0000    0.9993    0.9997      2974
       ko-KR     0.9917    0.9997    0.9956      2974
       lv-LV     0.9990    0.9950    0.9970      2974
       ml-IN     0.9997    0.9997    0.9997      2974
       mn-MN     0.9987    0.9966    0.9976      2974
       ms-MY     0.9359    0.9418    0.9388      2974
       my-MM     1.0000    0.9993    0.9997      2974
       nb-NO     0.9600    0.9533    0.9566      2974
       nl-NL     0.9850    0.9748    0.9799      2974
       pl-PL     0.9946    0.9923    0.9934      2974
       pt-PT     0.9885    0.9798    0.9841      2974
       ro-RO     0.9919    0.9916    0.9918      2974
       ru-RU     0.9976    0.9983    0.9980      2974
       sl-SL     0.9956    0.9939    0.9948      2974
       sq-AL     0.9936    0.9896    0.9916      2974
       sv-SE     0.9902    0.9842    0.9872      2974
       sw-KE     0.9867    0.9953    0.9910      2974
       ta-IN     1.0000    1.0000    1.0000      2974
       te-IN     1.0000    0.9997    0.9998      2974
       th-TH     1.0000    0.9983    0.9992      2974
       tl-PH     0.9929    0.9899    0.9914      2974
       tr-TR     0.9869    0.9872    0.9871      2974
       ur-PK     0.9983    0.9929    0.9956      2974
       vi-VN     0.9993    0.9973    0.9983      2974
       zh-CN     0.9812    0.9832    0.9822      2974
       zh-TW     0.9832    0.9815    0.9823      2974

    accuracy                         0.9889    151674
   macro avg     0.9889    0.9889    0.9889    151674
weighted avg     0.9889    0.9889    0.9889    151674
```
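
The `f1-score` column is the harmonic mean of precision and recall, and the total support is the per-language support multiplied by the number of languages. The sketch below re-derives a few values from the report as a sanity check; the rows are copied from the table, and the tolerance of `5e-4` is an assumption that allows for the four-decimal rounding of the inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1-score) copied from the report above.
rows = {
    "af-ZA": (0.9821, 0.9805, 0.9813),
    "id-ID": (0.9319, 0.9291, 0.9305),
    "ms-MY": (0.9359, 0.9418, 0.9388),
}

for label, (precision, recall, reported) in rows.items():
    # Inputs are rounded to four decimals, so compare with a small tolerance.
    assert abs(f1_score(precision, recall) - reported) < 5e-4, label

# 51 languages with 2974 evaluation utterances each.
total_support = 51 * 2974
print(total_support)  # 151674
```

The lowest scores in the report belong to closely related pairs such as `id-ID` and `ms-MY`, or `da-DK` and `nb-NO`.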

Keywords: language identification; multilingual; classification