---
tags:
- Transformers
- Token Classification
- Slot Annotation
languages:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
multilinguality:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
datasets:
- qanastek/MASSIVE
widget:
- text: "wake me up at five am this week"
- text: "je veux écouter la chanson de jacques brel encore une fois"
- text: "quiero escuchar la canción de arijit singh una vez más"
- text: "olly onde é que á um parque por perto onde eu possa correr"
- text: "פרק הבא בפודקאסט בבקשה"
- text: "亚马逊股价"
- text: "найди билет на поезд в санкт-петербург"
license: cc-by-4.0
---

**People Involved**

* [LABRAK Yanis](https://www.linkedin.com/in/yanis-labrak-8a7412145/) (1)

**Affiliations**

1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.

## Demo: How to use in HuggingFace Transformers Pipeline

Requires [transformers](https://pypi.org/project/transformers/): `pip install transformers`

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

# Load the fine-tuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')
model = AutoModelForTokenClassification.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')

# Wrap both in a pipeline; by default it returns one prediction per token not tagged "O".
predict = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
res = predict("réveille-moi à neuf heures du matin le vendredi")
print(res)
```

Outputs:

```python
[{'word': '▁neuf', 'score': 0.9911066293716431, 'entity': 'B-time', 'index': 6, 'start': 15, 'end': 19},
{'word': '▁heures', 'score': 0.9200698733329773, 'entity': 'I-time', 'index': 7, 'start': 20, 'end': 26},
{'word': '▁du', 'score': 0.8476170897483826, 'entity': 'I-time', 'index': 8, 'start': 27, 'end': 29},
{'word': '▁matin', 'score': 0.8271021246910095, 'entity': 'I-time', 'index': 9, 'start': 30, 'end': 35},
{'word': '▁vendredi', 'score': 0.9813069701194763, 'entity': 'B-date', 'index': 11, 'start': 39, 'end': 47}]
```
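The pipeline returns one entry per sub-word token, so a slot such as `time` is spread over several `B-`/`I-` entries. The sketch below shows one way to merge them into slot spans using the `start`/`end` character offsets. The helper `group_entities` is hypothetical (it is not part of this model or of transformers), and taking the minimum token score as the span score is an arbitrary choice made for illustration. The `tokens` list is copied from the output above so the example runs without downloading the model.

```python
def group_entities(text, tokens):
    """Merge B-/I- tagged token predictions into contiguous slot spans.

    Hypothetical helper: an I- token extends the previous span only when
    the slot label matches; any other token starts a new span.
    """
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "I" and spans and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]
            spans[-1]["scores"].append(tok["score"])
        else:
            spans.append({"label": label, "start": tok["start"],
                          "end": tok["end"], "scores": [tok["score"]]})
    return [
        {"label": s["label"],
         "start": s["start"],
         "end": s["end"],
         "text": text[s["start"]:s["end"]],
         "score": min(s["scores"])}  # assumption: weakest token bounds the span
        for s in spans
    ]


text = "réveille-moi à neuf heures du matin le vendredi"

# Token-level predictions copied from the pipeline output shown above.
tokens = [
    {"word": "▁neuf", "score": 0.9911066293716431, "entity": "B-time", "index": 6, "start": 15, "end": 19},
    {"word": "▁heures", "score": 0.9200698733329773, "entity": "I-time", "index": 7, "start": 20, "end": 26},
    {"word": "▁du", "score": 0.8476170897483826, "entity": "I-time", "index": 8, "start": 27, "end": 29},
    {"word": "▁matin", "score": 0.8271021246910095, "entity": "I-time", "index": 9, "start": 30, "end": 35},
    {"word": "▁vendredi", "score": 0.9813069701194763, "entity": "B-date", "index": 11, "start": 39, "end": 47},
]

spans = group_entities(text, tokens)
for span in spans:
    print(span["label"], "->", span["text"])
```

Recent versions of transformers can also group tokens inside the pipeline itself through the `aggregation_strategy` argument (for example `aggregation_strategy="simple"`); the manual version above only makes the merging logic explicit.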

## Training data

[MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of more than 1M utterances across 51 languages, annotated for the Natural Language Understanding tasks of intent prediction and slot annotation. The utterances span 60 intents and 55 slot types. MASSIVE was created by localizing the SLURP dataset, which is composed of general Intelligent Voice Assistant single-shot interactions.

## Named Entities

* O
* currency_name
* personal_info
* app_name
* list_name
* alarm_type
* cooking_type
* time_zone
* media_type
* change_amount
* transport_type
* drink_type
* news_topic
* artist_name
* weather_descriptor
* transport_name
* player_setting
* email_folder
* music_album
* coffee_type
* meal_type
* song_name
* date
* movie_type
* movie_name
* game_name
* business_type
* music_descriptor
* joke_type
* music_genre
* device_type
* house_place
* place_name
* sport_type
* podcast_name
* game_type
* timeofday
* business_name
* time
* definition_word
* audiobook_author
* event_name
* general_frequency
* relation
* color_type
* audiobook_name
* food_type
* person
* transport_agency
* email_address
* podcast_descriptor
* order_type
* ingredient
* transport_descriptor
* playlist_name
* radio_name

## Evaluation results

```plain
                        precision    recall  f1-score   support

          B-alarm_type     0.8077    0.1074    0.1896       391
            B-app_name     0.3954    0.5581    0.4629       559
         B-artist_name     0.7424    0.8188    0.7787      7594
    B-audiobook_author     0.7626    0.2751    0.4044       607
      B-audiobook_name     0.7448    0.6036    0.6668      2863
       B-business_name     0.7705    0.8122    0.7908     11120
       B-business_type     0.6715    0.6630    0.6672      3596
       B-change_amount     0.7830    0.7719    0.7774       846
         B-coffee_type     0.4705    0.7298    0.5722       459
          B-color_type     0.6674    0.9244    0.7751      2618
        B-cooking_type     0.8630    0.4087    0.5547       986
       B-currency_name     0.8851    0.9250    0.9046      4496
                B-date     0.8638    0.9276    0.8946     33909
     B-definition_word     0.9051    0.8398    0.8712      6474
         B-device_type     0.8446    0.8407    0.8426      7459
          B-drink_type     0.0000    0.0000    0.0000       127
       B-email_address     0.9084    0.9684    0.9374      1075
        B-email_folder     0.7464    0.9457    0.8343       663
          B-event_name     0.7695    0.7648    0.7671     26880
           B-food_type     0.6861    0.8487    0.7588      9047
           B-game_name     0.8716    0.7366    0.7984      2866
   B-general_frequency     0.7587    0.7920    0.7750      1548
         B-house_place     0.9137    0.8814    0.8972      5765
          B-ingredient     0.6176    0.0479    0.0890       876
           B-joke_type     0.8029    0.7780    0.7903      1126
           B-list_name     0.8563    0.7582    0.8043      6195
           B-meal_type     0.5883    0.9148    0.7161      1948
          B-media_type     0.9111    0.8302    0.8688     13301
          B-movie_name     0.5312    0.3696    0.4359       230
          B-movie_type     0.2829    0.3931    0.3290       290
         B-music_album     0.0000    0.0000    0.0000       104
    B-music_descriptor     0.2987    0.2500    0.2722       760
         B-music_genre     0.7731    0.7953    0.7840      4821
          B-news_topic     0.6437    0.6668    0.6551      5441
          B-order_type     0.6739    0.8073    0.7346      2091
              B-person     0.8290    0.9138    0.8693     25490
       B-personal_info     0.6177    0.6765    0.6458      1249
          B-place_name     0.8696    0.8252    0.8468     30683
      B-player_setting     0.6816    0.6156    0.6469      4048
       B-playlist_name     0.5858    0.4923    0.5350      1942
  B-podcast_descriptor     0.7211    0.5209    0.6049      2367
        B-podcast_name     0.6930    0.5462    0.6109      2091
          B-radio_name     0.7304    0.7598    0.7448      4126
            B-relation     0.7852    0.8708    0.8258      5689
           B-song_name     0.6053    0.6909    0.6453      4131
          B-sport_type     0.0000    0.0000    0.0000         0
                B-time     0.8729    0.7341    0.7975     17338
           B-time_zone     0.7031    0.6352    0.6674      1428
           B-timeofday     0.7367    0.8168    0.7747      4853
    B-transport_agency     0.8161    0.7100    0.7594      1000
B-transport_descriptor     0.8333    0.1014    0.1807       148
      B-transport_name     0.7979    0.3002    0.4363       513
      B-transport_type     0.9332    0.8942    0.9133      5858
  B-weather_descriptor     0.8501    0.7775    0.8122      8815
          I-alarm_type     1.0000    0.1500    0.2609       120
            I-app_name     0.0000    0.0000    0.0000       101
         I-artist_name     0.7325    0.8246    0.7758      3819
    I-audiobook_author     0.8373    0.2800    0.4197       625
      I-audiobook_name     0.6729    0.5007    0.5742      2227
       I-business_name     0.7715    0.6326    0.6952      4265
       I-business_type     0.5932    0.4661    0.5220      1004
       I-change_amount     0.7876    0.8715    0.8274       817
         I-coffee_type     0.9160    0.5755    0.7069       417
          I-color_type     0.2781    0.1912    0.2266       272
        I-cooking_type     0.0000    0.0000    0.0000        17
       I-currency_name     0.8220    0.9327    0.8738      2005
                I-date     0.7864    0.8482    0.8161     15957
     I-definition_word     0.8707    0.6955    0.7733      1859
         I-device_type     0.8837    0.8907    0.8872      4172
          I-drink_type     0.0000    0.0000    0.0000         4
       I-email_address     0.9701    0.9794    0.9747      2911
        I-email_folder     0.6881    0.9186    0.7868       221
          I-event_name     0.6499    0.5553    0.5989     11745
           I-food_type     0.6568    0.7825    0.7141      3306
           I-game_name     0.8183    0.6053    0.6959      1652
   I-general_frequency     0.8168    0.8338    0.8252      1625
         I-house_place     0.9368    0.7289    0.8199      1302
          I-ingredient     0.2857    0.0070    0.0137       285
           I-joke_type     0.7941    0.6117    0.6910       309
           I-list_name     0.7382    0.3989    0.5179      1993
           I-meal_type     0.5699    0.8174    0.6716       334
          I-media_type     0.8355    0.7420    0.7860      4450
          I-movie_name     0.6250    0.1244    0.2075       201
          I-movie_type     0.0190    0.0541    0.0281        74
         I-music_album     0.0000    0.0000    0.0000        42
    I-music_descriptor     0.4231    0.5631    0.4832       293
         I-music_genre     0.6716    0.5455    0.6020      1087
          I-news_topic     0.6384    0.4187    0.5057      3824
          I-order_type     0.4850    0.8050    0.6053       523
              I-person     0.7844    0.8400    0.8113      8218
       I-personal_info     0.7343    0.8363    0.7820       727
          I-place_name     0.6855    0.6703    0.6778      8198
      I-player_setting     0.5293    0.4982    0.5132      1361
       I-playlist_name     0.4696    0.5182    0.4927      1729
  I-podcast_descriptor     0.7448    0.4563    0.5659      2584
        I-podcast_name     0.5923    0.5192    0.5534      1248
          I-radio_name     0.7907    0.8404    0.8148      5766
            I-relation     0.4455    0.4873    0.4655       788
           I-song_name     0.5495    0.7433    0.6319      3120
          I-sport_type     0.0000    0.0000    0.0000         0
                I-time     0.8115    0.8288    0.8201     18118
           I-time_zone     0.9708    0.1670    0.2850      1395
           I-timeofday     0.7531    0.5734    0.6511      1287
    I-transport_agency     0.6667    0.1961    0.3030        51
I-transport_descriptor     0.6429    0.0849    0.1500       106
      I-transport_name     0.8469    0.5453    0.6634       497
      I-transport_type     0.9556    0.7248    0.8243       505
  I-weather_descriptor     0.7756    0.4046    0.5318      2887
                     O     0.9502    0.9560    0.9531   1031927

              accuracy                         0.9031   1455270
             macro avg     0.6674    0.5860    0.5972   1455270
          weighted avg     0.9032    0.9031    0.9013   1455270
```
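The gap between the macro average (F1 0.5972) and the weighted average (F1 0.9013) comes from class imbalance: the macro average gives every label the same weight, while the weighted average scales each label by its support, so the `O` label with 1,031,927 tokens dominates it. The sketch below reproduces both formulas on three rows taken from the report above. It is an illustration on a subset only, so its results do not match the full-table averages, and the function names are hypothetical.

```python
# (label, f1-score, support) for three rows copied from the report above.
rows = [
    ("O", 0.9531, 1031927),
    ("B-date", 0.8946, 33909),
    ("B-drink_type", 0.0000, 127),
]


def macro_f1(rows):
    """Unweighted mean of per-label F1: every label counts equally."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Support-weighted mean of per-label F1: frequent labels dominate."""
    total = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total


macro = macro_f1(rows)
weighted = weighted_f1(rows)
print(f"macro F1 on subset:    {macro:.4f}")
print(f"weighted F1 on subset: {weighted:.4f}")
```

On this subset a single rare label with an F1 of zero pulls the macro average down sharply while barely moving the weighted one, which is the same pattern seen in the full report for labels such as `B-drink_type` and `B-music_album`.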