File size: 7,777 Bytes
0769cee
b0a8168
 
 
 
8d82603
 
b0a8168
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0769cee
 
b0a8168
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f45f707
 
b0a8168
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
790fd5d
b0a8168
790fd5d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b0a8168
790fd5d
 
 
b0a8168
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
---
tags:
- Transformers
- Token Classification
- Slot Annotation
- token-classification
- sequence-tagger-model
languages:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
multilinguality:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
datasets:
- qanastek/MASSIVE
widget:
- text: "wake me up at five am this week"
- text: "je veux écouter la chanson de jacques brel encore une fois"
- text: "quiero escuchar la canción de arijit singh una vez más"
- text: "olly onde é que á um parque por perto onde eu possa correr"
- text: "פרק הבא בפודקאסט בבקשה"
- text: "亚马逊股价"
- text: "найди билет на поезд в санкт-петербург"
license: cc-by-4.0
---

**People Involved**

* [LABRAK Yanis](https://www.linkedin.com/in/yanis-labrak-8a7412145/) (1)

**Affiliations**

1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.

## Demo: How to use in HuggingFace Transformers Pipeline

Requires [transformers](https://pypi.org/project/transformers/): ```pip install transformers```

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

tokenizer = AutoTokenizer.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')
model = AutoModelForTokenClassification.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')
predict = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
res = predict("réveille-moi à neuf heures du matin le vendredi")
print(res)
```

Outputs:

![English - Hebrew - Spanish](123.png)

```python
[{'word': '▁neuf', 'score': 0.9911066293716431, 'entity': 'B-time', 'index': 6, 'start': 15, 'end': 19},
{'word': '▁heures', 'score': 0.9200698733329773, 'entity': 'I-time', 'index': 7, 'start': 20, 'end': 26},
{'word': '▁du', 'score': 0.8476170897483826, 'entity': 'I-time', 'index': 8, 'start': 27, 'end': 29},
{'word': '▁matin', 'score': 0.8271021246910095, 'entity': 'I-time', 'index': 9, 'start': 30, 'end': 35},
{'word': '▁vendredi', 'score': 0.9813069701194763, 'entity': 'B-date', 'index': 11, 'start': 39, 'end': 47}]
```

## Training data

[MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.

## Named Entities

* O
* currency_name
* personal_info
* app_name
* list_name
* alarm_type
* cooking_type
* time_zone
* media_type
* change_amount
* transport_type
* drink_type
* news_topic
* artist_name
* weather_descriptor
* transport_name
* player_setting
* email_folder
* music_album
* coffee_type
* meal_type
* song_name
* date
* movie_type
* movie_name
* game_name
* business_type
* music_descriptor
* joke_type
* music_genre
* device_type
* house_place
* place_name
* sport_type
* podcast_name
* game_type
* timeofday
* business_name
* time
* definition_word
* audiobook_author
* event_name
* general_frequency
* relation
* color_type
* audiobook_name
* food_type
* person
* transport_agency
* email_address
* podcast_descriptor
* order_type
* ingredient
* transport_descriptor
* playlist_name
* radio_name

## Evaluation results

```plain
                      precision    recall  f1-score   support

                   O     0.9537    0.9498    0.9517   1031927
          alarm_type     0.8214    0.1800    0.2953       511
            app_name     0.3448    0.5318    0.4184       660
         artist_name     0.7143    0.8487    0.7757     11413
    audiobook_author     0.7038    0.2971    0.4178      1232
      audiobook_name     0.7271    0.5381    0.6185      5090
       business_name     0.8301    0.7862    0.8075     15385
       business_type     0.7009    0.6196    0.6577      4600
       change_amount     0.8179    0.9104    0.8617      1663
         coffee_type     0.6147    0.8322    0.7071       876
          color_type     0.6999    0.9176    0.7941      2890
        cooking_type     0.7037    0.5184    0.5970      1003
       currency_name     0.8479    0.9686    0.9042      6501
                date     0.8667    0.9348    0.8995     49866
     definition_word     0.9043    0.8135    0.8565      8333
         device_type     0.8502    0.8825    0.8661     11631
          drink_type     0.0000    0.0000    0.0000       131
       email_address     0.9715    0.9747    0.9731      3986
        email_folder     0.5913    0.9740    0.7359       884
          event_name     0.7659    0.7630    0.7645     38625
           food_type     0.6502    0.8697    0.7441     12353
           game_name     0.8974    0.6275    0.7386      4518
   general_frequency     0.8012    0.8673    0.8329      3173
         house_place     0.9337    0.9168    0.9252      7067
          ingredient     0.5481    0.0491    0.0901      1161
           joke_type     0.8147    0.9101    0.8598      1435
           list_name     0.8411    0.7275    0.7802      8188
           meal_type     0.6072    0.8926    0.7227      2282
          media_type     0.8578    0.8522    0.8550     17751
          movie_name     0.4598    0.1856    0.2645       431
          movie_type     0.2673    0.4341    0.3309       364
         music_album     0.0000    0.0000    0.0000       146
    music_descriptor     0.2906    0.3979    0.3359      1053
         music_genre     0.7999    0.7483    0.7732      5908
          news_topic     0.7052    0.5702    0.6306      9265
          order_type     0.6374    0.8845    0.7409      2614
              person     0.8173    0.9376    0.8733     33708
       personal_info     0.7035    0.7444    0.7234      1976
          place_name     0.8616    0.8228    0.8417     38881
      player_setting     0.6429    0.6212    0.6319      5409
       playlist_name     0.5852    0.5293    0.5559      3671
  podcast_descriptor     0.7486    0.5413    0.6283      4951
        podcast_name     0.6858    0.5675    0.6211      3339
          radio_name     0.8196    0.8013    0.8103      9892
            relation     0.6662    0.8569    0.7496      6477
           song_name     0.5617    0.7527    0.6433      7251
          sport_type     0.0000    0.0000    0.0000         0
                time     0.9032    0.8195    0.8593     35456
           time_zone     0.8368    0.4467    0.5824      2823
           timeofday     0.7931    0.8459    0.8187      6140
    transport_agency     0.7876    0.7764    0.7820      1051
transport_descriptor     0.5738    0.2756    0.3723       254
      transport_name     0.8497    0.5149    0.6412      1010
      transport_type     0.9303    0.8980    0.9139      6363
  weather_descriptor     0.8584    0.7466    0.7986     11702

            accuracy                         0.9092   1455270
           macro avg     0.6940    0.6668    0.6613   1455270
        weighted avg     0.9111    0.9092    0.9086   1455270
```