qanastek committed on
Commit b0a8168
1 Parent(s): 0769cee

First commit

.gitattributes CHANGED
@@ -25,3 +25,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zstandard filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.json filter=lfs diff=lfs merge=lfs -text
+ *.log filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,336 @@
  ---
+ tags:
+ - Transformers
+ - Token Classification
+ - Slot Annotation
+ languages:
+ - af-ZA
+ - am-ET
+ - ar-SA
+ - az-AZ
+ - bn-BD
+ - cy-GB
+ - da-DK
+ - de-DE
+ - el-GR
+ - en-US
+ - es-ES
+ - fa-IR
+ - fi-FI
+ - fr-FR
+ - he-IL
+ - hi-IN
+ - hu-HU
+ - hy-AM
+ - id-ID
+ - is-IS
+ - it-IT
+ - ja-JP
+ - jv-ID
+ - ka-GE
+ - km-KH
+ - kn-IN
+ - ko-KR
+ - lv-LV
+ - ml-IN
+ - mn-MN
+ - ms-MY
+ - my-MM
+ - nb-NO
+ - nl-NL
+ - pl-PL
+ - pt-PT
+ - ro-RO
+ - ru-RU
+ - sl-SL
+ - sq-AL
+ - sv-SE
+ - sw-KE
+ - ta-IN
+ - te-IN
+ - th-TH
+ - tl-PH
+ - tr-TR
+ - ur-PK
+ - vi-VN
+ - zh-CN
+ - zh-TW
+ multilinguality:
+ - af-ZA
+ - am-ET
+ - ar-SA
+ - az-AZ
+ - bn-BD
+ - cy-GB
+ - da-DK
+ - de-DE
+ - el-GR
+ - en-US
+ - es-ES
+ - fa-IR
+ - fi-FI
+ - fr-FR
+ - he-IL
+ - hi-IN
+ - hu-HU
+ - hy-AM
+ - id-ID
+ - is-IS
+ - it-IT
+ - ja-JP
+ - jv-ID
+ - ka-GE
+ - km-KH
+ - kn-IN
+ - ko-KR
+ - lv-LV
+ - ml-IN
+ - mn-MN
+ - ms-MY
+ - my-MM
+ - nb-NO
+ - nl-NL
+ - pl-PL
+ - pt-PT
+ - ro-RO
+ - ru-RU
+ - sl-SL
+ - sq-AL
+ - sv-SE
+ - sw-KE
+ - ta-IN
+ - te-IN
+ - th-TH
+ - tl-PH
+ - tr-TR
+ - ur-PK
+ - vi-VN
+ - zh-CN
+ - zh-TW
+ datasets:
+ - qanastek/MASSIVE
+ widget:
+ - text: "wake me up at five am this week"
+ - text: "je veux écouter la chanson de jacques brel encore une fois"
+ - text: "quiero escuchar la canción de arijit singh una vez más"
+ - text: "olly onde é que á um parque por perto onde eu possa correr"
+ - text: "פרק הבא בפודקאסט בבקשה"
+ - text: "亚马逊股价"
+ - text: "найди билет на поезд в санкт-петербург"
  license: cc-by-4.0
  ---
+
+ **People Involved**
+
+ * [LABRAK Yanis](https://www.linkedin.com/in/yanis-labrak-8a7412145/) (1)
+
+ **Affiliations**
+
+ 1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.
+
+ ## Demo: How to use in HuggingFace Transformers Pipeline
+
+ Requires [transformers](https://pypi.org/project/transformers/): ```pip install transformers```
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
+
+ tokenizer = AutoTokenizer.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')
+ model = AutoModelForTokenClassification.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')
+ predict = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
+ res = predict("réveille-moi à neuf heures du matin le vendredi")
+ print(res)
+ ```
+
+ Outputs:
+
+ ```python
+ [{'word': '▁neuf', 'score': 0.9911066293716431, 'entity': 'B-time', 'index': 6, 'start': 15, 'end': 19},
+ {'word': '▁heures', 'score': 0.9200698733329773, 'entity': 'I-time', 'index': 7, 'start': 20, 'end': 26},
+ {'word': '▁du', 'score': 0.8476170897483826, 'entity': 'I-time', 'index': 8, 'start': 27, 'end': 29},
+ {'word': '▁matin', 'score': 0.8271021246910095, 'entity': 'I-time', 'index': 9, 'start': 30, 'end': 35},
+ {'word': '▁vendredi', 'score': 0.9813069701194763, 'entity': 'B-date', 'index': 11, 'start': 39, 'end': 47}]
+ ```
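The pipeline output above lists one entry per SentencePiece token, so a multi-word slot such as the time expression is spread over several dictionaries. The following is a minimal sketch of merging consecutive `B-`/`I-` predictions into character spans. `group_entities` is a hypothetical helper, not part of transformers, and it assumes only the `entity`, `index`, `start`, and `end` fields shown in the output.

```python
def group_entities(text, predictions):
    """Merge consecutive B-/I- token predictions into (label, text) spans."""
    spans = []
    for pred in predictions:
        prefix, _, label = pred["entity"].partition("-")
        if not label:  # skip untagged tokens such as "O"
            continue
        last = spans[-1] if spans else None
        # An I- tag extends the previous span only if it carries the same
        # label and belongs to the very next token.
        if (prefix == "I" and last is not None and last["label"] == label
                and pred["index"] == last["last_index"] + 1):
            last["end"] = pred["end"]
            last["last_index"] = pred["index"]
        else:
            spans.append({"label": label, "start": pred["start"],
                          "end": pred["end"], "last_index": pred["index"]})
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


text = "réveille-moi à neuf heures du matin le vendredi"
# Token predictions copied from the output above (word and score omitted).
predictions = [
    {"entity": "B-time", "index": 6, "start": 15, "end": 19},
    {"entity": "I-time", "index": 7, "start": 20, "end": 26},
    {"entity": "I-time", "index": 8, "start": 27, "end": 29},
    {"entity": "I-time", "index": 9, "start": 30, "end": 35},
    {"entity": "B-date", "index": 11, "start": 39, "end": 47},
]
print(group_entities(text, predictions))
# [('time', 'neuf heures du matin'), ('date', 'vendredi')]
```

Recent versions of transformers also accept an `aggregation_strategy` argument on the token classification pipeline (for example `"simple"`), which performs a comparable grouping internally; the sketch only makes the idea explicit.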
+
+ ## Training data
+
+ [MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.
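Slot annotation means that each token of an utterance carries a slot label. As a rough illustration, the sketch below converts an inline-annotated utterance into the `B-`/`I-`/`O` tags used in the evaluation report. It rests on two assumptions that the README does not state: slots are marked as `[slot : value]`, and tokens are separated by whitespace (which does not hold for every language in the dataset). `annotated_to_bio` is a hypothetical helper, and the annotated string is an illustrative annotation of the first widget example.

```python
import re

# One inline annotation such as "[time : five am]" (assumed notation).
SLOT_PATTERN = re.compile(r"\[\s*([^\]:]+?)\s*:\s*([^\]]+?)\s*\]")


def annotated_to_bio(annotated):
    """Turn an inline-annotated utterance into (tokens, BIO tags)."""
    tokens, tags = [], []

    def add_plain(segment):
        # Words outside any bracket carry the "O" (outside) tag.
        for word in segment.split():
            tokens.append(word)
            tags.append("O")

    position = 0
    for match in SLOT_PATTERN.finditer(annotated):
        add_plain(annotated[position:match.start()])
        slot, value = match.group(1), match.group(2)
        # First word of a slot gets B-, the following words get I-.
        for i, word in enumerate(value.split()):
            tokens.append(word)
            tags.append(("B-" if i == 0 else "I-") + slot)
        position = match.end()
    add_plain(annotated[position:])
    return tokens, tags


tokens, tags = annotated_to_bio("wake me up at [time : five am] [date : this week]")
print(list(zip(tokens, tags)))
```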
+
+ ## Named Entities
+
+ * O
+ * currency_name
+ * personal_info
+ * app_name
+ * list_name
+ * alarm_type
+ * cooking_type
+ * time_zone
+ * media_type
+ * change_amount
+ * transport_type
+ * drink_type
+ * news_topic
+ * artist_name
+ * weather_descriptor
+ * transport_name
+ * player_setting
+ * email_folder
+ * music_album
+ * coffee_type
+ * meal_type
+ * song_name
+ * date
+ * movie_type
+ * movie_name
+ * game_name
+ * business_type
+ * music_descriptor
+ * joke_type
+ * music_genre
+ * device_type
+ * house_place
+ * place_name
+ * sport_type
+ * podcast_name
+ * game_type
+ * timeofday
+ * business_name
+ * time
+ * definition_word
+ * audiobook_author
+ * event_name
+ * general_frequency
+ * relation
+ * color_type
+ * audiobook_name
+ * food_type
+ * person
+ * transport_agency
+ * email_address
+ * podcast_descriptor
+ * order_type
+ * ingredient
+ * transport_descriptor
+ * playlist_name
+ * radio_name
+
+ ## Evaluation results
+
+ ```plain
+                        precision    recall  f1-score   support
+
+ B-alarm_type              0.8077    0.1074    0.1896       391
+ B-app_name                0.3954    0.5581    0.4629       559
+ B-artist_name             0.7424    0.8188    0.7787      7594
+ B-audiobook_author        0.7626    0.2751    0.4044       607
+ B-audiobook_name          0.7448    0.6036    0.6668      2863
+ B-business_name           0.7705    0.8122    0.7908     11120
+ B-business_type           0.6715    0.6630    0.6672      3596
+ B-change_amount           0.7830    0.7719    0.7774       846
+ B-coffee_type             0.4705    0.7298    0.5722       459
+ B-color_type              0.6674    0.9244    0.7751      2618
+ B-cooking_type            0.8630    0.4087    0.5547       986
+ B-currency_name           0.8851    0.9250    0.9046      4496
+ B-date                    0.8638    0.9276    0.8946     33909
+ B-definition_word         0.9051    0.8398    0.8712      6474
+ B-device_type             0.8446    0.8407    0.8426      7459
+ B-drink_type              0.0000    0.0000    0.0000       127
+ B-email_address           0.9084    0.9684    0.9374      1075
+ B-email_folder            0.7464    0.9457    0.8343       663
+ B-event_name              0.7695    0.7648    0.7671     26880
+ B-food_type               0.6861    0.8487    0.7588      9047
+ B-game_name               0.8716    0.7366    0.7984      2866
+ B-general_frequency       0.7587    0.7920    0.7750      1548
+ B-house_place             0.9137    0.8814    0.8972      5765
+ B-ingredient              0.6176    0.0479    0.0890       876
+ B-joke_type               0.8029    0.7780    0.7903      1126
+ B-list_name               0.8563    0.7582    0.8043      6195
+ B-meal_type               0.5883    0.9148    0.7161      1948
+ B-media_type              0.9111    0.8302    0.8688     13301
+ B-movie_name              0.5312    0.3696    0.4359       230
+ B-movie_type              0.2829    0.3931    0.3290       290
+ B-music_album             0.0000    0.0000    0.0000       104
+ B-music_descriptor        0.2987    0.2500    0.2722       760
+ B-music_genre             0.7731    0.7953    0.7840      4821
+ B-news_topic              0.6437    0.6668    0.6551      5441
+ B-order_type              0.6739    0.8073    0.7346      2091
+ B-person                  0.8290    0.9138    0.8693     25490
+ B-personal_info           0.6177    0.6765    0.6458      1249
+ B-place_name              0.8696    0.8252    0.8468     30683
+ B-player_setting          0.6816    0.6156    0.6469      4048
+ B-playlist_name           0.5858    0.4923    0.5350      1942
+ B-podcast_descriptor      0.7211    0.5209    0.6049      2367
+ B-podcast_name            0.6930    0.5462    0.6109      2091
+ B-radio_name              0.7304    0.7598    0.7448      4126
+ B-relation                0.7852    0.8708    0.8258      5689
+ B-song_name               0.6053    0.6909    0.6453      4131
+ B-sport_type              0.0000    0.0000    0.0000         0
+ B-time                    0.8729    0.7341    0.7975     17338
+ B-time_zone               0.7031    0.6352    0.6674      1428
+ B-timeofday               0.7367    0.8168    0.7747      4853
+ B-transport_agency        0.8161    0.7100    0.7594      1000
+ B-transport_descriptor    0.8333    0.1014    0.1807       148
+ B-transport_name          0.7979    0.3002    0.4363       513
+ B-transport_type          0.9332    0.8942    0.9133      5858
+ B-weather_descriptor      0.8501    0.7775    0.8122      8815
+ I-alarm_type              1.0000    0.1500    0.2609       120
+ I-app_name                0.0000    0.0000    0.0000       101
+ I-artist_name             0.7325    0.8246    0.7758      3819
+ I-audiobook_author        0.8373    0.2800    0.4197       625
+ I-audiobook_name          0.6729    0.5007    0.5742      2227
+ I-business_name           0.7715    0.6326    0.6952      4265
+ I-business_type           0.5932    0.4661    0.5220      1004
+ I-change_amount           0.7876    0.8715    0.8274       817
+ I-coffee_type             0.9160    0.5755    0.7069       417
+ I-color_type              0.2781    0.1912    0.2266       272
+ I-cooking_type            0.0000    0.0000    0.0000        17
+ I-currency_name           0.8220    0.9327    0.8738      2005
+ I-date                    0.7864    0.8482    0.8161     15957
+ I-definition_word         0.8707    0.6955    0.7733      1859
+ I-device_type             0.8837    0.8907    0.8872      4172
+ I-drink_type              0.0000    0.0000    0.0000         4
+ I-email_address           0.9701    0.9794    0.9747      2911
+ I-email_folder            0.6881    0.9186    0.7868       221
+ I-event_name              0.6499    0.5553    0.5989     11745
+ I-food_type               0.6568    0.7825    0.7141      3306
+ I-game_name               0.8183    0.6053    0.6959      1652
+ I-general_frequency       0.8168    0.8338    0.8252      1625
+ I-house_place             0.9368    0.7289    0.8199      1302
+ I-ingredient              0.2857    0.0070    0.0137       285
+ I-joke_type               0.7941    0.6117    0.6910       309
+ I-list_name               0.7382    0.3989    0.5179      1993
+ I-meal_type               0.5699    0.8174    0.6716       334
+ I-media_type              0.8355    0.7420    0.7860      4450
+ I-movie_name              0.6250    0.1244    0.2075       201
+ I-movie_type              0.0190    0.0541    0.0281        74
+ I-music_album             0.0000    0.0000    0.0000        42
+ I-music_descriptor        0.4231    0.5631    0.4832       293
+ I-music_genre             0.6716    0.5455    0.6020      1087
+ I-news_topic              0.6384    0.4187    0.5057      3824
+ I-order_type              0.4850    0.8050    0.6053       523
+ I-person                  0.7844    0.8400    0.8113      8218
+ I-personal_info           0.7343    0.8363    0.7820       727
+ I-place_name              0.6855    0.6703    0.6778      8198
+ I-player_setting          0.5293    0.4982    0.5132      1361
+ I-playlist_name           0.4696    0.5182    0.4927      1729
+ I-podcast_descriptor      0.7448    0.4563    0.5659      2584
+ I-podcast_name            0.5923    0.5192    0.5534      1248
+ I-radio_name              0.7907    0.8404    0.8148      5766
+ I-relation                0.4455    0.4873    0.4655       788
+ I-song_name               0.5495    0.7433    0.6319      3120
+ I-sport_type              0.0000    0.0000    0.0000         0
+ I-time                    0.8115    0.8288    0.8201     18118
+ I-time_zone               0.9708    0.1670    0.2850      1395
+ I-timeofday               0.7531    0.5734    0.6511      1287
+ I-transport_agency        0.6667    0.1961    0.3030        51
+ I-transport_descriptor    0.6429    0.0849    0.1500       106
+ I-transport_name          0.8469    0.5453    0.6634       497
+ I-transport_type          0.9556    0.7248    0.8243       505
+ I-weather_descriptor      0.7756    0.4046    0.5318      2887
+ O                         0.9502    0.9560    0.9531   1031927
+
+ accuracy                                      0.9031   1455270
+ macro avg                 0.6674    0.5860    0.5972   1455270
+ weighted avg              0.9032    0.9031    0.9013   1455270
+ ```
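To read the report above, it helps to recall how the columns relate: each F1 score is the harmonic mean of precision and recall, the macro average weighs every label equally, and the weighted average weighs labels by support. The sketch below recomputes these quantities for three rows copied from the report; the function names are illustrative, and this is not the script that produced the report.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_and_weighted_f1(rows):
    """rows maps label -> (precision, recall, support)."""
    f1s = {label: f1_score(p, r) for label, (p, r, _) in rows.items()}
    macro = sum(f1s.values()) / len(f1s)
    total = sum(support for _, _, support in rows.values())
    weighted = sum(f1s[label] * rows[label][2] for label in rows) / total
    return macro, weighted


# Three rows copied from the report, for illustration only.
rows = {
    "B-alarm_type": (0.8077, 0.1074, 391),
    "B-date": (0.8638, 0.9276, 33909),
    "B-drink_type": (0.0000, 0.0000, 127),
}
print(round(f1_score(0.8077, 0.1074), 4))  # 0.1896, matching the B-alarm_type row
macro, weighted = macro_and_weighted_f1(rows)
print(macro < weighted)  # True: the frequent B-date label dominates the weighted mean
```

The same effect explains the gap between the macro F1 (0.5972) and the weighted F1 (0.9013) in the full report: the `O` label accounts for 1,031,927 of the 1,455,270 tokens and scores well, while many rare slot types score poorly.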
config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:57fb34b76edf2fb00f2c93075556fd97b000c9c9b77218001edc7c1fd63194bd
+ size 6814
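Because this commit adds `*.json` to the LFS patterns in `.gitattributes`, the diff for `config.json` shows a Git LFS pointer rather than the JSON itself. A pointer is a small text file of `key value` lines; the sketch below parses one into a dictionary. `parse_lfs_pointer` is a hypothetical helper written for illustration and handles only the three keys visible in this diff.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer content copied from the config.json diff above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:57fb34b76edf2fb00f2c93075556fd97b000c9c9b77218001edc7c1fd63194bd
size 6814
"""
info = parse_lfs_pointer(pointer)
print(info["algorithm"], info["size"])  # sha256 6814
```

The `size` field is what makes the larger entries in this commit easy to compare, for example `pytorch_model.bin` at 1110224689 bytes against `optimizer.pt` at 2220422629 bytes.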
ner_multi.log ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:358cb0ae12d8b1cd563f1dff6706909cbdc0af4dfb98215f62ac258aacb9e21d
+ size 23837329
optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4d52d0764f64ae24c9280aafd900699b3310d9b767179d290506b2d4dea37532
+ size 2220422629
predict.py ADDED
@@ -0,0 +1,9 @@
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
+
+ tokenizer = AutoTokenizer.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')
+ model = AutoModelForTokenClassification.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')
+ predict = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
+ res = predict("je veux écouter la chanson de jacques brel encore une fois")
+
+ for r in res:
+     print(r)
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c203ac5c59ab9397ce00b38cf31c3333bc05645c8f106682641012fce8ceee57
+ size 1110224689
rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6f1845ed1a6efb77f8a6fb6021fefb2a57a751c179e74c8a5f5d752a6e915a0e
+ size 15523
scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd7c1ff82b8235e42c70f6367580762323a5262ede689b6bba1eb9de505cd572
+ size 623
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:378eb3bf733eb16e65792d7e3fda5b8a4631387ca04d2015199c4d4f22ae554d
+ size 239
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62c24cdc13d4c9952d63718d6c9fa4c287974249e16b7ade6d5a85e7bbb75626
+ size 17082660
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2a29b4422dd3e2b3311e0b5026f27f884f4c0b0ca566e8cd598025cf873f493d
+ size 398
trainer_state.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93c1592abdb2b725ef991f71a25c39399a6f5bb49185f538d5d0c4ae8caed2e9
+ size 1246
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9d8ae958f7716bb49af9afe09710a29a5c5194a231a2c502398560bfda608998
+ size 3055