---
tags:
- Transformers
- Token Classification
- Slot Annotation
languages:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
multilinguality:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
datasets:
- qanastek/MASSIVE
widget:
- text: "wake me up at five am this week"
- text: "je veux écouter la chanson de jacques brel encore une fois"
- text: "quiero escuchar la canción de arijit singh una vez más"
- text: "olly onde é que á um parque por perto onde eu possa correr"
- text: "פרק הבא בפודקאסט בבקשה"
- text: "亚马逊股价"
- text: "найди билет на поезд в санкт-петербург"
license: cc-by-4.0
---

**People Involved**

* [LABRAK Yanis](https://www.linkedin.com/in/yanis-labrak-8a7412145/) (1)

**Affiliations**

1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.

## Demo: How to use in HuggingFace Transformers Pipeline

Requires [transformers](https://pypi.org/project/transformers/): ```pip install transformers```

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

# Load the tokenizer and the fine-tuned token classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')
model = AutoModelForTokenClassification.from_pretrained('qanastek/XLMRoberta-Alexa-Intents-NER-NLU')

# Build the pipeline and tag a French utterance
predict = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
res = predict("réveille-moi à neuf heures du matin le vendredi")
print(res)
```

Outputs:

```python
[{'word': '▁neuf', 'score': 0.9911066293716431, 'entity': 'B-time', 'index': 6, 'start': 15, 'end': 19},
 {'word': '▁heures', 'score': 0.9200698733329773, 'entity': 'I-time', 'index': 7, 'start': 20, 'end': 26},
 {'word': '▁du', 'score': 0.8476170897483826, 'entity': 'I-time', 'index': 8, 'start': 27, 'end': 29},
 {'word': '▁matin', 'score': 0.8271021246910095, 'entity': 'I-time', 'index': 9, 'start': 30, 'end': 35},
 {'word': '▁vendredi', 'score': 0.9813069701194763, 'entity': 'B-date', 'index': 11, 'start': 39, 'end': 47}]
```

## Training data

[MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of more than 1M utterances across 51 languages, annotated for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, which is composed of general Intelligent Voice Assistant single-shot interactions.

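Returning to the demo output: the pipeline reports one prediction per sub-word token, labelled `B-<slot>` for the first token of a slot and `I-<slot>` for its continuation. The following is a minimal sketch of merging those tokens into slot spans using the `start`/`end` character offsets. The helper name `group_slots` and the lenient handling of a stray `I-` tag are assumptions made for illustration, not part of this model or of the library.

```python
from typing import Dict, List


def group_slots(text: str, predictions: List[Dict]) -> List[Dict]:
    """Merge token-level BIO predictions into slot spans (hypothetical helper)."""
    spans: List[Dict] = []
    for pred in predictions:
        if pred["entity"] == "O":
            continue  # tokens outside any slot carry no span
        prefix, _, slot = pred["entity"].partition("-")
        previous = spans[-1] if spans else None
        # An I- tag extends the previous span only if it has the same slot
        # type and is the very next token; otherwise a new span is opened
        # (assumption: a stray I- tag is treated like a B- tag).
        continues = (
            prefix == "I"
            and previous is not None
            and previous["slot"] == slot
            and pred["index"] == previous["last_index"] + 1
        )
        if continues:
            previous["end"] = pred["end"]
            previous["last_index"] = pred["index"]
        else:
            spans.append({
                "slot": slot,
                "start": pred["start"],
                "end": pred["end"],
                "last_index": pred["index"],
            })
    # Recover the surface text of each span from the character offsets
    return [{"slot": s["slot"], "text": text[s["start"]:s["end"]]} for s in spans]


# Input sentence and (abridged) predictions copied from the demo output
utterance = "réveille-moi à neuf heures du matin le vendredi"
demo_output = [
    {"word": "▁neuf", "entity": "B-time", "index": 6, "start": 15, "end": 19},
    {"word": "▁heures", "entity": "I-time", "index": 7, "start": 20, "end": 26},
    {"word": "▁du", "entity": "I-time", "index": 8, "start": 27, "end": 29},
    {"word": "▁matin", "entity": "I-time", "index": 9, "start": 30, "end": 35},
    {"word": "▁vendredi", "entity": "B-date", "index": 11, "start": 39, "end": 47},
]
slots = group_slots(utterance, demo_output)
print(slots)
```

For the demo sentence this yields a `time` span covering "neuf heures du matin" and a `date` span covering "vendredi". The pipeline also accepts an `aggregation_strategy` argument that performs grouping internally, which may be preferable in practice.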
## Named Entities

* O
* currency_name
* personal_info
* app_name
* list_name
* alarm_type
* cooking_type
* time_zone
* media_type
* change_amount
* transport_type
* drink_type
* news_topic
* artist_name
* weather_descriptor
* transport_name
* player_setting
* email_folder
* music_album
* coffee_type
* meal_type
* song_name
* date
* movie_type
* movie_name
* game_name
* business_type
* music_descriptor
* joke_type
* music_genre
* device_type
* house_place
* place_name
* sport_type
* podcast_name
* game_type
* timeofday
* business_name
* time
* definition_word
* audiobook_author
* event_name
* general_frequency
* relation
* color_type
* audiobook_name
* food_type
* person
* transport_agency
* email_address
* podcast_descriptor
* order_type
* ingredient
* transport_descriptor
* playlist_name
* radio_name

## Evaluation results

```plain
                        precision    recall  f1-score   support

B-alarm_type               0.8077    0.1074    0.1896       391
B-app_name                 0.3954    0.5581    0.4629       559
B-artist_name              0.7424    0.8188    0.7787      7594
B-audiobook_author         0.7626    0.2751    0.4044       607
B-audiobook_name           0.7448    0.6036    0.6668      2863
B-business_name            0.7705    0.8122    0.7908     11120
B-business_type            0.6715    0.6630    0.6672      3596
B-change_amount            0.7830    0.7719    0.7774       846
B-coffee_type              0.4705    0.7298    0.5722       459
B-color_type               0.6674    0.9244    0.7751      2618
B-cooking_type             0.8630    0.4087    0.5547       986
B-currency_name            0.8851    0.9250    0.9046      4496
B-date                     0.8638    0.9276    0.8946     33909
B-definition_word          0.9051    0.8398    0.8712      6474
B-device_type              0.8446    0.8407    0.8426      7459
B-drink_type               0.0000    0.0000    0.0000       127
B-email_address            0.9084    0.9684    0.9374      1075
B-email_folder             0.7464    0.9457    0.8343       663
B-event_name               0.7695    0.7648    0.7671     26880
B-food_type                0.6861    0.8487    0.7588      9047
B-game_name                0.8716    0.7366    0.7984      2866
B-general_frequency        0.7587    0.7920    0.7750      1548
B-house_place              0.9137    0.8814    0.8972      5765
B-ingredient               0.6176    0.0479    0.0890       876
B-joke_type                0.8029    0.7780    0.7903      1126
B-list_name                0.8563    0.7582    0.8043      6195
B-meal_type                0.5883    0.9148    0.7161      1948
B-media_type               0.9111    0.8302    0.8688     13301
B-movie_name               0.5312    0.3696    0.4359       230
B-movie_type               0.2829    0.3931    0.3290       290
B-music_album              0.0000    0.0000    0.0000       104
B-music_descriptor         0.2987    0.2500    0.2722       760
B-music_genre              0.7731    0.7953    0.7840      4821
B-news_topic               0.6437    0.6668    0.6551      5441
B-order_type               0.6739    0.8073    0.7346      2091
B-person                   0.8290    0.9138    0.8693     25490
B-personal_info            0.6177    0.6765    0.6458      1249
B-place_name               0.8696    0.8252    0.8468     30683
B-player_setting           0.6816    0.6156    0.6469      4048
B-playlist_name            0.5858    0.4923    0.5350      1942
B-podcast_descriptor       0.7211    0.5209    0.6049      2367
B-podcast_name             0.6930    0.5462    0.6109      2091
B-radio_name               0.7304    0.7598    0.7448      4126
B-relation                 0.7852    0.8708    0.8258      5689
B-song_name                0.6053    0.6909    0.6453      4131
B-sport_type               0.0000    0.0000    0.0000         0
B-time                     0.8729    0.7341    0.7975     17338
B-time_zone                0.7031    0.6352    0.6674      1428
B-timeofday                0.7367    0.8168    0.7747      4853
B-transport_agency         0.8161    0.7100    0.7594      1000
B-transport_descriptor     0.8333    0.1014    0.1807       148
B-transport_name           0.7979    0.3002    0.4363       513
B-transport_type           0.9332    0.8942    0.9133      5858
B-weather_descriptor       0.8501    0.7775    0.8122      8815
I-alarm_type               1.0000    0.1500    0.2609       120
I-app_name                 0.0000    0.0000    0.0000       101
I-artist_name              0.7325    0.8246    0.7758      3819
I-audiobook_author         0.8373    0.2800    0.4197       625
I-audiobook_name           0.6729    0.5007    0.5742      2227
I-business_name            0.7715    0.6326    0.6952      4265
I-business_type            0.5932    0.4661    0.5220      1004
I-change_amount            0.7876    0.8715    0.8274       817
I-coffee_type              0.9160    0.5755    0.7069       417
I-color_type               0.2781    0.1912    0.2266       272
I-cooking_type             0.0000    0.0000    0.0000        17
I-currency_name            0.8220    0.9327    0.8738      2005
I-date                     0.7864    0.8482    0.8161     15957
I-definition_word          0.8707    0.6955    0.7733      1859
I-device_type              0.8837    0.8907    0.8872      4172
I-drink_type               0.0000    0.0000    0.0000         4
I-email_address            0.9701    0.9794    0.9747      2911
I-email_folder             0.6881    0.9186    0.7868       221
I-event_name               0.6499    0.5553    0.5989     11745
I-food_type                0.6568    0.7825    0.7141      3306
I-game_name                0.8183    0.6053    0.6959      1652
I-general_frequency        0.8168    0.8338    0.8252      1625
I-house_place              0.9368    0.7289    0.8199      1302
I-ingredient               0.2857    0.0070    0.0137       285
I-joke_type                0.7941    0.6117    0.6910       309
I-list_name                0.7382    0.3989    0.5179      1993
I-meal_type                0.5699    0.8174    0.6716       334
I-media_type               0.8355    0.7420    0.7860      4450
I-movie_name               0.6250    0.1244    0.2075       201
I-movie_type               0.0190    0.0541    0.0281        74
I-music_album              0.0000    0.0000    0.0000        42
I-music_descriptor         0.4231    0.5631    0.4832       293
I-music_genre              0.6716    0.5455    0.6020      1087
I-news_topic               0.6384    0.4187    0.5057      3824
I-order_type               0.4850    0.8050    0.6053       523
I-person                   0.7844    0.8400    0.8113      8218
I-personal_info            0.7343    0.8363    0.7820       727
I-place_name               0.6855    0.6703    0.6778      8198
I-player_setting           0.5293    0.4982    0.5132      1361
I-playlist_name            0.4696    0.5182    0.4927      1729
I-podcast_descriptor       0.7448    0.4563    0.5659      2584
I-podcast_name             0.5923    0.5192    0.5534      1248
I-radio_name               0.7907    0.8404    0.8148      5766
I-relation                 0.4455    0.4873    0.4655       788
I-song_name                0.5495    0.7433    0.6319      3120
I-sport_type               0.0000    0.0000    0.0000         0
I-time                     0.8115    0.8288    0.8201     18118
I-time_zone                0.9708    0.1670    0.2850      1395
I-timeofday                0.7531    0.5734    0.6511      1287
I-transport_agency         0.6667    0.1961    0.3030        51
I-transport_descriptor     0.6429    0.0849    0.1500       106
I-transport_name           0.8469    0.5453    0.6634       497
I-transport_type           0.9556    0.7248    0.8243       505
I-weather_descriptor       0.7756    0.4046    0.5318      2887
O                          0.9502    0.9560    0.9531   1031927

accuracy                                       0.9031   1455270
macro avg                  0.6674    0.5860    0.5972   1455270
weighted avg               0.9032    0.9031    0.9013   1455270
```
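
The report above shows a large gap between the macro-averaged F1 (0.5972) and the weighted-averaged F1 (0.9013). The sketch below illustrates how the two averages are conventionally defined, using three rows copied from the table: the macro average is the unweighted mean over labels, while the weighted average weights each label by its support, so the very frequent `O` label dominates it. The function names and the choice of rows are illustrative assumptions; the numbers printed here cover only this three-row subset and are not the figures of the full report.

```python
from typing import List, Tuple

# (label, f1-score, support) rows copied from the evaluation table
Row = Tuple[str, float, int]

sample_rows: List[Row] = [
    ("B-date", 0.8946, 33909),
    ("B-drink_type", 0.0000, 127),
    ("O", 0.9531, 1031927),
]


def macro_average(rows: List[Row]) -> float:
    """Unweighted mean of the per-label scores: every label counts equally."""
    return sum(score for _, score, _ in rows) / len(rows)


def weighted_average(rows: List[Row]) -> float:
    """Mean of the per-label scores weighted by each label's support."""
    total_support = sum(support for _, _, support in rows)
    return sum(score * support for _, score, support in rows) / total_support


macro = macro_average(sample_rows)
weighted = weighted_average(sample_rows)
print(f"macro avg:    {macro:.4f}")
print(f"weighted avg: {weighted:.4f}")
```

On this subset the rare `B-drink_type` label pulls the macro average down sharply while barely affecting the weighted average, which mirrors the pattern in the full table: the macro figure is the more informative one for judging performance on infrequent slot types.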