qanastek/51-languages-classifier

@@ -128,6 +128,12 @@ license: cc-by-4.0
 1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.
 ## Demo: How to use in HuggingFace Transformers Pipeline
 Requires [transformers](https://pypi.org/project/transformers/): ```pip install transformers```
@@ -145,13 +151,68 @@ print(res)
 Outputs:
 ```python
-[{'label': 'fr-FR', 'score': 0.9998375177383423}]
 ```
 ## Training data
 [MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.
 ## Evaluation results

 1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.
+## Model
+XLM-RoBERTa: [https://huggingface.co/xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
+Paper: [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf)
 ## Demo: How to use in HuggingFace Transformers Pipeline
 Requires [transformers](https://pypi.org/project/transformers/): ```pip install transformers```
 Outputs:
 ```python
+[{'label': 'he-IL', 'score': 0.9998375177383423}]
 ```
 ## Training data
 [MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.
+### Languages
+The model can distinguish 51 languages:
+- `Afrikaans - South Africa (af-ZA)`
+- `Amharic - Ethiopia (am-ET)`
+- `Arabic - Saudi Arabia (ar-SA)`
+- `Azeri - Azerbaijan (az-AZ)`
+- `Bengali - Bangladesh (bn-BD)`
+- `Chinese - China (zh-CN)`
+- `Chinese - Taiwan (zh-TW)`
+- `Danish - Denmark (da-DK)`
+- `German - Germany (de-DE)`
+- `Greek - Greece (el-GR)`
+- `English - United States (en-US)`
+- `Spanish - Spain (es-ES)`
+- `Farsi - Iran (fa-IR)`
+- `Finnish - Finland (fi-FI)`
+- `French - France (fr-FR)`
+- `Hebrew - Israel (he-IL)`
+- `Hungarian - Hungary (hu-HU)`
+- `Armenian - Armenia (hy-AM)`
+- `Indonesian - Indonesia (id-ID)`
+- `Icelandic - Iceland (is-IS)`
+- `Italian - Italy (it-IT)`
+- `Japanese - Japan (ja-JP)`
+- `Javanese - Indonesia (jv-ID)`
+- `Georgian - Georgia (ka-GE)`
+- `Khmer - Cambodia (km-KH)`
+- `Korean - Korea (ko-KR)`
+- `Latvian - Latvia (lv-LV)`
+- `Mongolian - Mongolia (mn-MN)`
+- `Malay - Malaysia (ms-MY)`
+- `Burmese - Myanmar (my-MM)`
+- `Norwegian - Norway (nb-NO)`
+- `Dutch - Netherlands (nl-NL)`
+- `Polish - Poland (pl-PL)`
+- `Portuguese - Portugal (pt-PT)`
+- `Romanian - Romania (ro-RO)`
+- `Russian - Russia (ru-RU)`
+- `Slovenian - Slovenia (sl-SL)`
+- `Albanian - Albania (sq-AL)`
+- `Swedish - Sweden (sv-SE)`
+- `Swahili - Kenya (sw-KE)`
+- `Hindi - India (hi-IN)`
+- `Kannada - India (kn-IN)`
+- `Malayalam - India (ml-IN)`
+- `Tamil - India (ta-IN)`
+- `Telugu - India (te-IN)`
+- `Thai - Thailand (th-TH)`
+- `Tagalog - Philippines (tl-PH)`
+- `Turkish - Turkey (tr-TR)`
+- `Urdu - Pakistan (ur-PK)`
+- `Vietnamese - Vietnam (vi-VN)`
+- `Welsh - United Kingdom (cy-GB)`
 ## Evaluation results

config.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8b5e717ed1222ea2d1da259d79d0f844cbc139a1e5ba25387bc8c2c640b20668
-size 2912

 version https://git-lfs.github.com/spec/v1
+oid sha256:cc8e70262f68a7555aed1c9836f1226de164e611212f23703995b6515127935d
+size 2626

test.py ADDED

@@ -0,0 +1,7 @@

+from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
+model_name = 'qanastek/51-languages-classifier'
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)
+res = classifier("פרק הבא בפודקאסט בבקשה")
+print(res)
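
The script above prints a list of `{'label': ..., 'score': ...}` dictionaries, where the label is one of the locale codes listed in the README (for example `he-IL`). A minimal sketch of how that output could be turned into a readable language name is shown below. It is not part of this repository: the helper `describe_prediction`, the `min_score` threshold, and the four-entry `LABEL_NAMES` subset are assumptions made for illustration, and the sample input is the output shown in the README rather than a fresh model call.

```python
# Hypothetical post-processing helper; not part of the repository.

# Small subset of the 51 labels listed in the README; extend as needed.
LABEL_NAMES = {
    "en-US": "English - United States",
    "fr-FR": "French - France",
    "he-IL": "Hebrew - Israel",
    "sl-SL": "Slovenian - Slovenia",
}


def describe_prediction(predictions, min_score=0.5):
    """Return a readable name for the highest-scoring prediction.

    `predictions` is a list of {'label': str, 'score': float} dicts, the
    shape printed by test.py. `min_score` is an assumed threshold below
    which the result is reported as "uncertain".
    """
    if not predictions:
        raise ValueError("predictions is empty")
    best = max(predictions, key=lambda p: p["score"])
    if best["score"] < min_score:
        return "uncertain"
    # Fall back to the raw label when the code is not in the subset above.
    return LABEL_NAMES.get(best["label"], best["label"])


# Sample taken from the README's "Outputs" block, not from a live model call.
sample = [{"label": "he-IL", "score": 0.9998375177383423}]
print(describe_prediction(sample))
```

Keeping the mapping in a plain dictionary means the helper runs without downloading the model, so it can be checked separately from the classifier itself.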