Commit 04d0640 by qanastek (1 parent: 21e0ca7)

Update config.json

Files changed (3)
  1. README.md +62 -1
  2. config.json +2 -2
  3. test.py +7 -0
README.md CHANGED
@@ -128,6 +128,12 @@ license: cc-by-4.0
 
  1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France.
 
+ ## Model
+
+ XLM-Roberta : [https://huggingface.co/xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
+
+ Paper : [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf)
+
  ## Demo: How to use in HuggingFace Transformers Pipeline
 
  Requires [transformers](https://pypi.org/project/transformers/): ```pip install transformers```
@@ -145,13 +151,68 @@ print(res)
  Outputs:
 
  ```python
- [{'label': 'fr-FR', 'score': 0.9998375177383423}]
+ [{'label': 'he-IL', 'score': 0.9998375177383423}]
  ```
 
  ## Training data
 
  [MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.
 
+ ### Languages
+
+ The model can distinguish 51 languages:
+
+ - `Afrikaans - South Africa (af-ZA)`
+ - `Amharic - Ethiopia (am-ET)`
+ - `Arabic - Saudi Arabia (ar-SA)`
+ - `Azeri - Azerbaijan (az-AZ)`
+ - `Bengali - Bangladesh (bn-BD)`
+ - `Chinese - China (zh-CN)`
+ - `Chinese - Taiwan (zh-TW)`
+ - `Danish - Denmark (da-DK)`
+ - `German - Germany (de-DE)`
+ - `Greek - Greece (el-GR)`
+ - `English - United States (en-US)`
+ - `Spanish - Spain (es-ES)`
+ - `Farsi - Iran (fa-IR)`
+ - `Finnish - Finland (fi-FI)`
+ - `French - France (fr-FR)`
+ - `Hebrew - Israel (he-IL)`
+ - `Hungarian - Hungary (hu-HU)`
+ - `Armenian - Armenia (hy-AM)`
+ - `Indonesian - Indonesia (id-ID)`
+ - `Icelandic - Iceland (is-IS)`
+ - `Italian - Italy (it-IT)`
+ - `Japanese - Japan (ja-JP)`
+ - `Javanese - Indonesia (jv-ID)`
+ - `Georgian - Georgia (ka-GE)`
+ - `Khmer - Cambodia (km-KH)`
+ - `Korean - Korea (ko-KR)`
+ - `Latvian - Latvia (lv-LV)`
+ - `Mongolian - Mongolia (mn-MN)`
+ - `Malay - Malaysia (ms-MY)`
+ - `Burmese - Myanmar (my-MM)`
+ - `Norwegian - Norway (nb-NO)`
+ - `Dutch - Netherlands (nl-NL)`
+ - `Polish - Poland (pl-PL)`
+ - `Portuguese - Portugal (pt-PT)`
+ - `Romanian - Romania (ro-RO)`
+ - `Russian - Russia (ru-RU)`
+ - `Slovenian - Slovenia (sl-SL)`
+ - `Albanian - Albania (sq-AL)`
+ - `Swedish - Sweden (sv-SE)`
+ - `Swahili - Kenya (sw-KE)`
+ - `Hindi - India (hi-IN)`
+ - `Kannada - India (kn-IN)`
+ - `Malayalam - India (ml-IN)`
+ - `Tamil - India (ta-IN)`
+ - `Telugu - India (te-IN)`
+ - `Thai - Thailand (th-TH)`
+ - `Tagalog - Philippines (tl-PH)`
+ - `Turkish - Turkey (tr-TR)`
+ - `Urdu - Pakistan (ur-PK)`
+ - `Vietnamese - Vietnam (vi-VN)`
+ - `Welsh - United Kingdom (cy-GB)`
 
  ## Evaluation results
 
config.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8b5e717ed1222ea2d1da259d79d0f844cbc139a1e5ba25387bc8c2c640b20668
- size 2912
+ oid sha256:cc8e70262f68a7555aed1c9836f1226de164e611212f23703995b6515127935d
+ size 2626
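
Note that both sides of the config.json hunk are Git LFS pointer files rather than JSON: a `version` line, an `oid sha256:` line, and a `size` line in bytes. The following is a minimal sketch of parsing such a pointer and checking a downloaded blob against it, assuming the three-key layout shown in the hunk. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of Git LFS or this repository.

```python
import hashlib
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse 'key value' lines of a Git LFS pointer file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(blob: bytes, pointer: Dict[str, str]) -> bool:
    """Return True if blob has the size and SHA-256 digest the pointer records."""
    algo, _, digest = pointer["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    same_size = len(blob) == int(pointer["size"])
    same_hash = hashlib.sha256(blob).hexdigest() == digest
    return same_size and same_hash


# Pointer text copied from the '+' side of the hunk above.
new_pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:cc8e70262f68a7555aed1c9836f1226de164e611212f23703995b6515127935d\n"
    "size 2626\n"
)
print(new_pointer["size"])
```

The size drop from 2912 to 2626 bytes only says the real config.json got smaller; the pointer alone does not reveal which keys changed.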
test.py ADDED
@@ -0,0 +1,7 @@
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
+ model_name = 'qanastek/51-languages-classifier'
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+ classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)
+ res = classifier("פרק הבא בפודקאסט בבקשה")
+ print(res)
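
The script prints a list of dicts with `label` and `score` keys, as in the README's "Outputs" block. The following is a minimal sketch of post-processing that result, assuming it keeps that shape and that labels follow the `language-REGION` pattern used in the language list. `best_label` and `split_locale` are hypothetical helper names, not part of transformers or this repository.

```python
from typing import Dict, List, Tuple


def best_label(predictions: List[Dict[str, object]]) -> str:
    """Return the label with the highest score from a pipeline-style result."""
    return max(predictions, key=lambda p: p["score"])["label"]


def split_locale(label: str) -> Tuple[str, str]:
    """Split a locale label such as 'he-IL' into (language, region)."""
    language, _, region = label.partition("-")
    return language, region


# Sample result copied from the README hunk; no model download is needed here.
res = [{"label": "he-IL", "score": 0.9998375177383423}]
label = best_label(res)
language, region = split_locale(label)
print(label, language, region)
```

Working on the plain list keeps the helpers independent of the model, so they can be checked without loading the classifier.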