msislam committed on
Commit 10dab53
1 Parent(s): 50f5394

Update readme

Files changed (1)
  1. README.md +47 -3
README.md CHANGED
@@ -9,10 +9,54 @@ language:
  metrics:
  - seqeval
  widget:
+ - text: >-
+     Hala Madrid y nada más. It means Go Madrid and nothing more.
  - text: >-
      Hallo, Guten Tag! how are you?
  - text: >-
      Sie sind gut. How about you? Comment va ta mère? And what about your school? Estoy aprendiendo español. Thanks.
- - text: >-
-     Hala Madrid y nada más. It means Hala Madrid and nothing more.
- ---
+ ---
+ 
+ # Code-Mixed Language Detection using XLM-RoBERTa
+ 
+ ## Description
+ This model detects languages and their boundaries in code-mixed text by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR).
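As a quick illustration of what "boundaries" means here, the snippet below is a minimal sketch (not from the original card) that runs the model through the `transformers` token-classification pipeline; `aggregation_strategy="simple"` merges adjacent tokens that share a tag into labelled spans:

```python
from transformers import pipeline

# Sketch: token-classification pipeline over the model; "simple" aggregation
# merges neighbouring tokens with the same tag into labelled spans.
lang_tagger = pipeline(
    "token-classification",
    model="msislam/code-mixed-language-detection-XLMRoberta",
    aggregation_strategy="simple",
)

# Each result dict carries entity_group, score, word, and start/end character offsets.
print(lang_tagger("Sie sind gut. How about you? Comment va ta mère?"))
```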
+ 
+ ## Training Dataset
+ The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset can be found [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).
+ 
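The preprocessed dataset should be loadable with the `datasets` library; this is a minimal sketch, assuming the repository loads with its default configuration:

```python
from datasets import load_dataset

# Sketch: pull the preprocessed code-mixed corpus (default configuration assumed).
dataset = load_dataset("msislam/marc-code-mixed-small")
print(dataset)
```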
+ ## Results
+ 
+ ```
+ 'DE': {'precision': 0.9870741390453328, 'recall': 0.9883516686696866, 'f1': 0.9877124907612713}
+ 'EN': {'precision': 0.9901617633147289, 'recall': 0.9914748508098892, 'f1': 0.9908178720181748}
+ 'ES': {'precision': 0.9912407007439404, 'recall': 0.9912407007439404, 'f1': 0.9912407007439406}
+ 'FR': {'precision': 0.9872469872469872, 'recall': 0.9871314927468414, 'f1': 0.9871892366188945}
+ 
+ 'overall_precision': 0.9888723454274744
+ 'overall_recall': 0.9895702634880803
+ 'overall_f1': 0.9892211813585232
+ 'overall_accuracy': 0.9993651810717168
+ ```
+ 
+ ## Usage
+ 
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ 
+ tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
+ model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
+ 
+ text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'
+ 
+ # Tokenize without special tokens so every prediction maps to a real token
+ inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")
+ 
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ 
+ # Pick the most likely language tag for each token
+ labels_predicted = logits.argmax(-1)
+ 
+ lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
+ print(lang_tag_predicted)
+ ```
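The tags above are per tokenizer token. To turn them into language spans with character boundaries, one option is to re-tokenize with offset mapping and merge consecutive tokens that share a tag. This is a hedged sketch building on the snippet above (it reuses `text`, `tokenizer`, and `lang_tag_predicted`) and assumes the tags are either plain language codes such as `DE` or IOB-style such as `B-DE`:

```python
# Sketch: merge consecutive tokens that share a language tag into character spans.
# Reuses `text`, `tokenizer`, and `lang_tag_predicted` from the usage example above.
encoding = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)

def base_tag(tag):
    # "B-DE" / "I-DE" -> "DE"; a plain "DE" stays "DE" (label scheme is an assumption)
    return tag.split("-", 1)[-1]

spans = []
for (start, end), tag in zip(encoding["offset_mapping"], lang_tag_predicted):
    lang = base_tag(tag)
    if spans and spans[-1][0] == lang:
        spans[-1] = (lang, spans[-1][1], end)  # extend the current span
    else:
        spans.append((lang, start, end))       # open a new span

for lang, start, end in spans:
    print(f"{lang}: {text[start:end]!r}")
```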