Update readme
README.md
metrics:
- seqeval
widget:
- text: >-
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
    Hallo, Guten Tag! how are you?
- text: >-
    Sie sind gut. How about you? Comment va ta mère? And what about your school? Estoy aprendiendo español. Thanks.
---

# Code-Mixed Language Detection using XLM-RoBERTa

## Description

This model detects languages and their boundaries in code-mixed text by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR).

## Training Dataset

The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset can be found [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).

## Results

```
'DE': {'precision': 0.9870741390453328, 'recall': 0.9883516686696866, 'f1': 0.9877124907612713}
'EN': {'precision': 0.9901617633147289, 'recall': 0.9914748508098892, 'f1': 0.9908178720181748}
'ES': {'precision': 0.9912407007439404, 'recall': 0.9912407007439404, 'f1': 0.9912407007439406}
'FR': {'precision': 0.9872469872469872, 'recall': 0.9871314927468414, 'f1': 0.9871892366188945}

'overall_precision': 0.9888723454274744
'overall_recall': 0.9895702634880803
'overall_f1': 0.9892211813585232
'overall_accuracy': 0.9993651810717168
```
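The overall scores are micro-averaged, so the overall F1 should be exactly the harmonic mean of the overall precision and recall. A quick sanity check, with the values copied from the block above:

```python
# Micro-averaged F1 is the harmonic mean of micro precision and recall.
p = 0.9888723454274744  # overall_precision
r = 0.9895702634880803  # overall_recall
f1 = 2 * p * r / (p + r)
print(f1)  # ≈ 0.98922..., matching the reported overall_f1
```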

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

labels_predicted = logits.argmax(-1)
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
```
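To recover language spans with their boundaries, consecutive tokens sharing a language tag can be merged. A minimal sketch — it assumes the model's `id2label` uses seqeval-style `B-`/`I-` tags (e.g. `B-ES`, `I-EN`, `O`), and the tokens/tags below are hypothetical; check `model.config.id2label` for the actual label set:

```python
def group_language_spans(tokens, tags):
    """Merge consecutive tokens with the same language tag into spans.

    Assumes seqeval-style tags ('B-ES', 'I-ES', ..., 'O'); the real tag
    set comes from model.config.id2label.
    """
    spans = []  # list of (language, [tokens]) pairs
    for token, tag in zip(tokens, tags):
        if tag == "O":  # token not assigned to a supported language
            continue
        lang = tag.split("-")[-1]
        # Extend the current span unless a 'B-' tag starts a new one
        if spans and spans[-1][0] == lang and not tag.startswith("B-"):
            spans[-1][1].append(token)
        else:
            spans.append((lang, [token]))
    return [(lang, " ".join(toks)) for lang, toks in spans]

# Hypothetical tokens/tags for the example sentence above:
tokens = ["Hala", "Madrid", "y", "nada", "más", "It", "means", "Go", "Madrid"]
tags = ["B-ES", "I-ES", "I-ES", "I-ES", "I-ES", "B-EN", "I-EN", "I-EN", "I-EN"]
print(group_language_spans(tokens, tags))
# [('ES', 'Hala Madrid y nada más'), ('EN', 'It means Go Madrid')]
```

With real model output, `tokenizer.convert_ids_to_tokens` gives the token strings to pair with `lang_tag_predicted`.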