---
datasets:
- msislam/marc-code-mixed-small
language:
- de
- en
- es
- fr
metrics:
- seqeval
widget:
- text: >- 
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >- 
    Hallo, Guten Tag! how are you?
- text: >- 
    Sie sind gut. How about you? Comment va ta mère? And what about your school? Estoy aprendiendo español. Thanks.
---

# Code-Mixed Language Detection using XLM-RoBERTa

## Description
This model detects the languages in code-mixed text, along with their boundaries, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).

## Training Dataset
The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset used to train, validate, and test this model is available [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).

## Results

```python
{
    'DE': {'precision': 0.9870741390453328,
           'recall': 0.9883516686696866,
           'f1': 0.9877124907612713},
    'EN': {'precision': 0.9901617633147289,
           'recall': 0.9914748508098892,
           'f1': 0.9908178720181748},
    'ES': {'precision': 0.9912407007439404,
           'recall': 0.9912407007439404,
           'f1': 0.9912407007439406},
    'FR': {'precision': 0.9872469872469872,
           'recall': 0.9871314927468414,
           'f1': 0.9871892366188945},

    'overall_precision': 0.9888723454274744,
    'overall_recall': 0.9895702634880803,
    'overall_f1': 0.9892211813585232,
    'overall_accuracy': 0.9993651810717168,
}
```
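
As a quick sanity check on how the columns relate, the F1 values above can be recomputed from the reported precision and recall as their harmonic mean. The sketch below is illustrative only: the helper `f1_score` and the `reported` dictionary are names introduced here, not part of the model's code, and the numbers are copied from the results above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported f1) copied from the results above
reported = {
    "DE": (0.9870741390453328, 0.9883516686696866, 0.9877124907612713),
    "EN": (0.9901617633147289, 0.9914748508098892, 0.9908178720181748),
    "ES": (0.9912407007439404, 0.9912407007439404, 0.9912407007439406),
    "FR": (0.9872469872469872, 0.9871314927468414, 0.9871892366188945),
    "overall": (0.9888723454274744, 0.9895702634880803, 0.9892211813585232),
}

# Recompute F1 for every row and compare it with the reported value
recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(f"{name}: reported={f1:.6f} recomputed={recomputed[name]:.6f}")
```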

## Code

The code associated with the model is available in this [GitHub Repo](https://github.com/msishuvo/Language-Identification-in-Code-Mixed-Text-using-Large-Language-Model).

## Usage

The model can be used as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "msislam/code-mixed-language-detection-XLMRoberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

# Tokenize without special tokens so every prediction maps to a text token
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely label id for each token
labels_predicted = logits.argmax(-1)

# Map label ids to language tags
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
```
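
The snippet above yields one tag per subword token. To recover language boundaries, consecutive tokens with the same tag can be merged into spans. The following is a minimal sketch, not part of the model's own code: `group_language_spans` is a hypothetical helper, the example tokens and tags are made up for illustration, and it assumes the tags are plain language codes and that word starts are marked with the SentencePiece `▁` prefix (as XLM-RoBERTa tokenizers do). If the model's labels carry a prefix scheme such as `B-`/`I-`, strip it before grouping.

```python
from itertools import groupby

def group_language_spans(tokens, tags):
    """Merge runs of consecutive tokens that share a tag into (tag, text) spans."""
    spans = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        pieces = [token for token, _ in group]
        # "\u2581" is the SentencePiece word-boundary marker; turn it into spaces
        text = "".join(pieces).replace("\u2581", " ").strip()
        spans.append((tag, text))
    return spans

# Illustrative input only; real values would come from the tokenizer and model
tokens = ["\u2581Hallo", ",", "\u2581Guten", "\u2581Tag", "!",
          "\u2581how", "\u2581are", "\u2581you", "?"]
tags = ["DE", "DE", "DE", "DE", "DE", "EN", "EN", "EN", "EN"]

spans = group_language_spans(tokens, tags)
print(spans)
```

With real predictions, the token strings can be obtained with `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())` and paired with `lang_tag_predicted`.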

## Limitations
The model can occasionally produce inconsistent predictions. Known issues so far:
* Short insertions (typically 1 or 2 tokens), or the tokens of a noun phrase, from another language may not be detected when they are embedded in a sequence of one language.
* Proper nouns and tokens shared across languages (in, me, etc.) may be mislabeled.
* Predictions also depend on punctuation.