---
datasets:
- msislam/marc-code-mixed-small
language:
- de
- en
- es
- fr
metrics:
- seqeval
widget:
- text: >-
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
    Hallo, Guten Tag! how are you?
- text: >-
    Sie sind gut. How about you? Comment va ta mère? And what about your
    school? Estoy aprendiendo español. Thanks.
---

# Code-Mixed Language Detection using XLM-RoBERTa

## Description

This model detects the languages in code-mixed text, along with their boundaries, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).

## Training Dataset

The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset used to train, validate, and test this model can be found [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).

## Results

Scores computed with seqeval (rounded to four decimal places):

| Language    | Precision | Recall | F1     |
|-------------|-----------|--------|--------|
| DE          | 0.9871    | 0.9884 | 0.9877 |
| EN          | 0.9902    | 0.9915 | 0.9908 |
| ES          | 0.9912    | 0.9912 | 0.9912 |
| FR          | 0.9872    | 0.9871 | 0.9872 |
| **Overall** | 0.9889    | 0.9896 | 0.9892 |

Overall token-level accuracy: 0.9994.

## Code

The code associated with the model can be found in this [GitHub repo](https://github.com/msishuvo/Language-Identification-in-Code-Mixed-Text-using-Large-Language-Model).

## Usage

The model can be used as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label id per sub-word token
labels_predicted = logits.argmax(-1)

# Map label ids to language tags
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
```

Each entry in `lang_tag_predicted` is the language tag of the corresponding sub-word token. A sketch for merging these token-level tags into contiguous language spans is given at the end of this card.

## Limitations

The model may sometimes behave inconsistently. Known issues so far include:

* If a small number of tokens (typically one or two), or the tokens of a noun phrase, from one language appear inside a sequence of another language, the model may fail to detect them.
* Proper nouns and some cross-lingual tokens (e.g., "in", "me") may be predicted incorrectly.
* Predictions are also sensitive to punctuation.
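
## Recovering Language Spans

Since the model tags individual sub-word tokens, recovering language boundaries takes a small post-processing step. The following is a minimal sketch, not part of the original codebase: the helper name `group_language_spans` is hypothetical, and the code assumes a fast tokenizer (for `return_offsets_mapping`) and that any BIO-style prefix in the label names can be stripped with `split("-")`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def group_language_spans(text, tokenizer, model):
    # Hypothetical helper: merge consecutive tokens that share a
    # predicted language into character-level spans.
    inputs = tokenizer(text, add_special_tokens=False,
                       return_offsets_mapping=True, return_tensors="pt")
    offsets = inputs.pop("offset_mapping")[0].tolist()

    with torch.no_grad():
        logits = model(**inputs).logits
    tags = [model.config.id2label[i.item()] for i in logits.argmax(-1)[0]]

    # Strip a BIO prefix ("B-"/"I-") if the label set uses one.
    langs = [tag.split("-")[-1] for tag in tags]

    spans = []  # each span: [language, start_char, end_char]
    for (start, end), lang in zip(offsets, langs):
        if spans and spans[-1][0] == lang:
            spans[-1][2] = end  # extend the current span
        else:
            spans.append([lang, start, end])  # open a new span
    return [(lang, text[s:e]) for lang, s, e in spans]

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

print(group_language_spans("Hala Madrid y nada más. It means Go Madrid and nothing more.",
                           tokenizer, model))
# Expected shape of the output (actual tags depend on the model):
# [('ES', 'Hala Madrid y nada más.'), ('EN', 'It means Go Madrid and nothing more.')]
```

Because the spans are built from the tokenizer's character offsets, the recovered text segments line up exactly with the input string, including any whitespace between merged tokens.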