Code-Mixed Language Detection using XLM-RoBERTa

Description

This model detects the languages in a code-mixed text, along with their boundaries, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from xlm-roberta-base.

Training Dataset

The training dataset is based on The Multilingual Amazon Reviews Corpus. The preprocessed dataset used to train, validate, and test this model can be found here.

Results

'DE': {'precision': 0.9870741390453328,
       'recall': 0.9883516686696866,
       'f1': 0.9877124907612713}
'EN': {'precision': 0.9901617633147289,
       'recall': 0.9914748508098892,
       'f1': 0.9908178720181748}
'ES': {'precision': 0.9912407007439404,
       'recall': 0.9912407007439404,
       'f1': 0.9912407007439406}
'FR': {'precision': 0.9872469872469872,
       'recall': 0.9871314927468414,
       'f1': 0.9871892366188945}

'overall_precision': 0.9888723454274744
'overall_recall': 0.9895702634880803
'overall_f1': 0.9892211813585232
'overall_accuracy': 0.9993651810717168
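
The per-language F1 values above are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The following is a minimal sketch of that relationship, assuming the standard formula; the helper name `f1_score` is illustrative and is not taken from the model's evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the 'DE' entry above.
de_f1 = f1_score(0.9870741390453328, 0.9883516686696866)
print(round(de_f1, 4))  # 0.9877
```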

Code

The code associated with the model can be found in this GitHub repository.

Usage

The model can be used as follows:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "msislam/code-mixed-language-detection-XLMRoberta"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForTokenClassification.from_pretrained(model_name)

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

# Skip the special tokens so that every position corresponds to a token of the text
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(**inputs).logits

# Highest-scoring label id for each token
labels_predicted = logits.argmax(-1)

# Map each label id to its language tag
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
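
Because the model emits one tag per token, language boundaries can be recovered by merging consecutive tokens that share a tag. The following is a minimal sketch of one way to do this, not part of the model's own code. The helper `group_language_spans` is hypothetical, and the tokens and tags in the example are hand-written for illustration: the actual tag strings come from `model.config.id2label` and may differ.

```python
from itertools import groupby


def group_language_spans(tokens, tags):
    """Merge consecutive tokens that share a tag into (tag, text) spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        pieces = "".join(token for token, _ in group)
        # SentencePiece marks the start of a word with U+2581
        spans.append((tag, pieces.replace("\u2581", " ").strip()))
    return spans


# Hand-written example; these tokens and tags are illustrative, not model output.
example_tokens = ["\u2581Hal", "a", "\u2581Madrid", "\u2581It", "\u2581means"]
example_tags = ["ES", "ES", "ES", "EN", "EN"]
print(group_language_spans(example_tokens, example_tags))
# [('ES', 'Hala Madrid'), ('EN', 'It means')]
```

With the usage snippet above, the token strings could be obtained with `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])` and passed in together with `lang_tag_predicted`.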

Limitations

The model may occasionally produce inconsistent predictions. The issues known so far are:

A short run of tokens (typically 1 or 2), or the tokens of a noun phrase, from one language may be missed when it is embedded in a longer sequence of another language.
Proper nouns and tokens shared across languages (e.g., in, me) may be mislabeled.
Predictions are also sensitive to punctuation.
