---
datasets:
- msislam/marc-code-mixed-small
language:
- de
- en
- es
- fr
metrics:
- seqeval
widget:
- text: >-
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
    Hallo, Guten Tag! how are you?
- text: >-
    Sie sind gut. How about you? Comment va ta mère? And what about your
    school? Estoy aprendiendo español. Thanks.
---

# Code-Mixed Language Detection using XLM-RoBERTa

## Description

This model detects the languages in code-mixed text, along with their boundaries, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).

## Training Dataset

The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset used to train, validate, and test this model can be found [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).

## Results

Scores computed with seqeval (rounded to four decimal places):

| Language    | Precision | Recall | F1     |
|-------------|-----------|--------|--------|
| DE          | 0.9871    | 0.9884 | 0.9877 |
| EN          | 0.9902    | 0.9915 | 0.9908 |
| ES          | 0.9912    | 0.9912 | 0.9912 |
| FR          | 0.9872    | 0.9871 | 0.9872 |
| **Overall** | 0.9889    | 0.9896 | 0.9892 |

Overall token-level accuracy: 0.9994.

## Code

The code associated with the model can be found in this [GitHub repo](https://github.com/msishuvo/Language-Identification-in-Code-Mixed-Text-using-Large-Language-Model).

## Usage

The model can be used as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label id per sub-word token
labels_predicted = logits.argmax(-1)

# Map label ids to language tags
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
```

Each entry in `lang_tag_predicted` is the language tag of the corresponding sub-word token. A sketch for merging these token-level tags into contiguous language spans is given at the end of this card.

## Limitations

The model may sometimes behave inconsistently. Known issues so far include:

* If a small number of tokens (typically one or two), or the tokens of a noun phrase, from one language appear inside a sequence of another language, the model may fail to detect them.
* Proper nouns and some cross-lingual tokens (e.g., "in", "me") may be predicted incorrectly.
* Predictions are also sensitive to punctuation.
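
## Recovering Language Spans

Since the model tags individual sub-word tokens, recovering language boundaries takes a small post-processing step. The following is a minimal sketch, not part of the original codebase: the helper name `group_language_spans` is hypothetical, and the code assumes a fast tokenizer (for `return_offsets_mapping`) and that any BIO-style prefix in the label names can be stripped with `split("-")`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def group_language_spans(text, tokenizer, model):
    # Hypothetical helper: merge consecutive tokens that share a
    # predicted language into character-level spans.
    inputs = tokenizer(text, add_special_tokens=False,
                       return_offsets_mapping=True, return_tensors="pt")
    offsets = inputs.pop("offset_mapping")[0].tolist()

    with torch.no_grad():
        logits = model(**inputs).logits
    tags = [model.config.id2label[i.item()] for i in logits.argmax(-1)[0]]

    # Strip a BIO prefix ("B-"/"I-") if the label set uses one.
    langs = [tag.split("-")[-1] for tag in tags]

    spans = []  # each span: [language, start_char, end_char]
    for (start, end), lang in zip(offsets, langs):
        if spans and spans[-1][0] == lang:
            spans[-1][2] = end  # extend the current span
        else:
            spans.append([lang, start, end])  # open a new span
    return [(lang, text[s:e]) for lang, s, e in spans]

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

print(group_language_spans("Hala Madrid y nada más. It means Go Madrid and nothing more.",
                           tokenizer, model))
# Expected shape of the output (actual tags depend on the model):
# [('ES', 'Hala Madrid y nada más.'), ('EN', 'It means Go Madrid and nothing more.')]
```

Because the spans are built from the tokenizer's character offsets, the recovered text segments line up exactly with the input string, including any whitespace between merged tokens.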