Update readme
README.md
metrics:
- seqeval
widget:
- text: >-
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
    Hallo, Guten Tag! how are you?
- text: >-
    Sie sind gut. How about you? Comment va ta mère? And what about your school? Estoy aprendiendo español. Thanks.
---

# Code-Mixed Language Detection using XLM-RoBERTa

## Description

This model detects languages and their boundaries in code-mixed text by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR).

## Training Dataset

The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset can be found [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).

## Results

```
'DE': {'precision': 0.9870741390453328, 'recall': 0.9883516686696866, 'f1': 0.9877124907612713}
'EN': {'precision': 0.9901617633147289, 'recall': 0.9914748508098892, 'f1': 0.9908178720181748}
'ES': {'precision': 0.9912407007439404, 'recall': 0.9912407007439404, 'f1': 0.9912407007439406}
'FR': {'precision': 0.9872469872469872, 'recall': 0.9871314927468414, 'f1': 0.9871892366188945}

'overall_precision': 0.9888723454274744
'overall_recall': 0.9895702634880803
'overall_f1': 0.9892211813585232
'overall_accuracy': 0.9993651810717168
```
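The overall scores are micro-averaged, so the overall F1 should be exactly the harmonic mean of the overall precision and recall. A quick sanity check, with the values copied from the block above:

```python
# Micro-averaged F1 is the harmonic mean of micro precision and recall.
p = 0.9888723454274744  # overall_precision
r = 0.9895702634880803  # overall_recall
f1 = 2 * p * r / (p + r)
print(f1)  # ≈ 0.98922..., matching the reported overall_f1
```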

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

labels_predicted = logits.argmax(-1)
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
```
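To recover language spans with their boundaries, consecutive tokens sharing a language tag can be merged. A minimal sketch — it assumes the model's `id2label` uses seqeval-style `B-`/`I-` tags (e.g. `B-ES`, `I-EN`, `O`), and the tokens/tags below are hypothetical; check `model.config.id2label` for the actual label set:

```python
def group_language_spans(tokens, tags):
    """Merge consecutive tokens with the same language tag into spans.

    Assumes seqeval-style tags ('B-ES', 'I-ES', ..., 'O'); the real tag
    set comes from model.config.id2label.
    """
    spans = []  # list of (language, [tokens]) pairs
    for token, tag in zip(tokens, tags):
        if tag == "O":  # token not assigned to a supported language
            continue
        lang = tag.split("-")[-1]
        # Extend the current span unless a 'B-' tag starts a new one
        if spans and spans[-1][0] == lang and not tag.startswith("B-"):
            spans[-1][1].append(token)
        else:
            spans.append((lang, [token]))
    return [(lang, " ".join(toks)) for lang, toks in spans]

# Hypothetical tokens/tags for the example sentence above:
tokens = ["Hala", "Madrid", "y", "nada", "más", "It", "means", "Go", "Madrid"]
tags = ["B-ES", "I-ES", "I-ES", "I-ES", "I-ES", "B-EN", "I-EN", "I-EN", "I-EN"]
print(group_language_spans(tokens, tags))
# [('ES', 'Hala Madrid y nada más'), ('EN', 'It means Go Madrid')]
```

With real model output, `tokenizer.convert_ids_to_tokens` gives the token strings to pair with `lang_tag_predicted`.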