---
license: mit
language:
- multilingual
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: token-classification
---
# Multilingual Identification of English Code-Switching
AnE-LID (Any-English Code-Switching Language Identification) is a token-level model for detecting English code-switching in multilingual texts. It classifies words into four classes: `English`, `notEnglish`, `Mixed`, and `Other`. The model shows strong performance on both languages seen and unseen in the training data.
# Usage
You can use AnE-LID with Hugging Face's `pipeline` or `AutoModelForTokenClassification`.
Let's try the following example, taken from [this paper](https://aclanthology.org/2023.calcs-1.1/):
```python
input = "ich glaub ich muss echt rewatchen like i feel so empty was soll ich denn jetzt machen?"
```
## Pipeline
```python
from transformers import pipeline
classifier = pipeline("token-classification", model="igorsterner/AnE-LID", aggregation_strategy="simple")
result = classifier(input)
```
which returns
```
[{'entity_group': 'notEnglish',
'score': 0.9999998,
'word': 'ich glaub ich muss echt',
'start': 0,
'end': 23},
{'entity_group': 'Mixed',
'score': 0.9999941,
'word': 'rewatchen',
'start': 24,
'end': 33},
{'entity_group': 'English',
'score': 0.99999154,
'word': 'like i feel so empty',
'start': 34,
'end': 54},
{'entity_group': 'notEnglish',
'score': 0.9292571,
'word': 'was soll ich denn jetzt machen?',
'start': 55,
'end': 86}]
```
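Each returned span carries character offsets into the input string, so the English and mixed material can be pulled out directly. The following is a minimal sketch that assumes the output shown above; the helper name `spans_with_label` is hypothetical and not part of the model or of `transformers`. The result is hard-coded here so the snippet runs without downloading the model.

```python
# Pipeline output copied from the example above (hard-coded for illustration).
text = "ich glaub ich muss echt rewatchen like i feel so empty was soll ich denn jetzt machen?"
result = [
    {"entity_group": "notEnglish", "score": 0.9999998, "word": "ich glaub ich muss echt", "start": 0, "end": 23},
    {"entity_group": "Mixed", "score": 0.9999941, "word": "rewatchen", "start": 24, "end": 33},
    {"entity_group": "English", "score": 0.99999154, "word": "like i feel so empty", "start": 34, "end": 54},
    {"entity_group": "notEnglish", "score": 0.9292571, "word": "was soll ich denn jetzt machen?", "start": 55, "end": 86},
]


def spans_with_label(text, result, labels):
    """Return (label, substring) pairs for every span whose label is in `labels`.

    Hypothetical helper: slices the original text with the start/end offsets
    instead of relying on the detokenized 'word' field.
    """
    return [
        (span["entity_group"], text[span["start"]:span["end"]])
        for span in result
        if span["entity_group"] in labels
    ]


english_spans = spans_with_label(text, result, {"English", "Mixed"})
print(english_spans)
```

Slicing with `start` and `end` keeps the original casing and spacing of the input, which may differ from the `word` field after subword detokenization.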
## Advanced
If your input is already word-tokenized and you want one language label per word, you can try the following strategy, which takes a majority vote over the subword predictions of each word:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

lid_model_name = "igorsterner/AnE-LID"
lid_tokenizer = AutoTokenizer.from_pretrained(lid_model_name)
lid_model = AutoModelForTokenClassification.from_pretrained(lid_model_name)

word_tokens = ['ich', 'glaub', 'ich', 'muss', 'echt', 'rewatchen', 'like', 'i', 'feel', 'so', 'empty', 'was', 'soll', 'ich', 'denn', 'jetzt', 'machen', '?']

# Tokenize the pre-split words into subwords
subword_inputs = lid_tokenizer(
    word_tokens, truncation=True, is_split_into_words=True, return_tensors="pt"
)
# Map each subword position to its word index (None for special tokens)
subword2word = subword_inputs.word_ids(batch_index=0)

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = lid_model(**subword_inputs).logits
predictions = torch.argmax(logits, dim=2)
predicted_subword_labels = [lid_model.config.id2label[t.item()] for t in predictions[0]]

# Collect the subword labels belonging to each word
predicted_word_labels = [[] for _ in range(len(word_tokens))]
for idx, predicted_subword in enumerate(predicted_subword_labels):
    if subword2word[idx] is not None:
        predicted_word_labels[subword2word[idx]].append(predicted_subword)


def most_frequent(lst):
    # Majority vote; words with no subwords (e.g. truncated) fall back to "Other"
    return max(set(lst), key=lst.count) if lst else "Other"


predicted_word_labels = [most_frequent(sublist) for sublist in predicted_word_labels]

for token, label in zip(word_tokens, predicted_word_labels):
    print(f"{token}: {label}")
```
which returns
```
ich: notEnglish
glaub: notEnglish
ich: notEnglish
muss: notEnglish
echt: notEnglish
rewatchen: Mixed
like: English
i: English
feel: English
so: English
empty: English
was: notEnglish
soll: notEnglish
ich: notEnglish
denn: notEnglish
jetzt: notEnglish
machen: notEnglish
?: Other
```
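If you need contiguous spans rather than per-word labels, consecutive words with the same label can be merged. The following is a minimal sketch that assumes the word labels shown above; the helper name `group_by_label` is hypothetical, and the labels are hard-coded so the snippet runs without the model.

```python
from itertools import groupby

# Word tokens and labels copied from the output above (hard-coded for illustration).
word_tokens = ['ich', 'glaub', 'ich', 'muss', 'echt', 'rewatchen', 'like', 'i', 'feel',
               'so', 'empty', 'was', 'soll', 'ich', 'denn', 'jetzt', 'machen', '?']
predicted_word_labels = (
    ['notEnglish'] * 5 + ['Mixed'] + ['English'] * 5 + ['notEnglish'] * 6 + ['Other']
)


def group_by_label(tokens, labels):
    """Merge runs of adjacent tokens that share a label into (label, text) spans."""
    spans = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        spans.append((label, " ".join(token for token, _ in group)))
    return spans


spans = group_by_label(word_tokens, predicted_word_labels)
for label, span_text in spans:
    print(f"{label}: {span_text}")
```

Because `groupby` only merges adjacent items, the two `notEnglish` stretches stay separate spans, which preserves the switch points.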
# Named entities
If you also want to tag named entities, you can run [AnE-NER](https://huggingface.co/igorsterner/ane-lid) as well. Check out my evaluation scripts for examples of using both at the same time, as we did in the paper: [https://github.com/igorsterner/AnE/tree/main/eval](https://github.com/igorsterner/AnE/tree/main/eval).
# Citation
Please consider citing my work if it helped you:
```
@inproceedings{sterner-2024-multilingual,
title = "Multilingual Identification of {E}nglish Code-Switching",
author = "Sterner, Igor",
editor = {Scherrer, Yves and
Jauhiainen, Tommi and
Ljube{\v{s}}i{\'c}, Nikola and
Zampieri, Marcos and
Nakov, Preslav and
Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.vardial-1.14",
doi = "10.18653/v1/2024.vardial-1.14",
pages = "163--173",
abstract = "Code-switching research depends on fine-grained language identification. In this work, we study existing corpora used to train token-level language identification systems. We aggregate these corpora with a consistent labelling scheme and train a system to identify English code-switching in multilingual text. We show that the system identifies code-switching in unseen language pairs with absolute measure 2.3-4.6{\%} better than language-pair-specific SoTA. We also analyse the correlation between typological similarity of the languages and difficulty in recognizing code-switching.",
}
```