Overview

This model detects 45 languages. It was fine-tuned from the multilingual-e5-base model on the common_language dataset.
Overall accuracy on the test set is 98.37%; per-language evaluation results are shown below.

Download the model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('Mike0307/multilingual-e5-language-detection')
model = AutoModelForSequenceClassification.from_pretrained('Mike0307/multilingual-e5-language-detection', num_labels=45)

Example of language detection

import torch

languages = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China", "Chinese_Hongkong", 
    "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi", "Dutch", "English", 
    "Esperanto", "Estonian", "French", "Frisian", "Georgian", "German", "Greek", 
    "Hakha_Chin", "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle", 
    "Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mongolian", "Persian", "Polish", 
    "Portuguese", "Romanian", "Romansh_Sursilvan", "Russian", "Sakha", "Slovenian", 
    "Spanish", "Swedish", "Tamil", "Tatar", "Turkish", "Ukranian", "Welsh"
]

def predict(text, model, tokenizer, device = torch.device('cpu')):
    # Run the classifier in inference mode on the chosen device.
    model.to(device)
    model.eval()
    # Pad or truncate the input to a fixed length of 128 tokens.
    tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors="pt")
    input_ids = tokenized['input_ids']
    attention_mask = tokenized['attention_mask']
    with torch.no_grad():
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    # Convert logits to per-language probabilities; shape is (batch_size, 45).
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    return probabilities

def get_topk(probabilities, languages, k=3):
    topk_prob, topk_indices = torch.topk(probabilities, k)
    topk_prob = topk_prob.cpu().numpy()[0].tolist()
    topk_indices = topk_indices.cpu().numpy()[0].tolist()
    topk_labels = [languages[index] for index in topk_indices]
    return topk_prob, topk_labels

text = "你的測試句子"
probabilities = predict(text, model, tokenizer)
topk_prob, topk_labels = get_topk(probabilities, languages)
print(topk_prob, topk_labels)

# [0.999620258808, 0.00025940246996469, 2.7690215574693e-05]
# ['Chinese_Taiwan', 'Chinese_Hongkong', 'Chinese_China']
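
The top-k lists returned by get_topk can be post-processed in plain Python. The following is a minimal sketch, not part of the model's API: the helper name pick_language and the min_confidence threshold of 0.5 are hypothetical choices for illustration. It returns the best label only when its probability clears the threshold, which may be useful for rejecting short or ambiguous inputs. The sample values are copied from the example output above.

```python
def pick_language(topk_prob, topk_labels, min_confidence=0.5):
    """Return the top label, or None if its probability is below min_confidence.

    Hypothetical helper: the threshold value is an assumption, not a
    recommendation from the model card.
    """
    if not topk_prob:
        return None
    # torch.topk returns values in descending order by default,
    # so index 0 holds the most probable language.
    best_prob, best_label = topk_prob[0], topk_labels[0]
    return best_label if best_prob >= min_confidence else None


# Values copied from the example output in this model card.
topk_prob = [0.999620258808, 0.00025940246996469, 2.7690215574693e-05]
topk_labels = ['Chinese_Taiwan', 'Chinese_Hongkong', 'Chinese_China']

print(pick_language(topk_prob, topk_labels))  # Chinese_Taiwan
```

With a flatter distribution such as [0.4, 0.35, 0.25], the same call returns None, signalling that no language was detected with enough confidence.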

Evaluation Results

The results below were computed on the test split of the common_language dataset.

| index | language | precision | recall | f1-score | support |
|---|---|---|---|---|---|
| 0 | Arabic | 1.00 | 1.00 | 1.00 | 151 |
| 1 | Basque | 0.99 | 1.00 | 1.00 | 111 |
| 2 | Breton | 1.00 | 0.90 | 0.95 | 252 |
| 3 | Catalan | 0.96 | 0.99 | 0.97 | 96 |
| 4 | Chinese_China | 0.98 | 1.00 | 0.99 | 100 |
| 5 | Chinese_Hongkong | 0.97 | 0.87 | 0.92 | 115 |
| 6 | Chinese_Taiwan | 0.92 | 0.98 | 0.95 | 170 |
| 7 | Chuvash | 0.98 | 1.00 | 0.99 | 137 |
| 8 | Czech | 0.98 | 1.00 | 0.99 | 128 |
| 9 | Dhivehi | 1.00 | 1.00 | 1.00 | 111 |
| 10 | Dutch | 0.99 | 1.00 | 0.99 | 144 |
| 11 | English | 0.96 | 1.00 | 0.98 | 98 |
| 12 | Esperanto | 0.98 | 0.98 | 0.98 | 107 |
| 13 | Estonian | 1.00 | 0.99 | 0.99 | 93 |
| 14 | French | 0.95 | 1.00 | 0.98 | 106 |
| 15 | Frisian | 1.00 | 0.98 | 0.99 | 117 |
| 16 | Georgian | 1.00 | 1.00 | 1.00 | 110 |
| 17 | German | 1.00 | 1.00 | 1.00 | 101 |
| 18 | Greek | 1.00 | 1.00 | 1.00 | 153 |
| 19 | Hakha_Chin | 0.99 | 1.00 | 0.99 | 202 |
| 20 | Indonesian | 0.99 | 0.99 | 0.99 | 150 |
| 21 | Interlingua | 0.96 | 0.97 | 0.96 | 182 |
| 22 | Italian | 0.99 | 0.94 | 0.96 | 100 |
| 23 | Japanese | 1.00 | 1.00 | 1.00 | 144 |
| 24 | Kabyle | 1.00 | 0.96 | 0.98 | 156 |
| 25 | Kinyarwanda | 0.97 | 1.00 | 0.99 | 103 |
| 26 | Kyrgyz | 0.98 | 1.00 | 0.99 | 129 |
| 27 | Latvian | 0.98 | 0.98 | 0.98 | 171 |
| 28 | Maltese | 0.99 | 0.98 | 0.98 | 152 |
| 29 | Mongolian | 1.00 | 1.00 | 1.00 | 112 |
| 30 | Persian | 1.00 | 1.00 | 1.00 | 123 |
| 31 | Polish | 0.91 | 0.99 | 0.95 | 128 |
| 32 | Portuguese | 0.94 | 0.99 | 0.96 | 124 |
| 33 | Romanian | 1.00 | 1.00 | 1.00 | 152 |
| 34 | Romansh_Sursilvan | 0.99 | 0.95 | 0.97 | 106 |
| 35 | Russian | 0.99 | 0.99 | 0.99 | 100 |
| 36 | Sakha | 0.99 | 1.00 | 1.00 | 105 |
| 37 | Slovenian | 0.99 | 1.00 | 1.00 | 166 |
| 38 | Spanish | 0.96 | 0.95 | 0.95 | 94 |
| 39 | Swedish | 0.99 | 1.00 | 0.99 | 190 |
| 40 | Tamil | 1.00 | 1.00 | 1.00 | 135 |
| 41 | Tatar | 1.00 | 0.96 | 0.98 | 173 |
| 42 | Turkish | 1.00 | 1.00 | 1.00 | 137 |
| 43 | Ukranian | 0.99 | 1.00 | 1.00 | 126 |
| 44 | Welsh | 0.98 | 1.00 | 0.99 | 103 |
| | macro avg | 0.98 | 0.99 | 0.98 | 5963 |
| | weighted avg | 0.98 | 0.98 | 0.98 | 5963 |
| | overall accuracy | | | 0.9837 | 5963 |
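
As a reading aid for the f1-score column: F1 is the harmonic mean of precision and recall. The sketch below recomputes it for three rows of the table, assuming the standard definition F1 = 2PR / (P + R). The function name f1_score is illustrative. Because the table reports precision and recall rounded to two decimals, a recomputed value can differ from the published one in the last digit for other rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation table.
rows = {
    "Breton": (1.00, 0.90),
    "Chinese_Hongkong": (0.97, 0.87),
    "Polish": (0.91, 0.99),
}

for language, (p, r) in rows.items():
    print(f"{language}: {f1_score(p, r):.2f}")

# Breton: 0.95
# Chinese_Hongkong: 0.92
# Polish: 0.95
```

The Breton and Chinese_Hongkong rows show how a lower recall pulls F1 down even when precision is near perfect, while Polish shows the reverse case.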