---
license: apache-2.0
datasets:
- common_language
language:
- ar
- eu
- br
- ca
- zh
- cv
- cs
- nl
- en
- eo
- et
- fr
- ka
- de
- el
- id
- ia
- it
- ja
- rw
- ky
- lv
- mt
- mn
- fa
- pl
- pt
- ro
- rm
- ru
- sl
- es
- sv
- ta
- tt
- tr
- uk
- cy
metrics:
- accuracy
- precision
- recall
- f1
tags:
- language-detection
- Frisian
- Dhivehi
- Hakha_Chin
- Kabyle
- Sakha
---

### Overview

This model supports the detection of **45** languages. It is fine-tuned from the **multilingual-e5-base** model on the **common_language** dataset.
The overall accuracy is **98.37%**; more evaluation results are shown below.

### Download the model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('Mike0307/multilingual-e5-language-detection')
model = AutoModelForSequenceClassification.from_pretrained('Mike0307/multilingual-e5-language-detection', num_labels=45)
```

### Example of language detection

```python
import torch

languages = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China", "Chinese_Hongkong",
    "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi", "Dutch", "English",
    "Esperanto", "Estonian", "French", "Frisian", "Georgian", "German", "Greek",
    "Hakha_Chin", "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle",
    "Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mongolian", "Persian",
    "Polish", "Portuguese", "Romanian", "Romansh_Sursilvan", "Russian", "Sakha",
    "Slovenian", "Spanish", "Swedish", "Tamil", "Tatar", "Turkish", "Ukranian",
    "Welsh"
]

def predict(text, model, tokenizer, device=torch.device('cpu')):
    model.to(device)
    model.eval()
    tokenized = tokenizer(text, padding='max_length', truncation=True,
                          max_length=128, return_tensors="pt")
    input_ids = tokenized['input_ids']
    attention_mask = tokenized['attention_mask']

    with torch.no_grad():
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    return probabilities

def get_topk(probabilities, languages, k=3):
    topk_prob, topk_indices = torch.topk(probabilities, k)
    topk_prob = topk_prob.cpu().numpy()[0].tolist()
    topk_indices = topk_indices.cpu().numpy()[0].tolist()
    topk_labels = [languages[index] for index in topk_indices]
    return topk_prob, topk_labels

text = "你的測試句子"
probabilities = predict(text, model, tokenizer)
topk_prob, topk_labels = get_topk(probabilities, languages)

print(topk_prob, topk_labels)
# [0.999620258808, 0.00025940246996469, 2.7690215574693e-05]
# ['Chinese_Taiwan', 'Chinese_Hongkong', 'Chinese_China']
```
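The `predict` helper above scores one sentence at a time. For labeling many sentences at once, the same tokenizer and model can be applied to a list of strings, as in the minimal sketch below (the `predict_batch` helper and its dynamic-padding choice are illustrative additions, not part of the original example).

```python
def predict_batch(texts, model, tokenizer, device=torch.device('cpu')):
    model.to(device)
    model.eval()
    # Tokenize the whole list at once; dynamic padding keeps the batch compact.
    tokenized = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**tokenized).logits
    # Return the top-1 language label for each input sentence.
    return [languages[i] for i in logits.argmax(dim=-1).cpu().tolist()]

print(predict_batch(["Bonjour tout le monde", "Guten Morgen"], model, tokenizer))
```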
### Evaluation Results

The test data is the test split of the **common_language** dataset.

| index | language | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- | --- |
| 0 | Arabic | 1.00 | 1.00 | 1.00 | 151 |
| 1 | Basque | 0.99 | 1.00 | 1.00 | 111 |
| 2 | Breton | 1.00 | 0.90 | 0.95 | 252 |
| 3 | Catalan | 0.96 | 0.99 | 0.97 | 96 |
| 4 | Chinese_China | 0.98 | 1.00 | 0.99 | 100 |
| 5 | Chinese_Hongkong | 0.97 | 0.87 | 0.92 | 115 |
| 6 | Chinese_Taiwan | 0.92 | 0.98 | 0.95 | 170 |
| 7 | Chuvash | 0.98 | 1.00 | 0.99 | 137 |
| 8 | Czech | 0.98 | 1.00 | 0.99 | 128 |
| 9 | Dhivehi | 1.00 | 1.00 | 1.00 | 111 |
| 10 | Dutch | 0.99 | 1.00 | 0.99 | 144 |
| 11 | English | 0.96 | 1.00 | 0.98 | 98 |
| 12 | Esperanto | 0.98 | 0.98 | 0.98 | 107 |
| 13 | Estonian | 1.00 | 0.99 | 0.99 | 93 |
| 14 | French | 0.95 | 1.00 | 0.98 | 106 |
| 15 | Frisian | 1.00 | 0.98 | 0.99 | 117 |
| 16 | Georgian | 1.00 | 1.00 | 1.00 | 110 |
| 17 | German | 1.00 | 1.00 | 1.00 | 101 |
| 18 | Greek | 1.00 | 1.00 | 1.00 | 153 |
| 19 | Hakha_Chin | 0.99 | 1.00 | 0.99 | 202 |
| 20 | Indonesian | 0.99 | 0.99 | 0.99 | 150 |
| 21 | Interlingua | 0.96 | 0.97 | 0.96 | 182 |
| 22 | Italian | 0.99 | 0.94 | 0.96 | 100 |
| 23 | Japanese | 1.00 | 1.00 | 1.00 | 144 |
| 24 | Kabyle | 1.00 | 0.96 | 0.98 | 156 |
| 25 | Kinyarwanda | 0.97 | 1.00 | 0.99 | 103 |
| 26 | Kyrgyz | 0.98 | 1.00 | 0.99 | 129 |
| 27 | Latvian | 0.98 | 0.98 | 0.98 | 171 |
| 28 | Maltese | 0.99 | 0.98 | 0.98 | 152 |
| 29 | Mongolian | 1.00 | 1.00 | 1.00 | 112 |
| 30 | Persian | 1.00 | 1.00 | 1.00 | 123 |
| 31 | Polish | 0.91 | 0.99 | 0.95 | 128 |
| 32 | Portuguese | 0.94 | 0.99 | 0.96 | 124 |
| 33 | Romanian | 1.00 | 1.00 | 1.00 | 152 |
| 34 | Romansh_Sursilvan | 0.99 | 0.95 | 0.97 | 106 |
| 35 | Russian | 0.99 | 0.99 | 0.99 | 100 |
| 36 | Sakha | 0.99 | 1.00 | 1.00 | 105 |
| 37 | Slovenian | 0.99 | 1.00 | 1.00 | 166 |
| 38 | Spanish | 0.96 | 0.95 | 0.95 | 94 |
| 39 | Swedish | 0.99 | 1.00 | 0.99 | 190 |
| 40 | Tamil | 1.00 | 1.00 | 1.00 | 135 |
| 41 | Tatar | 1.00 | 0.96 | 0.98 | 173 |
| 42 | Turkish | 1.00 | 1.00 | 1.00 | 137 |
| 43 | Ukranian | 0.99 | 1.00 | 1.00 | 126 |
| 44 | Welsh | 0.98 | 1.00 | 0.99 | 103 |
| | *macro avg* | 0.98 | 0.99 | 0.98 | 5963 |
| | *weighted avg* | 0.98 | 0.98 | 0.98 | 5963 |
| | *overall accuracy* | | | 0.9837 | 5963 |
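The table above follows the layout of scikit-learn's `classification_report`. The sketch below shows how such a report can be produced from gold and predicted label indices; the tiny `y_true` / `y_pred` lists are hypothetical placeholders, whereas the real evaluation covers all 5,963 **common_language** test examples.

```python
from sklearn.metrics import classification_report

# Hypothetical gold/predicted label indices for a handful of sentences
# (6 = Chinese_Taiwan, 5 = Chinese_Hongkong, 17 = German, 14 = French, 31 = Polish).
y_true = [6, 6, 17, 14, 31]
y_pred = [6, 5, 17, 14, 31]

# Print per-language precision / recall / f1-score / support plus macro and weighted averages.
print(classification_report(y_true, y_pred,
                            labels=list(range(len(languages))),
                            target_names=languages,
                            digits=2, zero_division=0))
```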