---
base_model: bert-base-multilingual-uncased
model-index:
- name: lang-recogn-model
  results:
  - task:
      type: text-classification
    dataset:
      name: language-detection
      type: language-detection
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.9836
    source:
      name: Language recognition using BERT
      url: >-
        https://www.kaggle.com/code/sergeypolivin/language-recognition-using-bert
language:
- ar
- da
- nl
- en
- fr
- de
- el
- hi
- it
- kn
- ml
- pt
- ru
- es
- sv
- ta
- tr
pipeline_tag: text-classification
widget:
- text: "Hello, world"
  example_title: "English"
- text: "Ik heb het al gezien"
  example_title: "Dutch"
- text: "Интересная идея"
  example_title: "Russian"
- text: "Que vamos a hacer?"
  example_title: "Spanish"
- text: "Hvor er der en pengeautomat?"
  example_title: "Danish"
---

# Language Detection Model

This repository contains a fine-tuned version of `BertForSequenceClassification` based on [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased), a BERT model pretrained on multilingual texts.

## Training/fine-tuning

The model has been fine-tuned on the [Language Detection](https://www.kaggle.com/datasets/basilb2s/language-detection) dataset from *Kaggle*. The dataset analysis and a complete description of the training procedure can be found in [one of my *Kaggle* notebooks](https://www.kaggle.com/code/sergeypolivin/language-recognition-using-bert), which was used to train the model faster on a *GPU*.

## Supported languages

The model has been fine-tuned to detect one of the following 17 languages:

- Arabic
- Danish
- Dutch
- English
- French
- German
- Greek
- Hindi
- Italian
- Kannada
- Malayalam
- Portuguese
- Russian
- Spanish
- Swedish
- Tamil
- Turkish

## References

1. [BERT multilingual base model (uncased)](https://huggingface.co/bert-base-multilingual-uncased)
2. [Language Detection Dataset](https://www.kaggle.com/datasets/basilb2s/language-detection)
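## Usage example

A minimal inference sketch using the 🤗 Transformers `pipeline` API. The repo id below is a placeholder and may not be this model's actual Hub path; the exact label strings returned depend on the label map saved with the fine-tuned checkpoint.

```python
def detect_language(text, model_id="sergeypolivin/lang-recogn-model"):
    """Classify `text` into one of the 17 supported languages.

    `model_id` is a placeholder -- substitute this model's actual Hub path.
    """
    # Lazy import: requires `pip install transformers torch`.
    from transformers import pipeline

    detector = pipeline("text-classification", model=model_id)
    # The pipeline returns [{"label": ..., "score": ...}]; keep the top label.
    return detector(text)[0]["label"]


# ISO 639-1 codes from the card metadata, mapped to the fine-tuning languages.
LANGUAGES = {
    "ar": "Arabic", "da": "Danish", "nl": "Dutch", "en": "English",
    "fr": "French", "de": "German", "el": "Greek", "hi": "Hindi",
    "it": "Italian", "kn": "Kannada", "ml": "Malayalam", "pt": "Portuguese",
    "ru": "Russian", "es": "Spanish", "sv": "Swedish", "ta": "Tamil",
    "tr": "Turkish",
}
```

The first call downloads the model weights from the Hub; subsequent calls use the local cache.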