---
base_model: bert-base-multilingual-uncased
# model-index:
# - name: lang-recogn-model
#   results:
#   - task:
#       type: text-classification
#     dataset:
#       name: language-detection
#       type: language-detection
#     metrics:
#     - name: accuracy
#       type: accuracy
#       value: 0.9836
#     source:
#       name: Language recognition using BERT
#       url: >-
#         https://www.kaggle.com/code/sergeypolivin/language-recognition-using-bert
language:
- ar
- da
- nl
- en
- fr
- de
- el
- hi
- it
- kn
- ml
- pt
- ru
- es
- sv
- ta
- tr
pipeline_tag: text-classification
widget:
- text: "I have seen it somewhere..."
  example_title: "English"
- text: "Ik heb het al gezien"
  example_title: "Dutch"
- text: "Интересная идея"
  example_title: "Russian"
- text: "Que vamos a hacer?"
  example_title: "Spanish"
- text: "Hvor er der en pengeautomat?"
  example_title: "Danish"
- text: "إنه مشوق جدا"
  example_title: "Arabic"
- text: "Es ist sehr interessant"
  example_title: "German"
- text: "c'est très intéressant"
  example_title: "French"
- text: "Non ho mai visto una tale bellezza"
  example_title: "Italian"
- text: "Jag har aldrig sett en sådan skönhet"
  example_title: "Swedish"
- text: "Böyle bir güzellik görmedim"
  example_title: "Turkish"
- text: "ಅದ್ಭುತ ಕಲ್ಪನೆ"
  example_title: "Kannada"
- text: "அற்புதமான யோசனை"
  example_title: "Tamil"
- text: "Υπέροχη ιδέα"
  example_title: "Greek"
- text: "Eu nunca estive aqui"
  example_title: "Portuguese"
- text: "मैं यहां कभी नहीं गया"
  example_title: "Hindi"
- text: "ഞാൻ ഇവിടെ പോയിട്ടില്ല"
  example_title: "Malayalam"
license: mit
---

# Language Detection Model

This model is a `BertForSequenceClassification` fine-tuned from a BERT checkpoint pretrained on [multilingual texts](https://huggingface.co/bert-base-multilingual-uncased).

## Training/fine-tuning

The model was fine-tuned on the [Language Detection](https://www.kaggle.com/datasets/basilb2s/language-detection) dataset from *Kaggle*.
The dataset analysis and a complete description of the training procedure are available in [one of my *Kaggle* notebooks](https://www.kaggle.com/code/sergeypolivin/language-recognition-using-bert), which was used to speed up training on a *GPU*.

## Supported languages

The model was fine-tuned to detect the following 17 languages:

- Arabic
- Danish
- Dutch
- English
- French
- German
- Greek
- Hindi
- Italian
- Kannada
- Malayalam
- Portuguese
- Russian
- Spanish
- Swedish
- Tamil
- Turkish

## References

1. [BERT multilingual base model (uncased)](https://huggingface.co/bert-base-multilingual-uncased)
2. [Language Detection Dataset](https://www.kaggle.com/datasets/basilb2s/language-detection)
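## Usage sketch

Since the model is tagged for the `text-classification` pipeline, it can probably be queried along the lines below. This is a minimal sketch, not the author's documented usage: `MODEL_ID` is a hypothetical placeholder for the actual repository ID or a local checkpoint path, the helper names `load_classifier` and `detect_language` are made up for illustration, and the label strings returned depend on the `id2label` mapping stored in the model config.

```python
from typing import Callable, Dict, List

# Hypothetical placeholder: replace with the real repository ID or a local path.
MODEL_ID = "your-namespace/lang-recogn-model"


def load_classifier(model_id: str):
    """Build a text-classification pipeline for the given checkpoint."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def detect_language(text: str, classifier: Callable[[str], List[Dict]]) -> str:
    """Return the label with the highest score among the classifier's predictions."""
    predictions = classifier(text)
    best = max(predictions, key=lambda p: p["score"])
    return best["label"]


if __name__ == "__main__":
    # Downloads the checkpoint on first use; requires transformers and a backend such as PyTorch.
    clf = load_classifier(MODEL_ID)
    for sample in ["Ik heb het al gezien", "Es ist sehr interessant"]:
        print(sample, "->", detect_language(sample, clf))
```

Keeping `detect_language` independent of the pipeline object makes it easy to swap in a locally loaded model or a stub when testing.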