---
base_model: bert-base-multilingual-uncased
language:
- ar
- da
- nl
- en
- fr
- de
- el
- hi
- it
- kn
- ml
- pt
- ru
- es
- sv
- ta
- tr
pipeline_tag: text-classification
widget:
- text: "I have seen it somewhere..."
  example_title: "English"
- text: "Ik heb het al gezien"
  example_title: "Dutch"
- text: "Интересная идея"
  example_title: "Russian"
- text: "Que vamos a hacer?"
  example_title: "Spanish"
- text: "Hvor er der en pengeautomat?"
  example_title: "Danish"
- text: "إنه مشوق جدا"
  example_title: "Arabic"
- text: "Es ist sehr interessant"
  example_title: "German"
- text: "c'est très intéressant"
  example_title: "French"
- text: "Non ho mai visto una tale bellezza"
  example_title: "Italian"
- text: "Jag har aldrig sett en sådan skönhet"
  example_title: "Swedish"
- text: "Böyle bir güzellik görmedim"
  example_title: "Turkish"
- text: "ಅದ್ಭುತ ಕಲ್ಪನೆ"
  example_title: "Kannada"
- text: "அற்புதமான யோசனை"
  example_title: "Tamil"
- text: "Υπέροχη ιδέα"
  example_title: "Greek"
- text: "Eu nunca estive aqui"
  example_title: "Portuguese"
- text: "मैं यहां कभी नहीं गया"
  example_title: "Hindi"
- text: "ഞാൻ ഇവിടെ പോയിട്ടില്ല"
  example_title: "Malayalam"
license: mit
---

# Language Detection Model

The model in this repository is a fine-tuned version of `BertForSequenceClassification` built on top of
[bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased), which was pretrained on multilingual texts.

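As a quick check, the model can be loaded through the `transformers` text-classification pipeline. The sketch below is illustrative: `<this-repo-id>` is a placeholder for the actual Hugging Face repository id of this model.

```python
from transformers import pipeline

# Placeholder: replace with the actual repository id of this model
model_id = "<this-repo-id>"

# The pipeline tokenizes the input, runs BertForSequenceClassification,
# and returns the highest-scoring language label with its confidence score.
classifier = pipeline("text-classification", model=model_id)

print(classifier("Es ist sehr interessant"))
# e.g. [{'label': '<predicted language>', 'score': 0.99...}]
```
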
## Training/fine-tuning

The model has been fine-tuned on the [Language Detection](https://www.kaggle.com/datasets/basilb2s/language-detection)
dataset from *Kaggle*. The dataset analysis and a complete description of the training procedure
can be found in [one of my *Kaggle* notebooks](https://www.kaggle.com/code/sergeypolivin/language-recognition-using-bert),
which was used to train the model faster on a *GPU*.

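For orientation, the snippet below sketches the general shape of such a fine-tuning run with the `transformers` `Trainer`. It is a simplified illustration rather than the exact notebook code: the hyperparameters and the `train_ds`/`eval_ds` variables (assumed to be tokenized splits of the Kaggle CSV with integer language labels) are assumptions.

```python
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-uncased",
    num_labels=17,  # one class per supported language
)

args = TrainingArguments(
    output_dir="language-detection",
    num_train_epochs=2,               # illustrative hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

# train_ds / eval_ds: assumed tokenized datasets built from the Kaggle data
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```
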
## Supported languages

The model has been fine-tuned to detect one of the following 17 languages (see the snippet after the list for reading the model's label mapping):

- Arabic
- Danish
- Dutch
- English
- French
- German
- Greek
- Hindi
- Italian
- Kannada
- Malayalam
- Portuguese
- Russian
- Spanish
- Swedish
- Tamil
- Turkish

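The mapping between the classifier's output indices and these language names is stored in the model configuration. Assuming the standard `transformers` config layout (and the same `<this-repo-id>` placeholder as above), it can be inspected as follows:

```python
from transformers import AutoConfig

# Placeholder: replace with the actual repository id of this model
config = AutoConfig.from_pretrained("<this-repo-id>")

# id2label maps each output index of the classification head to a language name
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```
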

## References

1. [BERT multilingual base model (uncased)](https://huggingface.co/bert-base-multilingual-uncased)
2. [Language Detection Dataset](https://www.kaggle.com/datasets/basilb2s/language-detection)