base_model: bert-base-multilingual-uncased
model-index:
  - name: lang-recogn-model
    results:
      - task:
          type: text-classification
        dataset:
          name: language-detection
          type: language-detection
        metrics:
          - name: accuracy
            type: accuracy
            value: 0.9836
        source:
          name: Language recognition using BERT
          url: >-
            https://www.kaggle.com/code/sergeypolivin/language-recognition-using-bert
language:
  - ar
  - da
  - nl
  - en
  - fr
  - de
  - el
  - hi
  - it
  - kn
  - ml
  - pt
  - ru
  - es
  - sv
  - ta
  - tr
pipeline_tag: text-classification
widget:
  - text: I have seen it somewhere...
    example_title: English
  - text: Ik heb het al gezien
    example_title: Dutch
  - text: Интересная идея
    example_title: Russian
  - text: Que vamos a hacer?
    example_title: Spanish
  - text: Hvor er der en pengeautomat?
    example_title: Danish
  - text: إنه مشوق جدا
    example_title: Arabic
  - text: Es ist sehr interessant
    example_title: German
  - text: c'est très intéressant
    example_title: French
  - text: Non ho mai visto una tale bellezza
    example_title: Italian
  - text: Jag har aldrig sett en sådan skönhet
    example_title: Swedish
  - text: Böyle bir güzellik görmedim
    example_title: Turkish
  - text: ಅದ್ಭುತ ಕಲ್ಪನೆ
    example_title: Kannada
  - text: அற்புதமான யோசனை
    example_title: Tamil
  - text: Υπέροχη ιδέα
    example_title: Greek
  - text: Eu nunca estive aqui
    example_title: Portuguese
  - text: मैं यहां कभी नहीं गया
    example_title: Hindi
  - text: ഞാൻ ഇവിടെ പോയിട്ടില്ല
    example_title: Malayalam
license: mit

Language Detection Model

This repository contains a fine-tuned BertForSequenceClassification model for language detection. It is based on bert-base-multilingual-uncased, a BERT checkpoint pretrained on multilingual text.
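
As a minimal inference sketch, the classifier can be loaded through the `transformers` text-classification pipeline. The Hub ID `spolivin/lang-recogn-model` and the helper names `best_label` and `detect_language` are assumptions made for illustration and are not stated in this card. The exact label strings the model returns are not documented here either.

```python
from typing import Any, Dict, List

# Assumed Hub ID (not stated in this card); replace with the actual repository.
MODEL_ID = "spolivin/lang-recogn-model"


def best_label(scores: List[Any]) -> str:
    """Return the highest-scoring label from pipeline-style output.

    Expects a list of {"label": ..., "score": ...} dicts. Some transformers
    versions wrap that list in an outer list, so one level is unwrapped.
    """
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    top: Dict[str, Any] = max(scores, key=lambda item: item["score"])
    return top["label"]


def detect_language(text: str, model_id: str = MODEL_ID) -> str:
    """Classify `text` and return the top predicted label (hypothetical helper)."""
    # Imported lazily so best_label stays usable without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # top_k=None asks the pipeline for scores over every label.
    return best_label(classifier(text, top_k=None))
```

Calling `detect_language("Ik heb het al gezien")` downloads the model on first use. For repeated calls it would be cheaper to build the pipeline once and reuse it.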

Training/fine-tuning

The model was fine-tuned on the Language Detection dataset from Kaggle. The dataset analysis and a complete description of the training procedure are in my Kaggle notebook, Language recognition using BERT (https://www.kaggle.com/code/sergeypolivin/language-recognition-using-bert), which I ran on Kaggle to train the model faster on a GPU.

Supported languages

The model classifies input text as one of the following 17 languages:

  • Arabic
  • Danish
  • Dutch
  • English
  • French
  • German
  • Greek
  • Hindi
  • Italian
  • Kannada
  • Malayalam
  • Portuguese
  • Russian
  • Spanish
  • Swedish
  • Tamil
  • Turkish
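
For downstream processing it can help to map the language names above to the ISO 639-1 codes declared in this card's metadata. The sketch below pairs each name with its code. The helper `to_iso_code` is hypothetical, and it assumes the model's predicted labels are English language names like those in the list, which this card does not confirm.

```python
from typing import Optional

# Language names from the list above, paired with the ISO 639-1 codes
# listed under `language:` in the card metadata.
LANGUAGE_CODES = {
    "Arabic": "ar",
    "Danish": "da",
    "Dutch": "nl",
    "English": "en",
    "French": "fr",
    "German": "de",
    "Greek": "el",
    "Hindi": "hi",
    "Italian": "it",
    "Kannada": "kn",
    "Malayalam": "ml",
    "Portuguese": "pt",
    "Russian": "ru",
    "Spanish": "es",
    "Swedish": "sv",
    "Tamil": "ta",
    "Turkish": "tr",
}


def to_iso_code(label: str) -> Optional[str]:
    """Return the ISO 639-1 code for a language name, or None if unknown.

    Assumes `label` is an English language name; matching ignores case
    and surrounding whitespace.
    """
    return LANGUAGE_CODES.get(label.strip().title())
```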

References

  1. BERT multilingual base model (uncased)
  2. Language Detection Dataset