metadata
language:
  - ru
  - uk
  - be
  - kk
  - az
  - hy
  - ka
  - he
  - en
  - de
  - multilingual
tags:
  - language classification
datasets:
  - open_subtitles
  - tatoeba
  - oscar

RoBERTa for Multilabel Language Classification

Training

A RoBERTa model fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).
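Because the task is multilabel, a single input can be tagged with more than one language. A minimal sketch of how such outputs are typically decoded is shown here: each logit goes through an independent sigmoid and is compared to a threshold. The label order, the 0.5 threshold, and the example logits are assumptions made for illustration; the real values come from the model's configuration and outputs.

```python
import math

# Assumed label order for illustration only; the real order is defined
# by the model configuration, not by this card.
LABELS = ["ru", "uk", "be", "kk", "az", "hy", "ka", "he", "en", "de"]


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Return every language whose probability reaches the threshold."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    return [
        label
        for label, logit in zip(LABELS, logits)
        if sigmoid(logit) >= threshold
    ]


# Hypothetical logits for a text that mixes Russian and English.
example_logits = [3.2, -4.1, -5.0, -3.3, -6.0, -4.4, -5.2, -4.9, 2.7, -3.8]
detected = decode_multilabel(example_logits)
print(detected)  # ['ru', 'en']
```

Unlike a softmax, the per-label sigmoid does not force the probabilities to sum to one, so zero, one, or several languages can be returned for the same text.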

Multilingual training samples were created with a heuristic algorithm, implemented at https://github.com/n1kstep/lang-classifier

| data source | language |
|---|---|
| open_subtitles | ka, he, en, de |
| oscar | be, kk, az, hy |
| tatoeba | ru, uk |
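This card does not describe the heuristic used to build multilingual samples; the actual implementation lives in the linked repository. As a rough illustration of one possible approach, the sketch below concatenates sentences from a few randomly chosen languages and emits a multi-hot label vector. The function name `make_mixed_sample`, the `max_langs` parameter, and the toy sentence pool are all hypothetical.

```python
import random

LANGS = ["ru", "uk", "be", "kk", "az", "hy", "ka", "he", "en", "de"]


def make_mixed_sample(pool, rng, max_langs=3):
    """Hypothetical mixing heuristic, not the repository's algorithm.

    Picks 1..max_langs distinct languages from the pool, joins one random
    sentence from each, and returns the text with a multi-hot label.
    """
    k = rng.randint(1, min(max_langs, len(pool)))
    chosen = rng.sample(sorted(pool), k)
    text = " ".join(rng.choice(pool[lang]) for lang in chosen)
    label = [1.0 if lang in chosen else 0.0 for lang in LANGS]
    return text, label


# Toy monolingual pool; real data would come from the datasets listed above.
toy_pool = {
    "en": ["Good morning.", "Where is the station?"],
    "de": ["Guten Morgen.", "Wo ist der Bahnhof?"],
    "ru": ["Доброе утро.", "Где находится вокзал?"],
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
text, label = make_mixed_sample(toy_pool, rng)
print(text)
print(label)
```

The multi-hot label has one position per language, which matches the multilabel setup: a sample built from two languages has two positions set to 1.0.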

Validation

The metrics were obtained by validating on a separate part of the dataset (~1k samples per language).

| Training Loss | Validation Loss | F1-Score | ROC AUC | Accuracy | Support |
|---|---|---|---|---|---|
| 0.161500 | 0.110949 | 0.947844 | 0.953939 | 0.762063 | 26858 |
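The gap between the F1-score and the accuracy is typical of multilabel evaluation. The card does not state how either metric was computed, so the sketch below assumes micro-averaged F1 and exact-match (subset) accuracy, where a sample counts as correct only if every label matches. The toy labels are invented for illustration.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed over all (sample, label) decisions pooled together."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy multi-hot labels over three languages (invented data).
y_true = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]]

acc = subset_accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print(f"subset accuracy = {acc:.3f}")  # 0.500
print(f"micro F1        = {f1:.3f}")   # 0.833
```

One wrong label makes the whole sample count as an error for subset accuracy, while micro F1 still credits the labels that were right. Under these assumptions, an accuracy well below the F1-score is expected.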