language:
  - ru
  - uk
  - be
  - kk
  - az
  - hy
  - ka
  - he
  - en
  - de
  - multilingual
tags:
  - language classification
datasets:
  - open_subtitles
  - tatoeba
  - oscar

RoBERTa for Single Language Classification

Training

RoBERTa fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).

| data source | languages |
|---|---|
| open_subtitles | ka, he, en, de |
| oscar | be, kk, az, hy |
| tatoeba | ru, uk |
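
The card does not say how the samples were drawn or split. As a rough illustration only, the sketch below draws the same number of texts per language and then holds out a fraction of the pooled set, which would give roughly 9k training and 1k held-out samples per language with `per_lang=10_000`. The function name `balanced_split`, its parameters, and the toy data are all hypothetical and not taken from the original training code.

```python
import random
from collections import Counter


def balanced_split(texts_by_lang, per_lang=10_000, val_fraction=0.1, seed=0):
    """Draw up to `per_lang` texts per language, then split the pooled set.

    Hypothetical helper: the real sampling procedure is not documented.
    """
    rng = random.Random(seed)
    pooled = []
    for lang, texts in sorted(texts_by_lang.items()):
        k = min(per_lang, len(texts))  # never request more than is available
        pooled.extend((text, lang) for text in rng.sample(texts, k))
    rng.shuffle(pooled)
    n_val = int(len(pooled) * val_fraction)
    # Held-out samples come first in the shuffled pool, the rest is training data.
    return pooled[n_val:], pooled[:n_val]


# Toy data standing in for the real corpora.
toy = {
    lang: [f"{lang} sentence {i}" for i in range(50)]
    for lang in ("ru", "uk", "be")
}
train, val = balanced_split(toy, per_lang=40, val_fraction=0.1)
print(len(train), len(val))
print(Counter(lang for _, lang in train))
```

Because the split is made on the pooled set rather than per language, the held-out class counts vary slightly around the target instead of matching it exactly.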

Validation

The metrics below were obtained by validating on a separate, held-out part of the dataset (~1k samples per language).

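| index | class | f1-score | precision | recall | support |
|---|---|---|---|---|---|
| 0 | az | 0.998 | 0.997 | 1.0 | 997 |
| 1 | be | 0.996 | 0.998 | 0.994 | 1004 |
| 2 | de | 0.976 | 0.966 | 0.987 | 979 |
| 3 | en | 0.976 | 0.986 | 0.967 | 1020 |
| 4 | he | 1.0 | 1.0 | 0.999 | 1001 |
| 5 | hy | 0.994 | 0.991 | 0.998 | 993 |
| 6 | ka | 0.999 | 0.999 | 0.999 | 1000 |
| 7 | kk | 0.996 | 0.998 | 0.993 | 1005 |
| 8 | uk | 0.982 | 0.997 | 0.968 | 1030 |
| 9 | ru | 0.982 | 0.968 | 0.997 | 971 |
| 10 | macro avg | 0.99 | 0.99 | 0.99 | 10000 |
| 11 | weighted avg | 0.99 | 0.99 | 0.99 | 10000 |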
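
As a sanity check on how the two summary rows relate to the per-class rows, the sketch below recomputes the macro and weighted F1 averages from the rounded values in the table. It assumes the usual definitions (macro: unweighted mean over classes; weighted: mean weighted by support; F1: harmonic mean of precision and recall). Because the inputs are already rounded to three decimals, recomputed per-class F1 values can differ from the table in the last digit.

```python
# (class, f1, precision, recall, support), copied from the validation table.
rows = [
    ("az", 0.998, 0.997, 1.000, 997),
    ("be", 0.996, 0.998, 0.994, 1004),
    ("de", 0.976, 0.966, 0.987, 979),
    ("en", 0.976, 0.986, 0.967, 1020),
    ("he", 1.000, 1.000, 0.999, 1001),
    ("hy", 0.994, 0.991, 0.998, 993),
    ("ka", 0.999, 0.999, 0.999, 1000),
    ("kk", 0.996, 0.998, 0.993, 1005),
    ("uk", 0.982, 0.997, 0.968, 1030),
    ("ru", 0.982, 0.968, 0.997, 971),
]


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(r[4] for r in rows)

# Macro average: every class counts equally.
macro_f1 = sum(r[1] for r in rows) / len(rows)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(r[1] * r[4] for r in rows) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
print(f"az F1 from P/R: {f1(0.997, 1.0):.3f}")
```

With supports this close to 1000 per class, the macro and weighted averages are nearly identical, which matches the two summary rows.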