RoBERTa for Single Language Classification

Training

RoBERTa was fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language); a fine-tuning sketch follows the table below.

| data source    | languages      |
|----------------|----------------|
| open_subtitles | ka, he, en, de |
| oscar          | be, kk, az, hy |
| tatoeba        | ru, uk         |
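
The exact training script is not part of this card. The following is a minimal fine-tuning sketch with the Hugging Face Trainer; the base checkpoint (xlm-roberta-base), hyperparameters, and the tiny in-memory datasets standing in for the real ~9k-samples-per-language splits are all illustrative assumptions.

```python
# Minimal fine-tuning sketch (illustrative; not the exact training script).
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

LANGS = ["az", "be", "de", "en", "he", "hy", "ka", "kk", "ru", "uk"]

# Assumption: the card does not name the exact RoBERTa variant;
# a multilingual checkpoint is a natural choice for these languages.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(LANGS),
    id2label=dict(enumerate(LANGS)),
    label2id={lang: i for i, lang in enumerate(LANGS)},
)

# Toy stand-ins for the real train/validation splits ("text" + "label").
train_ds = Dataset.from_dict({
    "text": ["Good morning", "Guten Morgen", "Доброе утро"],
    "label": [LANGS.index("en"), LANGS.index("de"), LANGS.index("ru")],
})
val_ds = Dataset.from_dict({
    "text": ["Добрий ранок"],
    "label": [LANGS.index("uk")],
})

def tokenize(batch):
    # Short subtitle/sentence samples fit comfortably in 128 tokens.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lang-classifier-roberta", num_train_epochs=3),
    train_dataset=train_ds,
    eval_dataset=val_ds,
)
trainer.train()
```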

Validation

The metrics below were obtained by validating the model on a held-out part of the dataset (~1k samples per language); a sketch of the reporting step follows the table.

| index | class        | f1-score | precision | recall | support |
|-------|--------------|----------|-----------|--------|---------|
| 0     | az           | 0.998    | 0.997     | 1.0    | 997     |
| 1     | be           | 0.996    | 0.998     | 0.994  | 1004    |
| 2     | de           | 0.976    | 0.966     | 0.987  | 979     |
| 3     | en           | 0.976    | 0.986     | 0.967  | 1020    |
| 4     | he           | 1.0      | 1.0       | 0.999  | 1001    |
| 5     | hy           | 0.994    | 0.991     | 0.998  | 993     |
| 6     | ka           | 0.999    | 0.999     | 0.999  | 1000    |
| 7     | kk           | 0.996    | 0.998     | 0.993  | 1005    |
| 8     | uk           | 0.982    | 0.997     | 0.968  | 1030    |
| 9     | ru           | 0.982    | 0.968     | 0.997  | 971     |
| 10    | macro avg    | 0.99     | 0.99      | 0.99   | 10000   |
| 11    | weighted avg | 0.99     | 0.99      | 0.99   | 10000   |
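
The evaluation script itself is not shown in this card; the snippet below is a minimal sketch of how such a per-class report is typically produced with scikit-learn's classification_report, using toy labels so it runs as-is.

```python
# Minimal sketch of the reporting step behind the table above.
from sklearn.metrics import classification_report

LANGS = ["az", "be", "de", "en", "he", "hy", "ka", "kk", "ru", "uk"]

# Toy stand-ins for the real validation labels and model predictions
# (~1k samples per language in the actual evaluation).
y_true = ["en", "de", "ru", "uk", "he"]
y_pred = ["en", "de", "ru", "ru", "he"]

# digits=3 matches the precision of the table above; classes absent from
# the toy data get zeroed-out rows instead of warnings.
print(classification_report(y_true, y_pred, labels=LANGS, digits=3, zero_division=0))
```
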
Inference Example
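
A minimal usage sketch, assuming the checkpoint is available on the Hugging Face Hub under the repository named by this card (nikitast/lang-classifier-roberta):

```python
# Minimal inference sketch: predict the language of a short text.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="nikitast/lang-classifier-roberta",
)

# The returned label string depends on the checkpoint's id2label mapping
# (ideally an ISO code such as "de").
print(classifier("Guten Morgen, wie geht es dir?"))
```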