
RoBERTa for Multilabel Language Classification

Training

RoBERTa was fine-tuned on small subsets of the Open Subtitles, OSCAR and Tatoeba datasets (~9k samples per language).

The multilingual training data was created with a heuristic algorithm, implemented at https://github.com/n1kstep/lang-classifier

| Data source | Languages |
|---|---|
| open_subtitles | ka, he, en, de |
| oscar | be, kk, az, hu |
| tatoeba | ru, uk |
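
The card does not describe the heuristic itself, so the following is only a minimal sketch of one plausible way to build multilabel samples: sentences from a few randomly chosen languages are concatenated, and the target is a multi-hot vector over the ten languages listed above. The function name `make_multilabel_sample`, the `max_langs` parameter and the label order in `LANGUAGES` are assumptions made for illustration; the actual algorithm is in the linked repository and may differ.

```python
import random

# Language codes taken from the table on this card.
# The index order is an assumption for illustration only.
LANGUAGES = ["ka", "he", "en", "de", "be", "kk", "az", "hu", "ru", "uk"]


def make_multilabel_sample(pool, max_langs=3, rng=None):
    """Join sentences from 1..max_langs distinct languages into one sample.

    pool: dict mapping a language code to a list of monolingual sentences.
    Returns (text, labels), where labels is a multi-hot list aligned
    with LANGUAGES.
    """
    rng = rng or random.Random()
    # Only languages that actually have sentences in the pool can be drawn.
    available = [lang for lang in LANGUAGES if pool.get(lang)]
    if not available:
        raise ValueError("pool contains no sentences for any known language")
    k = rng.randint(1, min(max_langs, len(available)))
    chosen = rng.sample(available, k)
    # One random sentence per chosen language, joined with spaces.
    text = " ".join(rng.choice(pool[lang]) for lang in chosen)
    labels = [1.0 if lang in chosen else 0.0 for lang in LANGUAGES]
    return text, labels
```

With a pool such as `{"en": ["Good morning."], "de": ["Guten Morgen."]}` and `max_langs=2`, each call returns either a single-language sentence or a two-language concatenation, with the matching positions in `labels` set to 1.0.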

Validation

The metrics were obtained by validating on a separate part of the dataset (~1k samples per language).

| Training Loss | Validation Loss | F1-Score | ROC AUC | Accuracy | Support |
|---|---|---|---|---|---|
| 0.161500 | 0.110949 | 0.947844 | 0.953939 | 0.762063 | 26858 |
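
The gap between F1-Score and Accuracy above is typical for multilabel tasks if accuracy is computed as exact-match (subset) accuracy, where a sample counts as correct only when every label is right. The card does not state which averaging or accuracy definition was used, so the sketch below assumes micro-averaged F1 and subset accuracy; the function names and the toy data are illustrative, not taken from the model's evaluation code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) pairs of 0/1 values."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Toy data: one of six labels across four samples is missed.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]

f1 = micro_f1(y_true, y_pred)          # 10/11, about 0.909
acc = subset_accuracy(y_true, y_pred)  # 3/4 = 0.75
```

A single missed label lowers micro F1 only slightly but makes the whole sample count as wrong for subset accuracy, which is one way a model can reach a high F1 with a noticeably lower accuracy.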

Datasets used to train nikitast/multilang-classifier-roberta