# RoBERTa for Multilabel Language Segmentation

## Training

RoBERTa fine-tuned on small subsets of the OpenSubtitles, OSCAR, and Tatoeba datasets (~9k samples per language).

A heuristic algorithm was implemented to build the multilingual training data and generate the corresponding target masks (see the sketch after the table below): https://github.com/n1kstep/lang-classifier

| Data source    | Languages      |
|----------------|----------------|
| open_subtitles | ka, he, en, de |
| oscar          | be, kk, az, hu |
| tatoeba        | ru, uk         |
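
The linked repository contains the actual algorithm; the following is only a minimal sketch of the general idea, with hypothetical names (`make_multilingual_sample`, `LANG2ID`): sentences drawn from monolingual corpora are concatenated into a mixed-language sample, and a word-level language mask is generated in parallel.

```python
import random

# Hypothetical illustration, not the repository's code: build one
# mixed-language training sample plus an aligned word-level language mask.
LANG2ID = {"ka": 0, "he": 1, "en": 2, "de": 3, "be": 4,
           "kk": 5, "az": 6, "hu": 7, "ru": 8, "uk": 9}

def make_multilingual_sample(corpora, n_parts=3, seed=None):
    """Concatenate random sentences from different languages and
    return the text together with one language id per word."""
    rng = random.Random(seed)
    langs = rng.sample(list(corpora), k=n_parts)
    words, mask = [], []
    for lang in langs:
        sentence = rng.choice(corpora[lang])
        for word in sentence.split():
            words.append(word)
            mask.append(LANG2ID[lang])
    return " ".join(words), mask

corpora = {
    "en": ["the weather is nice today"],
    "de": ["das Wetter ist heute schön"],
    "ru": ["сегодня хорошая погода"],
}
text, mask = make_multilingual_sample(corpora, seed=0)
print(text)
print(mask)  # one language id per word, aligned with text.split()
```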

## Validation

The metrics below were obtained by validating on a held-out part of the dataset (~1k samples per language).

| Validation Loss | Precision | Recall   | F1-Score | Accuracy |
|-----------------|-----------|----------|----------|----------|
| 0.029172        | 0.919623  | 0.933586 | 0.926552 | 0.991883 |
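
For inference, something along these lines should work. This is a hedged sketch: it assumes the checkpoint loads with a standard Transformers token-classification head and stores per-language labels in `model.config.id2label`; check the repository if the actual head or label mapping differs. Since the task is multilabel, per-token scores are thresholded with a sigmoid rather than taken via argmax.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumes a standard HF token-classification head; adjust if the
# actual checkpoint uses a custom head or label mapping.
name = "nikitast/lang-segmentation-roberta"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

text = "hello world guten Tag привет мир"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# Multilabel: score each language independently and keep those above 0.5.
probs = logits.sigmoid()[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, scores in zip(tokens, probs):
    labels = [model.config.id2label[i]
              for i, p in enumerate(scores) if p > 0.5]
    print(token, labels)
```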