RoBERTa for Multilabel Language Segmentation

Training

RoBERTa fine-tuned on small parts of Open Subtitles, Oscar and Tatoeba datasets (~9k samples per language).

Implemented heuristic algorithm for multilingual training data creation with generation of target masks- https://github.com/n1kstep/lang-classifier

data source language
open_subtitles ka, he, en, de
oscar be, kk, az, hu
tatoeba ru, uk

Validation

The metrics obtained from validation on the another part of dataset (~1k samples per language).

Validation Loss Precision Recall F1-Score Accuracy
0.029172 0.919623 0.933586 0.926552 0.991883
Downloads last month
14
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train nikitast/lang-segmentation-roberta

Space using nikitast/lang-segmentation-roberta 1