---
language:
- ru
- uk
- be
- kk
- az
- hy
- ka
- he
- en
- de
tags:
- language classification
- text segmentation
datasets:
- open_subtitles
- tatoeba
- oscar
---

# RoBERTa for Multilabel Language Segmentation

## Training

RoBERTa was fine-tuned on small parts of the Open Subtitles, Oscar and Tatoeba datasets (~9k samples per language).

Multilingual training data was created with a heuristic algorithm that also generates the target masks: https://github.com/n1kstep/lang-classifier

| data source     | languages      |
|-----------------|----------------|
| open_subtitles  | ka, he, en, de |
| oscar           | be, kk, az, hy |
| tatoeba         | ru, uk         |

## Validation

The metrics below were obtained by validating on a separate, held-out part of the datasets (~1k samples per language).

| Validation Loss | Precision | Recall   | F1-Score | Accuracy |
|-----------------|-----------|----------|----------|----------|
| 0.029172        | 0.919623  | 0.933586 | 0.926552 | 0.991883 |
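
As a quick sanity check on the validation numbers above, the reported F1-Score should equal the harmonic mean of the reported precision and recall. The short sketch below assumes the three values are aggregated the same way; the card does not say which averaging scheme was used, and the function name `f1_from_pr` is only illustrative.

```python
# Values copied from the validation table of this model card.
precision = 0.919623
recall = 0.933586
reported_f1 = 0.926552


def f1_from_pr(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    return 2 * p * r / (p + r)


computed_f1 = f1_from_pr(precision, recall)

# The table reports six decimals, so allow a small rounding tolerance.
is_consistent = abs(computed_f1 - reported_f1) < 1e-5
print(f"computed F1 = {computed_f1:.6f}, consistent with table: {is_consistent}")
```

The reported F1-Score agrees with this formula to the precision shown in the table.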
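
To give a rough idea of what "multilingual training data creation with generation of target masks" (see Training) could look like, here is a minimal sketch. It is not the algorithm from the linked repository. It assumes that a mixed sample is built by concatenating monolingual sentences from different languages, and that the target mask holds one language label per whitespace-separated word. The function name `make_mixed_sample`, its parameters, and the toy corpus are all hypothetical.

```python
import random


def make_mixed_sample(sentences_by_lang, n_segments=2, seed=None):
    """Build one mixed-language sample and its word-level target mask.

    Hypothetical illustration: picks `n_segments` distinct languages,
    draws one sentence from each, concatenates the words, and records
    the language of every word in a parallel list.
    """
    rng = random.Random(seed)
    langs = rng.sample(sorted(sentences_by_lang), k=n_segments)

    tokens, mask = [], []
    for lang in langs:
        words = rng.choice(sentences_by_lang[lang]).split()
        tokens.extend(words)
        mask.extend([lang] * len(words))  # one label per word
    return tokens, mask


# Toy corpus for illustration only; the real data comes from
# Open Subtitles, Oscar and Tatoeba.
corpus = {
    "en": ["How are you today?"],
    "de": ["Wie geht es dir?"],
    "ru": ["Как у тебя дела?"],
}

tokens, mask = make_mixed_sample(corpus, n_segments=2, seed=0)
for word, lang in zip(tokens, mask):
    print(f"{lang}\t{word}")
```

A model trained on pairs like `(tokens, mask)` can then be evaluated per word, which fits the precision, recall and accuracy figures reported under Validation. Whether the original pipeline labels words, subword tokens or characters is not stated in this card.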