nikitast/multilang-classifier-roberta

	---
	language:
	- ru
	- uk
	- be
	- kk
	- az
	- hy
	- ka
	- he
	- en
	- de
	- multilingual
	tags:
	- language classification
	datasets:
	- open_subtitles
	- tatoeba
	- oscar
	---

	# RoBERTa for Multilabel Language Classification
	## Training
	A RoBERTa model fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).

	Multilingual training samples were created with a heuristic algorithm; the code is available at https://github.com/n1kstep/lang-classifier
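The card does not describe the heuristic itself. As an illustration only, here is a minimal sketch of one plausible way to build multilabel samples: join sentences from a few randomly chosen languages and label the result with every language used. The helper name `make_multilingual_sample`, its parameters, and the approach are assumptions and are not taken from the linked repository.

```python
import random


def make_multilingual_sample(pool, max_langs=2, rng=None):
    """Join sentences from up to `max_langs` languages into one sample.

    `pool` maps a language code to a list of sentences.
    Hypothetical helper: an assumed heuristic, not the repository's code.
    """
    rng = rng or random.Random()
    # Pick how many languages this sample mixes, then which ones.
    k = rng.randint(1, min(max_langs, len(pool)))
    langs = rng.sample(sorted(pool), k)
    # One random sentence per chosen language, joined into a single text.
    text = " ".join(rng.choice(pool[lang]) for lang in langs)
    # The multilabel target is the set of languages present in the text.
    return text, sorted(langs)


pool = {
    "en": ["Hello there."],
    "de": ["Guten Tag."],
    "ru": ["Привет."],
}
text, labels = make_multilingual_sample(pool, max_langs=2, rng=random.Random(0))
print(text, labels)
```

Passing a seeded `random.Random` keeps the generated samples reproducible between runs.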

	| Data source | Languages |
	|----------------|----------------|
	| open_subtitles | ka, he, en, de |
	| oscar | be, kk, az, hy |
	| tatoeba | ru, uk |
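Because the model is multilabel, a single text can be assigned several of these language codes at once. A minimal sketch of the usual decoding step, assuming one logit per language, an independent sigmoid per label, and a 0.5 threshold; the label order, the threshold, and the helper name `decode_languages` are assumptions (the real order would normally come from the model config's `id2label`):

```python
import math

# Assumed label order; check the model config's id2label for the real one.
LABELS = ["ru", "uk", "be", "kk", "az", "hy", "ka", "he", "en", "de"]


def decode_languages(logits, labels=LABELS, threshold=0.5):
    """Turn per-language logits into (language, score) pairs above a threshold."""
    # Each label gets its own sigmoid, so scores do not have to sum to 1.
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [(lab, round(s, 3)) for lab, s in zip(labels, scores) if s >= threshold]


# Made-up logits for a text that mixes English and German.
logits = [-4.0, -3.5, -5.0, -4.2, -3.9, -4.8, -4.1, -3.0, 2.3, 1.1]
print(decode_languages(logits))
```

Raising `threshold` trades recall for precision: fewer languages are reported, but with higher confidence.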

	## Validation
	Metrics were obtained by validating on a separate part of the dataset (~1k samples per language).

	| Training Loss | Validation Loss | F1-Score | ROC AUC | Accuracy | Support |
	|---------------|-----------------|----------|----------|----------|---------|
	| 0.161500 | 0.110949 | 0.947844 | 0.953939 | 0.762063 | 26858 |
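The card does not say how F1 and accuracy were computed. One common multilabel setup, assumed here, is micro-averaged F1 together with subset (exact-match) accuracy, where a sample counts as correct only if every label matches. Under that assumption accuracy can sit well below F1, as in the table above. A small sketch on toy data (the function names and the data are illustrative):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) pairs."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data: 4 samples, 3 labels; the third sample misses one of its two labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

print(micro_f1(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

A single missed label lowers micro F1 only slightly but marks the whole sample as wrong for subset accuracy.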