nikitast committed on
Commit
475e27c
1 Parent(s): 5f5fa73

Create README.md

  1. README.md +38 -0
README.md ADDED
@@ -0,0 +1,38 @@
+ ---
+ language:
+ - ru
+ - uk
+ - be
+ - kk
+ - az
+ - hy
+ - ka
+ - he
+ - en
+ - de
+ tags:
+ - language classification
+ datasets:
+ - open_subtitles
+ - tatoeba
+ - oscar
+ ---
+
+ # RoBERTa for Multilabel Language Classification
+ ## Training
+ RoBERTa fine-tuned on small subsets of the Open Subtitles, OSCAR and Tatoeba datasets (~9k samples per language).
+
+ Multilingual training data was created with a heuristic algorithm, implemented at https://github.com/n1kstep/lang-classifier
+
+ | data source     | languages      |
+ |-----------------|----------------|
+ | open_subtitles  | ka, he, en, de |
+ | oscar           | be, kk, az, hy |
+ | tatoeba         | ru, uk         |
+
+ ## Validation
+ Metrics were obtained by validating on a separate part of the dataset (~1k samples per language).
+
+ | Training Loss | Validation Loss | F1-Score | ROC AUC  | Accuracy | Support |
+ |---------------|-----------------|----------|----------|----------|---------|
+ | 0.161500      | 0.110949        | 0.947844 | 0.953939 | 0.762063 | 26858   |
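
Note on reading the validation table: in a multilabel setting, accuracy can sit well below the F1-score when accuracy is counted per sample as an exact match of the whole label set. The README does not state which accuracy definition or F1 averaging was used, so the following is only a minimal sketch under two assumptions: accuracy means exact-match (subset) accuracy, which is what scikit-learn's `accuracy_score` computes for multilabel input, and F1 is micro-averaged. The toy labels and the helper names (`to_indicator`, `micro_f1`, `subset_accuracy`) are hypothetical and not taken from the repository.

```python
from typing import List, Sequence

# Language codes from the model card's front matter.
LANGS = ["ru", "uk", "be", "kk", "az", "hy", "ka", "he", "en", "de"]


def to_indicator(labels: Sequence[str]) -> List[int]:
    """Encode a set of language codes as a 0/1 vector over LANGS."""
    return [1 if lang in labels else 0 for lang in LANGS]


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over every (sample, label) pair."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of samples whose predicted label set matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


# Toy data (hypothetical): four texts, two of them predicted imperfectly.
y_true = [to_indicator(s) for s in (["ru", "en"], ["de"], ["uk", "be"], ["he"])]
y_pred = [to_indicator(s) for s in (["ru", "en"], ["de"], ["uk"], ["he", "en"])]

print(f"micro F1:        {micro_f1(y_true, y_pred):.3f}")
print(f"subset accuracy: {subset_accuracy(y_true, y_pred):.3f}")
```

On this toy data, one missed label and one spurious label each cost a full sample under exact-match accuracy but only a single label decision under micro F1. That is one plausible reason for a gap like the one between the Accuracy and F1-Score columns above.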