nikitast committed on
Commit
2e44dd4
1 Parent(s): d0ae5d3

Create README.md

  1. README.md +39 -0
README.md ADDED
@@ -0,0 +1,39 @@
+ ---
+ language:
+ - ru
+ - uk
+ - be
+ - kk
+ - az
+ - hy
+ - ka
+ - he
+ - en
+ - de
+ tags:
+ - language classification
+ - text segmentation
+ datasets:
+ - open_subtitles
+ - tatoeba
+ - oscar
+ ---
+
+ # RoBERTa for Multilabel Language Segmentation
+ ## Training
+ RoBERTa fine-tuned on small subsets of the Open Subtitles, OSCAR and Tatoeba datasets (~9k samples per language).
+
+ A heuristic algorithm creates the multilingual training data and generates the target masks: https://github.com/n1kstep/lang-classifier
+
+ | data source     | languages      |
+ |-----------------|----------------|
+ | open_subtitles  | ka, he, en, de |
+ | oscar           | be, kk, az, hy |
+ | tatoeba         | ru, uk         |
+
+ ## Validation
+ The metrics below were obtained on a separate part of the dataset (~1k samples per language).
+
+ | Validation Loss | Precision | Recall   | F1-Score | Accuracy |
+ |-----------------|-----------|----------|----------|----------|
+ | 0.029172 | 0.919623 | 0.933586 | 0.926552 | 0.991883 |
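
The Training section of the added README mentions a heuristic that builds multilingual samples together with target masks, but it does not describe the procedure. The sketch below shows one plausible way to do this. It is not the code from the linked repository: the toy corpus, the function name `make_mixed_sample`, and the word-level mask layout are all assumptions made for illustration.

```python
import random

# Hypothetical toy corpus (assumption): language code -> monolingual sentences.
CORPUS = {
    "en": ["the weather is nice today", "where is the station"],
    "de": ["das wetter ist heute schön", "wo ist der bahnhof"],
    "ru": ["сегодня хорошая погода", "где находится вокзал"],
}


def make_mixed_sample(corpus, n_segments, rng):
    """Concatenate sentences from randomly chosen languages.

    Returns the words of the mixed sample and a mask of the same length,
    where mask[i] is the index of the language that words[i] was drawn from.
    Language indices follow the sorted order of the corpus keys.
    """
    langs = sorted(corpus)
    words, mask = [], []
    for _ in range(n_segments):
        lang = rng.choice(langs)
        tokens = rng.choice(corpus[lang]).split()
        words.extend(tokens)
        # Every word of this segment gets the same language index.
        mask.extend([langs.index(lang)] * len(tokens))
    return words, mask


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same sample
    words, mask = make_mixed_sample(CORPUS, n_segments=3, rng=rng)
    for word, label in zip(words, mask):
        print(f"{word}\t{sorted(CORPUS)[label]}")
```

A real pipeline would additionally have to align these word-level labels with the subword tokens produced by the RoBERTa tokenizer, which this sketch leaves out.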