igorsterner committed
Commit 12250e1
1 Parent(s): 9f3bef1

Update README.md

Files changed (1)
  1. README.md +149 -20
README.md CHANGED
@@ -1,42 +1,171 @@
  ---
  license: mit
  base_model: xlm-roberta-base
- tags:
- - generated_from_trainer
  metrics:
- - precision
- - recall
  - f1
- model-index:
- - name: xlm-roberta-base-Multilingual-Sentence-Segmentation-v4
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # xlm-roberta-base-Multilingual-Sentence-Segmentation-v4
-
- This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the None dataset.
- It achieves the following results on the evaluation set:
  - Loss: 0.0074
  - Precision: 0.9664
  - Recall: 0.9677
  - F1: 0.9670

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

  ### Training hyperparameters
 
  ---
  license: mit
  base_model: xlm-roberta-base
+ language:
+ - multilingual
+ - af
+ - am
+ - ar
+ - as
+ - az
+ - be
+ - bg
+ - bn
+ - br
+ - bs
+ - ca
+ - cs
+ - cy
+ - da
+ - de
+ - el
+ - en
+ - eo
+ - es
+ - et
+ - eu
+ - fa
+ - fi
+ - fr
+ - fy
+ - ga
+ - gd
+ - gl
+ - gu
+ - ha
+ - he
+ - hi
+ - hr
+ - hu
+ - hy
+ - id
+ - is
+ - it
+ - ja
+ - jv
+ - ka
+ - kk
+ - km
+ - kn
+ - ko
+ - ku
+ - ky
+ - la
+ - lo
+ - lt
+ - lv
+ - mg
+ - mk
+ - ml
+ - mn
+ - mr
+ - ms
+ - my
+ - ne
+ - nl
+ - 'no'
+ - om
+ - or
+ - pa
+ - pl
+ - ps
+ - pt
+ - ro
+ - ru
+ - sa
+ - sd
+ - si
+ - sk
+ - sl
+ - so
+ - sq
+ - sr
+ - su
+ - sv
+ - sw
+ - ta
+ - te
+ - th
+ - tl
+ - tr
+ - ug
+ - uk
+ - ur
+ - uz
+ - vi
+ - xh
+ - yi
+ - zh
  metrics:
  - f1
  ---

+ # xlmr-multilingual-sentence-segmentation
 
+ This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on a corrupted version of the Universal Dependencies datasets [3].
+ It achieves the following results on the (also corrupted) evaluation set:
  - Loss: 0.0074
  - Precision: 0.9664
  - Recall: 0.9677
  - F1: 0.9670

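+ ## Usage
+
+ The original card has no usage example; below is a minimal inference sketch. The repo id and the assumption that label id 1 marks sentence-final tokens are hypothetical; verify against `model.config.id2label` on the actual checkpoint.
+
+ ```python
+ # Hypothetical usage sketch: the repo id and the meaning of label id 1
+ # are assumptions, not confirmed by this card.
+ import torch
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+ repo = "igorsterner/xlmr-multilingual-sentence-segmentation"  # assumed id
+ tokenizer = AutoTokenizer.from_pretrained(repo)
+ model = AutoModelForTokenClassification.from_pretrained(repo)
+
+ text = "this is one sentence here is another without punctuation"
+ enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
+ offsets = enc.pop("offset_mapping")[0].tolist()
+
+ with torch.no_grad():
+     labels = model(**enc).logits[0].argmax(dim=-1).tolist()
+
+ # Cut the text after every token predicted as a sentence boundary;
+ # special tokens have offset (0, 0) and are skipped.
+ sentences, start = [], 0
+ for (_, tok_end), label in zip(offsets, labels):
+     if tok_end > 0 and label == 1:
+         sentences.append(text[start:tok_end].strip())
+         start = tok_end
+ if start < len(text):
+     sentences.append(text[start:].strip())
+ print(sentences)
+ ```
+
+ Working from character offsets rather than decoded tokens keeps the original spacing of the input intact in the output sentences.
+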
+ # Test set performance
+
+ All results below are percentage F1. Baselines are WtPSplit [1] and Spacy (multilingual).
+
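+ The scoring script is not part of this card. The sketch below shows the standard way boundary F1 is computed, treating predicted and gold sentence-end positions as sets; that this is the exact protocol used here is an assumption. ("Corrupted" presumably means punctuation and casing cues were removed, as in punctuation-agnostic work such as WtPSplit [1], but the card does not say.)
+
+ ```python
+ # Boundary precision/recall/F1 over sentence-end positions
+ # (assumed evaluation protocol, not confirmed by this card).
+ def boundary_scores(pred: set, gold: set):
+     tp = len(pred & gold)
+     p = tp / len(pred) if pred else 0.0
+     r = tp / len(gold) if gold else 0.0
+     f1 = 2 * p * r / (p + r) if p + r else 0.0
+     return p, r, f1
+
+ # Consistency check against the evaluation numbers above:
+ # 2 * 0.9664 * 0.9677 / (0.9664 + 0.9677) ≈ 0.9670
+ ```
+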
+ ## Opus100 [2]
+
+ Who wins most? XLM-RoBERTa: 56, WtPSplit: 12, Spacy (multilingual): 8
+
+ | | af | am | ar | az | be | bg | bn | ca | cs | cy | da | de | el | en | eo | es | et | eu | fa | fi | fr | fy | ga | gd | gl | gu | ha | he | hi | hu | hy | id | is | it | ja | ka | kk | km | kn | ko | ku | ky | lt | lv | mg | mk | ml | mn | mr | ms | my | ne | nl | pa | pl | ps | pt | ro | ru | si | sk | sl | sq | sr | sv | ta | te | th | tr | uk | ur | uz | vi | xh | yi | zh |
+ |:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
+ | Spacy (multilingual) | 42.61 | 6.69 | 58.52 | 73.59 | 34.78 | 93.74 | 38.04 | 88.76 | 87.70 | 26.30 | 90.52 | 74.15 | 89.75 | 89.25 | 88.77 | 90.95 | 87.26 | 81.20 | 55.40 | 93.28 | 85.77 | 21.49 | 60.61 | 36.83 | 88.77 | 5.59 | **89.39** | **92.21** | 53.33 | 93.26 | 24.14 | 90.13 | **95.38** | 86.32 | 0.20 | 38.24 | 42.39 | 0.10 | 9.66 | 51.79 | 27.64 | 21.77 | 76.91 | 77.02 | 83.60 | **93.74** | 39.09 | 33.23 | 86.56 | 87.39 | 0.10 | 6.59 | **93.65** | 5.26 | 92.42 | 2.41 | 92.07 | 91.63 | 75.95 | 75.91 | 92.13 | 93.00 | **92.96** | **95.01** | 93.52 | 36.97 | 64.59 | 21.64 | **94.05** | 89.68 | 29.17 | 64.99 | 90.59 | 64.89 | 4.14 | 0.09 |
+ | WtPSplit | 76.90 | **59.08** | 68.08 | 76.42 | 71.29 | 93.97 | 79.76 | 89.79 | 89.36 | 73.21 | 90.02 | 80.74 | 92.80 | 91.91 | 92.24 | 92.11 | 84.47 | 87.24 | 59.97 | 91.96 | 88.53 | 65.84 | 79.49 | 83.33 | 90.31 | **70.51** | 82.43 | 90.58 | 66.70 | 93.00 | 87.14 | 89.80 | 94.77 | 87.43 | **41.79** | **91.26** | 73.25 | **69.54** | 68.98 | 56.21 | **79.12** | 83.94 | 81.33 | 82.70 | **89.33** | 92.87 | 80.81 | 73.26 | 89.20 | 88.51 | **65.54** | **71.33** | 92.63 | 64.11 | 92.72 | **62.84** | 91.05 | 90.91 | 84.23 | 80.32 | 92.30 | 92.19 | 90.32 | 94.76 | 92.08 | 63.48 | 76.49 | 68.88 | 93.30 | 89.60 | 52.59 | **77.79** | 91.29 | 80.28 | **75.70** | 71.64 |
+ | XLM-RoBERTa (ours) | **83.97** | 41.59 | **81.56** | **81.30** | **85.68** | **94.34** | **84.10** | **91.80** | **91.23** | **78.72** | **92.64** | **86.73** | **93.87** | **94.50** | **94.57** | **93.18** | **90.19** | **90.28** | **74.79** | **94.06** | **90.46** | **81.76** | **84.33** | **85.62** | **92.55** | 67.26 | 86.61 | 91.22 | **72.69** | **94.53** | **89.83** | **92.24** | 93.78 | **89.27** | 41.43 | 78.39 | **89.15** | 36.60 | **70.51** | **82.77** | 58.14 | **89.41** | **89.99** | **88.25** | 86.82 | 92.81 | **86.14** | **94.73** | **93.25** | **92.44** | 49.39 | 66.02 | 93.60 | **69.22** | **93.51** | 61.86 | **92.84** | **93.19** | **89.47** | **86.24** | **92.95** | **93.46** | 91.79 | 94.16 | **93.93** | **72.74** | **81.77** | **74.49** | 93.17 | **92.15** | **62.92** | 75.65 | **93.41** | **84.89** | 56.85 | **77.07** |
+
+ ## Universal Dependencies [3]
+
+ Who wins most? XLM-RoBERTa: 24, WtPSplit: 17, Spacy (multilingual): 13
+
+ | | af | ar | be | bg | bn | ca | cs | cy | da | de | el | en | es | et | eu | fa | fi | fr | ga | gd | gl | he | hi | hu | hy | id | is | it | ja | jv | kk | ko | la | lt | lv | mr | nl | pl | pt | ro | ru | sk | sl | sq | sr | sv | ta | th | tr | uk | ur | vi | zh |
+ |:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:-----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
+ | Spacy (multilingual) | **98.47** | 80.38 | 80.27 | 93.62 | 51.85 | **98.95** | 89.68 | 98.89 | 94.96 | 88.02 | 94.16 | 92.20 | **98.70** | 93.77 | 95.79 | **99.83** | 92.88 | 96.33 | **96.67** | 63.04 | 92.37 | 94.37 | 0.32 | **98.45** | 11.39 | 98.01 | **95.41** | 92.49 | 0.37 | 98.03 | 96.21 | **99.80** | 0.09 | 93.86 | **98.52** | 92.13 | 92.86 | 97.02 | 94.91 | **98.05** | 84.31 | 90.26 | **98.23** | **100.00** | 97.84 | 94.91 | 66.67 | 1.95 | **97.63** | 94.16 | 0.37 | 96.40 | 0.40 |
+ | WtPSplit | 98.27 | **83.00** | 89.28 | **98.16** | **99.12** | 98.52 | 92.98 | **99.26** | 94.56 | 96.13 | **96.94** | 94.73 | 97.60 | 94.09 | 97.24 | 97.29 | 94.69 | **96.71** | 86.60 | 72.17 | **98.87** | 95.79 | 96.78 | 96.08 | **96.80** | **98.41** | 86.39 | 95.45 | **95.84** | **98.18** | 96.28 | 99.11 | 91.43 | **97.67** | 96.42 | 91.84 | 93.61 | 95.92 | **96.13** | 81.50 | 86.28 | 95.57 | 96.85 | 99.17 | **98.45** | **95.86** | **97.54** | 70.26 | 96.00 | 92.08 | 93.79 | 92.97 | **97.25** |
+ | XLM-RoBERTa (ours) | 96.81 | 78.99 | **91.60** | 97.89 | **99.12** | 95.99 | **96.05** | 97.17 | **96.62** | **96.29** | 94.33 | **94.76** | 95.73 | **96.20** | **97.37** | 97.49 | **96.34** | 95.70 | 89.78 | **84.20** | 95.72 | **95.95** | **97.51** | 96.24 | 95.62 | 97.22 | 92.93 | **96.88** | 94.23 | 96.29 | **98.40** | 97.46 | **96.35** | 95.82 | 96.91 | **95.92** | **96.27** | **97.24** | 95.83 | 94.63 | **91.59** | **95.88** | 96.43 | 98.36 | 96.83 | 94.95 | 95.93 | **89.26** | 96.52 | **94.59** | **96.20** | **97.31** | 95.12 |
+
+ ## Ersatz [4]
+
+ Who wins most? XLM-RoBERTa: 10, WtPSplit: 8, Spacy (multilingual): 4
+
+ | | ar | cs | de | en | es | et | fi | fr | gu | hi | ja | kk | km | lt | lv | pl | ps | ro | ru | ta | tr | zh |
+ |:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
+ | Spacy (multilingual) | **91.26** | 96.46 | 93.89 | 94.40 | 97.31 | **97.15** | 94.99 | 96.43 | 4.44 | 18.41 | 0.18 | 97.11 | 0.08 | 93.53 | **98.73** | 93.69 | **94.44** | 94.87 | 93.45 | 68.65 | 95.39 | 0.10 |
+ | WtPSplit | 89.45 | 93.41 | 95.93 | **97.16** | **98.74** | 95.84 | 97.10 | **97.61** | 90.62 | 94.87 | **82.14** | 95.94 | **82.89** | **96.74** | 97.22 | 95.16 | 86.99 | **97.55** | **97.82** | 94.76 | 93.53 | 89.02 |
+ | XLM-RoBERTa (ours) | 79.78 | **96.94** | **97.02** | 96.10 | 97.06 | 96.80 | **97.67** | 96.33 | **93.73** | **95.34** | 77.54 | **97.28** | 78.94 | 96.13 | 96.45 | **96.71** | 92.33 | 96.24 | 97.15 | **95.94** | **95.76** | **90.11** |
+
+ ## German–English code-switching [5]
+
+ | | de |
+ |:---------------------|:----------|
+ | Spacy (multilingual) | 79.55 |
+ | WtPSplit | 77.41 |
+ | XLM-RoBERTa (ours) | **85.78** |
+
+ [1] [Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation](https://aclanthology.org/2023.acl-long.398) (Minixhofer et al., ACL 2023)
+
+ [2] [Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation](https://aclanthology.org/2020.acl-main.148) (Zhang et al., ACL 2020)
+
+ [3] [Universal Dependencies](https://aclanthology.org/2021.cl-2.11) (de Marneffe et al., CL 2021)
+
+ [4] [A unified approach to sentence segmentation of punctuated text in many languages](https://aclanthology.org/2021.acl-long.309) (Wicks & Post, ACL-IJCNLP 2021)
+
+ [5] [The Denglisch Corpus of German-English Code-Switching](https://aclanthology.org/2023.sigtyp-1.5) (Osmelak & Wintner, SIGTYP 2023)

  ### Training hyperparameters