---
language:
- ace
- af
- als
- am
- an
- ang
- ar
- arz
- as
- ast
- av
- ay
- az
- azb
- ba
- bar
- bcl
- be
- bg
- bho
- bjn
- bn
- bo
- bpy
- br
- bs
- bxr
- ca
- cbk
- cdo
- ce
- ceb
- chr
- ckb
- co
- crh
- cs
- csb
- cv
- cy
- da
- de
- diq
- dsb
- dty
- dv
- egl
- el
- en
- eo
- es
- et
- eu
- ext
- fa
- fi
- fo
- fr
- frp
- fur
- fy
- ga
- gag
- gd
- gl
- glk
- gn
- gu
- gv
- ha
- hak
- he
- hi
- hif
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- ie
- ig
- ilo
- io
- is
- it
- ja
- jam
- jbo
- jv
- ka
- kaa
- kab
- kbd
- kk
- km
- kn
- ko
- koi
- kok
- krc
- ksh
- ku
- kv
- kw
- ky
- la
- lad
- lb
- lez
- lg
- li
- lij
- lmo
- ln
- lo
- lrc
- lt
- ltg
- lv
- lzh
- mai
- map
- mdf
- mg
- mhr
- mi
- min
- mk
- ml
- mn
- mr
- mrj
- ms
- mt
- mwl
- my
- myv
- mzn
- nan
- nap
- nb
- nci
- nds
- ne
- new
- nl
- nn
- nrm
- nso
- nv
- oc
- olo
- om
- or
- os
- pa
- pag
- pam
- pap
- pcd
- pdc
- pfl
- pl
- pnb
- ps
- pt
- qu
- rm
- ro
- roa
- ru
- rue
- rup
- rw
- sa
- sah
- sc
- scn
- sco
- sd
- sgs
- sh
- si
- sk
- sl
- sme
- sn
- so
- sq
- sr
- srn
- stq
- su
- sv
- sw
- szl
- ta
- tcy
- te
- tet
- tg
- th
- tk
- tl
- tn
- to
- tr
- tt
- tyv
- udm
- ug
- uk
- ur
- uz
- vec
- vep
- vi
- vls
- vo
- vro
- wa
- war
- wo
- wuu
- xh
- xmf
- yi
- yo
- zea
- zh
language_bcp47:
- be-tarask
- map-bms
- nds-nl
- roa-tara
- zh-yue
tags:
- Language Identification
license: "apache-2.0"
datasets:
- wili_2018
metrics:
- accuracy
- macro F1-score
---
# Canine for Language Identification
Canine model fine-tuned on the WiLI-2018 dataset to identify the language of a text.

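Canine is tokenization-free: its input ids are simply the Unicode code points of the raw text. A minimal sketch of this encoding, assuming the special code points described in the Canine paper (CLS/SEP in the private use area, 0 for padding); for real use, rely on the `transformers` Canine tokenizer rather than this illustration:

```python
# Assumed special ids, following the Canine paper's convention
# (private-use-area code points); verify against the actual tokenizer.
CLS = 0xE000  # sequence-start marker
SEP = 0xE001  # sequence-end marker
PAD = 0       # padding id

def encode(text: str, max_length: int = 512) -> list[int]:
    """Map raw text to Unicode code-point ids, truncated/padded to max_length."""
    ids = [CLS] + [ord(ch) for ch in text] + [SEP]
    ids = ids[:max_length]
    return ids + [PAD] * (max_length - len(ids))

ids = encode("¡Hola!")
```

Because every character maps directly to a code point, no vocabulary file is needed and any language's script is representable out of the box.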
### Preprocessing
- 10% of the training data stratified-sampled as a validation set
- maximum sequence length: 512

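The stratified 10% validation split holds out the same fraction of each language, so the validation set mirrors the class balance of the training set. In practice this is one call to scikit-learn's `train_test_split(..., stratify=labels)`; a self-contained sketch of the idea:

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=42):
    """Hold out val_fraction of the indices per label, preserving class balance."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label, indices in by_label.items():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# toy example: 2 languages, 20 samples each
labels = ["en"] * 20 + ["de"] * 20
train_idx, val_idx = stratified_split(labels)
```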
### Hyperparameters
- epochs: 4
- learning rate: 3e-5
- batch size: 16
- gradient accumulation: 4
- optimizer: AdamW with default settings

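With a batch size of 16 and 4 accumulation steps, gradients from 4 micro-batches are summed before each optimizer update, giving an effective batch size of 64. A framework-agnostic sketch of the pattern (`compute_grad` and `apply_step` are stand-ins for the real framework calls, not the actual training script):

```python
# Gradient accumulation: sum scaled micro-batch gradients, then step once.
BATCH_SIZE = 16
ACCUM_STEPS = 4  # effective batch size = 16 * 4 = 64

def train_epoch(num_samples, compute_grad, apply_step):
    """Run one epoch; returns the number of optimizer steps taken."""
    accumulated, optimizer_steps = 0.0, 0
    num_batches = num_samples // BATCH_SIZE
    for batch in range(num_batches):
        # scale each micro-batch gradient so the sum averages over 64 samples
        accumulated += compute_grad(batch) / ACCUM_STEPS
        if (batch + 1) % ACCUM_STEPS == 0:
            apply_step(accumulated)
            accumulated, optimizer_steps = 0.0, optimizer_steps + 1
    return optimizer_steps

steps = train_epoch(1024, compute_grad=lambda b: 1.0,
                    apply_step=lambda g: None)
```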
### Test Results
- Accuracy: 94.92%
- Macro F1-score: 94.91%

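Macro F1 averages the per-class F1 scores with equal weight, so every language counts the same regardless of how often it appears; scikit-learn's `f1_score(..., average="macro")` computes this directly, and a minimal sketch of the metric looks like:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

score = macro_f1(["en", "en", "de", "de"], ["en", "de", "de", "de"])
```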
### Credits
```
@article{clark-etal-2022-canine,
    title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
    author = "Clark, Jonathan H. and
      Garrette, Dan and
      Turc, Iulia and
      Wieting, John",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "10",
    year = "2022",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2022.tacl-1.5",
    doi = "10.1162/tacl_a_00448",
    pages = "73--91",
    abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}

@dataset{thoma_martin_2018_841984,
    author    = {Thoma, Martin},
    title     = {{WiLI-2018 - Wikipedia Language Identification database}},
    month     = jan,
    year      = 2018,
    publisher = {Zenodo},
    version   = {1.0.0},
    doi       = {10.5281/zenodo.841984},
    url       = {https://doi.org/10.5281/zenodo.841984}
}
```