---
language:
- ace
- af
- als
- am
- an
- ang
- ar
- arz
- as
- ast
- av
- ay
- az
- azb
- ba
- bar
- bcl
- be
- bg
- bho
- bjn
- bn
- bo
- bpy
- br
- bs
- bxr
- ca
- cbk
- cdo
- ce
- ceb
- chr
- ckb
- co
- crh
- cs
- csb
- cv
- cy
- da
- de
- diq
- dsb
- dty
- dv
- egl
- el
- en
- eo
- es
- et
- eu
- ext
- fa
- fi
- fo
- fr
- frp
- fur
- fy
- ga
- gag
- gd
- gl
- glk
- gn
- gu
- gv
- ha
- hak
- he
- hi
- hif
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- ie
- ig
- ilo
- io
- is
- it
- ja
- jam
- jbo
- jv
- ka
- kaa
- kab
- kbd
- kk
- km
- kn
- ko
- koi
- kok
- krc
- ksh
- ku
- kv
- kw
- ky
- la
- lad
- lb
- lez
- lg
- li
- lij
- lmo
- ln
- lo
- lrc
- lt
- ltg
- lv
- lzh
- mai
- map
- mdf
- mg
- mhr
- mi
- min
- mk
- ml
- mn
- mr
- mrj
- ms
- mt
- mwl
- my
- myv
- mzn
- nan
- nap
- nb
- nci
- nds
- ne
- new
- nl
- nn
- nrm
- nso
- nv
- oc
- olo
- om
- or
- os
- pa
- pag
- pam
- pap
- pcd
- pdc
- pfl
- pl
- pnb
- ps
- pt
- qu
- rm
- ro
- roa
- ru
- rue
- rup
- rw
- sa
- sah
- sc
- scn
- sco
- sd
- sgs
- sh
- si
- sk
- sl
- sme
- sn
- so
- sq
- sr
- srn
- stq
- su
- sv
- sw
- szl
- ta
- tcy
- te
- tet
- tg
- th
- tk
- tl
- tn
- to
- tr
- tt
- tyv
- udm
- ug
- uk
- ur
- uz
- vec
- vep
- vi
- vls
- vo
- vro
- wa
- war
- wo
- wuu
- xh
- xmf
- yi
- yo
- zea
- zh
language_bcp47:
- be-tarask
- map-bms
- nds-nl
- roa-tara
- zh-yue
tags:
- Language Identification
license: "apache-2.0"
datasets:
- wili_2018
metrics:
- accuracy
- macro F1-score
---
# Canine for Language Identification
Canine model fine-tuned on the WiLI-2018 dataset to identify the language of a text.

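Canine is tokenization-free: its input ids are simply the Unicode code points of the raw text. A minimal sketch of this encoding, assuming the special code points described in the Canine paper (CLS/SEP in the private use area, 0 for padding); for real use, rely on the `transformers` Canine tokenizer rather than this illustration:

```python
# Assumed special ids, following the Canine paper's convention
# (private-use-area code points); verify against the actual tokenizer.
CLS = 0xE000  # sequence-start marker
SEP = 0xE001  # sequence-end marker
PAD = 0       # padding id

def encode(text: str, max_length: int = 512) -> list[int]:
    """Map raw text to Unicode code-point ids, truncated/padded to max_length."""
    ids = [CLS] + [ord(ch) for ch in text] + [SEP]
    ids = ids[:max_length]
    return ids + [PAD] * (max_length - len(ids))

ids = encode("¡Hola!")
```

Because every character maps directly to a code point, no vocabulary file is needed and any language's script is representable out of the box.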
### Preprocessing
- 10% of the training data stratified-sampled as a validation set
- maximum sequence length: 512

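The stratified 10% validation split holds out the same fraction of each language, so the validation set mirrors the class balance of the training set. In practice this is one call to scikit-learn's `train_test_split(..., stratify=labels)`; a self-contained sketch of the idea:

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=42):
    """Hold out val_fraction of the indices per label, preserving class balance."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label, indices in by_label.items():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# toy example: 2 languages, 20 samples each
labels = ["en"] * 20 + ["de"] * 20
train_idx, val_idx = stratified_split(labels)
```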
### Hyperparameters
- epochs: 4
- learning rate: 3e-5
- batch size: 16
- gradient accumulation: 4
- optimizer: AdamW with default settings

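With a batch size of 16 and 4 accumulation steps, gradients from 4 micro-batches are summed before each optimizer update, giving an effective batch size of 64. A framework-agnostic sketch of the pattern (`compute_grad` and `apply_step` are stand-ins for the real framework calls, not the actual training script):

```python
# Gradient accumulation: sum scaled micro-batch gradients, then step once.
BATCH_SIZE = 16
ACCUM_STEPS = 4  # effective batch size = 16 * 4 = 64

def train_epoch(num_samples, compute_grad, apply_step):
    """Run one epoch; returns the number of optimizer steps taken."""
    accumulated, optimizer_steps = 0.0, 0
    num_batches = num_samples // BATCH_SIZE
    for batch in range(num_batches):
        # scale each micro-batch gradient so the sum averages over 64 samples
        accumulated += compute_grad(batch) / ACCUM_STEPS
        if (batch + 1) % ACCUM_STEPS == 0:
            apply_step(accumulated)
            accumulated, optimizer_steps = 0.0, optimizer_steps + 1
    return optimizer_steps

steps = train_epoch(1024, compute_grad=lambda b: 1.0,
                    apply_step=lambda g: None)
```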
### Test Results
- Accuracy: 94.92%
- Macro F1-score: 94.91%

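Macro F1 averages the per-class F1 scores with equal weight, so every language counts the same regardless of how often it appears; scikit-learn's `f1_score(..., average="macro")` computes this directly, and a minimal sketch of the metric looks like:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

score = macro_f1(["en", "en", "de", "de"], ["en", "de", "de", "de"])
```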
### Credits
```
@article{clark-etal-2022-canine,
    title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
    author = "Clark, Jonathan H. and
      Garrette, Dan and
      Turc, Iulia and
      Wieting, John",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "10",
    year = "2022",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2022.tacl-1.5",
    doi = "10.1162/tacl_a_00448",
    pages = "73--91",
    abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}

@dataset{thoma_martin_2018_841984,
    author    = {Thoma, Martin},
    title     = {{WiLI-2018 - Wikipedia Language Identification database}},
    month     = jan,
    year      = 2018,
    publisher = {Zenodo},
    version   = {1.0.0},
    doi       = {10.5281/zenodo.841984},
    url       = {https://doi.org/10.5281/zenodo.841984}
}
```