---
language:
- ace
- af
- als
- am
- an
- ang
- ar
- arz
- as
- ast
- av
- ay
- az
- azb
- ba
- bar
- bcl
- be
- bg
- bho
- bjn
- bn
- bo
- bpy
- br
- bs
- bxr
- ca
- cbk
- cdo
- ce
- ceb
- chr
- ckb
- co
- crh
- cs
- csb
- cv
- cy
- da
- de
- diq
- dsb
- dty
- dv
- egl
- el
- en
- eo
- es
- et
- eu
- ext
- fa
- fi
- fo
- fr
- frp
- fur
- fy
- ga
- gag
- gd
- gl
- glk
- gn
- gu
- gv
- ha
- hak
- he
- hi
- hif
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- ie
- ig
- ilo
- io
- is
- it
- ja
- jam
- jbo
- jv
- ka
- kaa
- kab
- kbd
- kk
- km
- kn
- ko
- koi
- kok
- krc
- ksh
- ku
- kv
- kw
- ky
- la
- lad
- lb
- lez
- lg
- li
- lij
- lmo
- ln
- lo
- lrc
- lt
- ltg
- lv
- lzh
- mai
- map
- mdf
- mg
- mhr
- mi
- min
- mk
- ml
- mn
- mr
- mrj
- ms
- mt
- mwl
- my
- myv
- mzn
- nan
- nap
- nb
- nci
- nds
- ne
- new
- nl
- nn
- nrm
- nso
- nv
- oc
- olo
- om
- or
- os
- pa
- pag
- pam
- pap
- pcd
- pdc
- pfl
- pl
- pnb
- ps
- pt
- qu
- rm
- ro
- roa
- ru
- rue
- rup
- rw
- sa
- sah
- sc
- scn
- sco
- sd
- sgs
- sh
- si
- sk
- sl
- sme
- sn
- so
- sq
- sr
- srn
- stq
- su
- sv
- sw
- szl
- ta
- tcy
- te
- tet
- tg
- th
- tk
- tl
- tn
- to
- tr
- tt
- tyv
- udm
- ug
- uk
- ur
- uz
- vec
- vep
- vi
- vls
- vo
- vro
- wa
- war
- wo
- wuu
- xh
- xmf
- yi
- yo
- zea
- zh
- multilingual
license: apache-2.0
tags:
- Language Identification
datasets:
- wili_2018
metrics:
- accuracy
- macro F1-score
language_bcp47:
- be-tarask
- map-bms
- nds-nl
- roa-tara
- zh-yue
---
# Canine for Language Identification
A Canine model fine-tuned on the WiLI-2018 dataset to identify the language of a given text.
### Preprocessing
- 10% of the training data, sampled with stratification by label, held out as a validation set
- maximum sequence length: 512 (see the preprocessing sketch below)
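The preprocessing code itself is not part of this card. A minimal sketch of the steps above, assuming the Hugging Face `datasets` library, the `google/canine-c` tokenizer, and that the WiLI-2018 text column is named `sentence`:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load WiLI-2018 and hold out a stratified 10% validation split
# (stratify_by_column requires the ClassLabel column 'label').
dataset = load_dataset("wili_2018")
splits = dataset["train"].train_test_split(
    test_size=0.1, stratify_by_column="label", seed=42
)
train_data, val_data = splits["train"], splits["test"]

# Character-level tokenization with the Canine tokenizer,
# truncated/padded to the maximum sequence length of 512.
tokenizer = AutoTokenizer.from_pretrained("google/canine-c")

def tokenize(batch):
    # 'sentence' is assumed to be the text column of wili_2018
    return tokenizer(
        batch["sentence"], truncation=True, padding="max_length", max_length=512
    )

train_data = train_data.map(tokenize, batched=True)
val_data = val_data.map(tokenize, batched=True)
```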
### Hyperparameters
- epochs: 4
- learning-rate: 3e-5
- batch size: 16
- gradient_accumulation: 4
- optimizer: AdamW with default settings (a training-configuration sketch follows below)
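The training script is likewise not included. As an illustration only, the hyperparameters above roughly correspond to the following `Trainer` configuration; using `Trainer` and the `google/canine-c` checkpoint are assumptions, and `train_data`/`val_data` come from the preprocessing sketch above:
```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Base checkpoint assumed to be google/canine-c; the number of labels is
# read from the dataset rather than hard-coded.
model = AutoModelForSequenceClassification.from_pretrained(
    "google/canine-c",
    num_labels=train_data.features["label"].num_classes,
)

# Hyperparameters from the list above; AdamW with default settings is the
# Trainer default optimizer. Per-epoch evaluation is an assumption.
training_args = TrainingArguments(
    output_dir="canine-c-lang-id",
    num_train_epochs=4,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()
```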
### Test Results
- Accuracy: 94.92%
- Macro F1-score: 94.91%
### Inference
Helper that builds a dictionary mapping a label id to the English name of the language:
```python
import datasets
import pycountry
def int_to_lang():
    dataset = datasets.load_dataset('wili_2018')
    # names for languages not in ISO 639-3, taken from Wikipedia
    non_iso_languages = {'roa-tara': 'Tarantino', 'zh-yue': 'Cantonese', 'map-bms': 'Banyumasan',
                         'nds-nl': 'Dutch Low Saxon', 'be-tarask': 'Belarusian'}
    # map dataset label ids to language names
    lab_to_lang = {}
    for i, lang in enumerate(dataset['train'].features['label'].names):
        full_lang = pycountry.languages.get(alpha_3=lang)
        if full_lang:
            lab_to_lang[i] = full_lang.name
        else:
            lab_to_lang[i] = non_iso_languages[lang]
    return lab_to_lang
```
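A possible end-to-end usage example; the model id `SebOchs/canine-c-lang-id` is assumed from this repository's name, so adjust it if the model is hosted elsewhere:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Model id assumed from this repository's name
model_id = "SebOchs/canine-c-lang-id"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

lab_to_lang = int_to_lang()  # label-id -> language-name dictionary from above

text = "Ein Satz, dessen Sprache erkannt werden soll."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_id = logits.argmax(dim=-1).item()
print(lab_to_lang[predicted_id])  # expected: "German"
```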
### Credits
```
@article{clark-etal-2022-canine,
title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
author = "Clark, Jonathan H. and
Garrette, Dan and
Turc, Iulia and
Wieting, John",
journal = "Transactions of the Association for Computational Linguistics",
volume = "10",
year = "2022",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2022.tacl-1.5",
doi = "10.1162/tacl_a_00448",
pages = "73--91",
abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}
@dataset{thoma_martin_2018_841984,
author = {Thoma, Martin},
title = {{WiLI-2018 - Wikipedia Language Identification
database}},
month = jan,
year = 2018,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.841984},
url = {https://doi.org/10.5281/zenodo.841984}
}
```