---
language:
- ace
- af
- als
- am
- an
- ang
- ar
- arz
- as
- ast
- av
- ay
- az
- azb
- ba
- bar
- bcl
- be
- bg
- bho
- bjn
- bn
- bo
- bpy
- br
- bs
- bxr
- ca
- cbk
- cdo
- ce
- ceb
- chr
- ckb
- co
- crh
- cs
- csb
- cv
- cy
- da
- de
- diq
- dsb
- dty
- dv
- egl
- el
- en
- eo
- es
- et
- eu
- ext
- fa
- fi
- fo
- fr
- frp
- fur
- fy
- ga
- gag
- gd
- gl
- glk
- gn
- gu
- gv
- ha
- hak
- he
- hi
- hif
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- ie
- ig
- ilo
- io
- is
- it
- ja
- jam
- jbo
- jv
- ka
- kaa
- kab
- kbd
- kk
- km
- kn
- ko
- koi
- kok
- krc
- ksh
- ku
- kv
- kw
- ky
- la
- lad
- lb
- lez
- lg
- li
- lij
- lmo
- ln
- lo
- lrc
- lt
- ltg
- lv
- lzh
- mai
- map
- mdf
- mg
- mhr
- mi
- min
- mk
- ml
- mn
- mr
- mrj
- ms
- mt
- mwl
- my
- myv
- mzn
- nan
- nap
- nb
- nci
- nds
- ne
- new
- nl
- nn
- nrm
- nso
- nv
- oc
- olo
- om
- or
- os
- pa
- pag
- pam
- pap
- pcd
- pdc
- pfl
- pl
- pnb
- ps
- pt
- qu
- rm
- ro
- roa
- ru
- rue
- rup
- rw
- sa
- sah
- sc
- scn
- sco
- sd
- sgs
- sh
- si
- sk
- sl
- sme
- sn
- so
- sq
- sr
- srn
- stq
- su
- sv
- sw
- szl
- ta
- tcy
- te
- tet
- tg
- th
- tk
- tl
- tn
- to
- tr
- tt
- tyv
- udm
- ug
- uk
- ur
- uz
- vec
- vep
- vi
- vls
- vo
- vro
- wa
- war
- wo
- wuu
- xh
- xmf
- yi
- yo
- zea
- zh
- multilingual
license: apache-2.0
tags:
- Language Identification
datasets:
- wili_2018
metrics:
- accuracy
- macro F1-score
language_bcp47:
- be-tarask
- map-bms
- nds-nl
- roa-tara
- zh-yue
---

# Canine for Language Identification

A Canine model fine-tuned on the WiLI-2018 dataset to identify the language of a given text.

### Preprocessing

- 10% of the training data, sampled with stratification by label, used as the validation set
- maximum sequence length: 512

### Hyperparameters

- epochs: 4
- learning rate: 3e-5
- batch size: 16
- gradient accumulation steps: 4
- optimizer: AdamW with default settings

### Test Results

- Accuracy: 94.92%
- Macro F1-score: 94.91%

### Inference

Build a dictionary that maps each label id to the English name of its language:

```python
import datasets
import pycountry


def int_to_lang():
    dataset = datasets.load_dataset('wili_2018')

    # names for labels that are not ISO 639-3 codes, taken from Wikipedia
    non_iso_languages = {'roa-tara': 'Tarantino',
                         'zh-yue': 'Cantonese',
                         'map-bms': 'Banyumasan',
                         'nds-nl': 'Dutch Low Saxon',
                         'be-tarask': 'Belarusian'}

    # map each dataset label id to a language name
    lab_to_lang = {}
    for i, lang in enumerate(dataset['train'].features['label'].names):
        full_lang = pycountry.languages.get(alpha_3=lang)
        if full_lang:
            lab_to_lang[i] = full_lang.name
        else:
            lab_to_lang[i] = non_iso_languages[lang]
    return lab_to_lang
```
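A minimal classification sketch using the dictionary above. The checkpoint id below is a placeholder (this card does not state the repository name), so substitute the actual model id; the maximum sequence length of 512 matches the fine-tuning setup described above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# placeholder checkpoint id -- replace with this model's actual repository name
checkpoint = "your-username/canine-language-identification"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

text = "Ceci est une phrase d'exemple."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# pick the highest-scoring label and translate it to a language name
label_id = logits.argmax(dim=-1).item()
print(int_to_lang()[label_id])  # e.g. "French" for the sample sentence above
```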
### Credit to

```
@article{clark-etal-2022-canine,
    title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
    author = "Clark, Jonathan H. and Garrette, Dan and Turc, Iulia and Wieting, John",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "10",
    year = "2022",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2022.tacl-1.5",
    doi = "10.1162/tacl_a_00448",
    pages = "73--91",
    abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}

@dataset{thoma_martin_2018_841984,
    author = {Thoma, Martin},
    title = {{WiLI-2018 - Wikipedia Language Identification database}},
    month = jan,
    year = 2018,
    publisher = {Zenodo},
    version = {1.0.0},
    doi = {10.5281/zenodo.841984},
    url = {https://doi.org/10.5281/zenodo.841984}
}
```