---
license: mit
language:
- fr
- zh
- fa
- ky
- ru
- lt
- uz
- en
- pt
- bg
- th
- pl
- ur
- sw
- tr
- es
- ar
- it
- hi
- de
- el
- nl
- vi
- ja
pipeline_tag: text-classification
tags:
- pytorch
- mt0
---

# language identification mt0

This model is a fine-tuned version of the encoder from [bigscience/mt0-small](https://huggingface.co/bigscience/mt0-small), trained on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset as well as some private data.

## Limitations

Currently, it supports the following 24 languages:

Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Kyrgyz (ky), Uzbek (uz), Persian (fa), Lithuanian (lt), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh).

## Inference

First, install the following library:

```bash
pip install bert-for-sequence-classification
```

```python
from bert_clf import EncoderCLF
import torch

# Load the fine-tuned classifier from the Hugging Face Hub
model = EncoderCLF("whitefoxredhell/language_identification")

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model = model.eval()

text = "London is the capital of Great Britain"

# Most likely language code
model.predict(text)
# 'en'

# Probability for every supported language
model.predict_proba(text)
# {
#     'fr': 3.022890814463608e-05,
#     'zh': 2.328997834410984e-05,
#     'fa': 5.344639430404641e-05,
#     'ky': 3.5296812711749226e-05,
#     'ru': 2.3277720174519345e-05,
#     'lt': 0.00021786204888485372,
#     'uz': 3.461417873040773e-05,
#     'en': 0.999232292175293,
#     'pt': 1.2590448022820055e-05,
#     'bg': 1.5775613064761274e-05,
#     'th': 9.429674719285686e-06,
#     'pl': 2.4624938305350952e-05,
#     'ur': 3.982995986007154e-05,
#     'sw': 4.8921840061666444e-05,
#     'tr': 2.6844283638638444e-05,
#     'es': 2.325668538105674e-05,
#     'ar': 2.4103366740746424e-05,
#     'it': 1.8611381165101193e-05,
#     'hi': 1.4575023669749498e-05,
#     'de': 2.210299498983659e-05,
#     'el': 1.3880739061278291e-05,
#     'nl': 2.767637124634348e-05,
#     'vi': 1.3878144272894133e-05,
#     'ja': 1.3629408385895658e-05
# }
```
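
Since `predict_proba` returns a plain dictionary mapping language codes to probabilities, the result can be post-processed without touching the model. The sketch below shows one possible way to rank the candidates and to fall back to an "undetermined" label when the top score is low. The helper names `top_k_languages` and `predict_or_unknown`, the `0.5` threshold, and the `"und"` fallback code are illustrative assumptions, not part of `bert_clf`; the sample dictionary is a shortened copy of the output shown above.

```python
from typing import Dict, List, Tuple


def top_k_languages(probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (language code, probability) pairs."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def predict_or_unknown(
    probs: Dict[str, float],
    threshold: float = 0.5,  # hypothetical cutoff, tune for your data
    unknown: str = "und",    # hypothetical fallback label
) -> str:
    """Return the best language code, or `unknown` if its probability is below threshold."""
    best_code, best_prob = max(probs.items(), key=lambda item: item[1])
    return best_code if best_prob >= threshold else unknown


# A few entries copied from the example predict_proba output
sample_probs = {
    "en": 0.999232292175293,
    "lt": 0.00021786204888485372,
    "fa": 5.344639430404641e-05,
    "fr": 3.022890814463608e-05,
}

print(top_k_languages(sample_probs, k=2))
print(predict_or_unknown(sample_probs))
```

A threshold like this may be useful for very short or mixed-language inputs, or for text in a language outside the supported set, where the model would otherwise still return one of its 24 labels.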