GlotLID

Description

GlotLID is a fastText language identification (LID) model that supports more than 2000 labels.

Latest: GlotLID has been updated to v3, which supports 2102 labels (three-letter ISO codes with script). For details on the supported languages and performance, and on significant changes from previous versions, see https://github.com/cisnlp/GlotLID/blob/main/languages-v3.md.

How to use

Here is how to use this model to detect the language of a given text:

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

# model.bin always points to the latest version
>>> model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")
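
The predict call returns a pair: a tuple of labels and an array of probabilities. The following is a minimal sketch of unpacking that result. It assumes the labels follow the fastText convention of a `__label__` prefix followed by a three-letter ISO code and a script joined by an underscore (for example `__label__eng_Latn`). The helper names `parse_label` and `top_languages` are hypothetical and not part of GlotLID or fastText.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_label(label: str) -> Tuple[str, str]:
    """Split a label such as '__label__eng_Latn' into ('eng', 'Latn').

    Assumes the '<iso code>_<script>' layout described in this card.
    """
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    lang, _, script = label.partition("_")
    return lang, script


def top_languages(model, text: str, k: int = 3) -> List[Tuple[str, str, float]]:
    """Return up to k (language, script, probability) candidates for text."""
    # fastText predicts one line at a time, so newlines are replaced first.
    single_line = text.replace("\n", " ")
    labels, probs = model.predict(single_line, k=k)
    return [parse_label(l) + (float(p),) for l, p in zip(labels, probs)]
```

With a loaded model, `top_languages(model, "Hello, world!", k=3)` gives the three most likely language and script pairs with their probabilities.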

If you prefer not to use huggingface_hub, download the model directly:

$ wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model.bin

>>> import fasttext
>>> # replace /path/to/model.bin with the location of the downloaded file
>>> model = fasttext.load_model("/path/to/model.bin")
>>> model.predict("Hello, world!")

License

The model is distributed under the Apache License, Version 2.0.

Version

Previous versions of GlotLID remain available in our repository.

To access a specific version, append the version number to the filename:

  • For v1: model_v1.bin (introduced in the GlotLID paper and used in all experiments).
  • For v2: model_v2.bin (a revision of v1 that adds more languages and removes noisy corpora identified in the analysis of v1).
  • For v3: model_v3.bin (a revision of v2 that adds more languages, excludes macrolanguages, removes further noisy corpora and incorrect metadata labels identified in the analysis of v2, and supports the "zxx" and "und" series labels).

model.bin always refers to the latest version (v3).
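
The naming scheme above can be sketched as a small helper that maps a version number to its filename and then downloads that file. This is only an illustration: `model_filename` and `load_glotlid` are hypothetical names, and the set of accepted versions (1 to 3) reflects the versions listed in this card.

```python
from typing import Optional

REPO_ID = "cis-lmu/glotlid"  # repository used elsewhere in this card
KNOWN_VERSIONS = (1, 2, 3)   # versions listed in this card


def model_filename(version: Optional[int] = None) -> str:
    """Return the filename for a GlotLID version; None means the latest."""
    if version is None:
        return "model.bin"
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"unknown GlotLID version: {version!r}")
    return f"model_v{version}.bin"


def load_glotlid(version: Optional[int] = None):
    """Download the requested version from the Hub and load it."""
    # Imported here so model_filename() works without these packages.
    import fasttext
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=model_filename(version))
    return fasttext.load_model(path)
```

Pinning a version, for example `load_glotlid(2)`, keeps results reproducible when model.bin later moves to a newer release.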

References

If you use this model, please cite the following paper:

@inproceedings{
  kargaran2023glotlid,
  title={{GlotLID}: Language Identification for Low-Resource Languages},
  author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=dl4e3EBz5j}
}