---
license: apache-2.0
tags:
- text-classification
- language-identification
library_name: fasttext
datasets:
- cis-lmu/GlotSparse
- cis-lmu/GlotStoryBook
metrics:
- f1
---

# GlotLID

[![GlotLID](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/cis-lmu/glotlid-space)

## Description

**GlotLID** is a fastText language identification (LID) model that supports more than **1600 languages**.

- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/cis-lmu/glotlid-space)
- **Repository:** [GitHub](https://github.com/cisnlp/GlotLID)
- **Paper:** [arXiv](https://arxiv.org/abs/2310.16248)
- **Point of Contact:** amir@cis.lmu.de

### How to use

Here is how to use this model to detect the language of a given text:

```python
>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> # Download model.bin from the Hub (cached locally) and load it with fastText
>>> model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> # Returns the predicted label(s) and their probabilities
>>> model.predict("Hello, world!")
```
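
fastText's `predict` returns a pair of labels and probabilities. The sketch below shows one way to turn that pair into readable tuples. It assumes GlotLID labels follow the `__label__<language>_<script>` pattern (for example `__label__eng_Latn`), which is not stated above, and the helper names `parse_predictions` and `identify` are hypothetical. fastText also rejects input that contains newline characters, so the wrapper collapses whitespace before predicting.

```python
from typing import List, Sequence, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_predictions(
    labels: Sequence[str], probs: Sequence[float]
) -> List[Tuple[str, str, float]]:
    """Convert fastText output into (language, script, probability) tuples.

    Assumption: labels look like "__label__eng_Latn". Adjust the parsing
    if the labels in your model file differ.
    """
    results = []
    for label, prob in zip(labels, probs):
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        lang, _, script = code.partition("_")
        results.append((lang, script, float(prob)))
    return results


def identify(model, text: str, k: int = 3) -> List[Tuple[str, str, float]]:
    """Return the top-k predictions for `text` using a loaded fastText model."""
    # fastText predicts one line at a time and raises on embedded newlines,
    # so collapse all whitespace runs into single spaces first.
    clean = " ".join(text.split())
    labels, probs = model.predict(clean, k=k)
    return parse_predictions(labels, probs)


# Usage, with `model` loaded as shown above:
# identify(model, "Hello,\nworld!", k=3)
```
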

If you prefer not to use `huggingface_hub`, download the model directly:

```bash
wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model.bin
```

Then load the downloaded file from its local path:

```python
>>> import fasttext

>>> # Replace the path with the location of the downloaded file
>>> model = fasttext.load_model("/path/to/model.bin")
>>> model.predict("Hello, world!")
```
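
For environments without `wget`, the same download can be sketched with the Python standard library. The helper names `model_url` and `download_model` are hypothetical, and the URL pattern is taken from the `wget` command above. Treat this as a minimal sketch rather than a robust downloader: it does not resume or verify partial downloads.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL copied from the wget command above.
BASE_URL = "https://huggingface.co/cis-lmu/glotlid/resolve/main"


def model_url(filename: str = "model.bin") -> str:
    """Build the direct-download URL for a file in the model repository."""
    return f"{BASE_URL}/{filename}"


def download_model(dest: str = "model.bin", filename: str = "model.bin") -> Path:
    """Download `filename` to `dest` unless a local copy already exists."""
    path = Path(dest)
    if not path.exists():
        # Only hit the network when there is no local file yet.
        urlretrieve(model_url(filename), str(path))
    return path
```

The returned path can be converted with `str(...)` and passed to `fasttext.load_model`, as in the snippet above.
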

## License

The model is distributed under the Apache License, Version 2.0.

## Version

Previous versions of GlotLID remain available in our repository.

To access a specific version, append the version number to the `filename`:

- For v1: `model_v1.bin` (introduced in the GlotLID [paper](https://arxiv.org/abs/2310.16248) and used in all of its experiments).
- For v2: `model_v2.bin` (a revised version of v1 that covers more languages, with noisy corpora removed based on the analysis of v1).

`model.bin` always refers to the latest version (v2).
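
The naming rule above can be wrapped in a small helper. This is a sketch, not part of the GlotLID repository: `glotlid_filename` and `load_glotlid` are hypothetical names, and the only assumption is the `model_v<N>.bin` pattern described in the list above.

```python
from typing import Optional

REPO_ID = "cis-lmu/glotlid"


def glotlid_filename(version: Optional[int] = None) -> str:
    """Return the checkpoint filename; None selects the latest model."""
    if version is None:
        return "model.bin"
    return f"model_v{version}.bin"


def load_glotlid(version: Optional[int] = None):
    """Download the requested checkpoint from the Hub and load it."""
    # Imported lazily so the filename helper works without these packages.
    import fasttext
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=glotlid_filename(version))
    return fasttext.load_model(path)
```

For example, `load_glotlid(1)` would fetch the v1 checkpoint used in the paper, while `load_glotlid()` follows whatever `model.bin` currently points to.
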

## References

If you use this model, please cite the following paper:

```
@inproceedings{
kargaran2023glotlid,
title={{GlotLID}: Language Identification for Low-Resource Languages},
author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
year={2023},
url={https://openreview.net/forum?id=dl4e3EBz5j}
}
```