cis-lmu/glotlid

Text Classification

language-identification

kargaranamir committed on Mar 27

Commit e773fb9 • 1 Parent(s): d27bf79

add v3.

Files changed (1)

README.md +8 -1

@@ -1852,6 +1852,10 @@ metrics:
 **GlotLID** is a Fasttext language identification (LID) model that supports more than **1600 languages**.
+**Latest:** GlotLID is now updated to **V3**. V3 supports **2114 labels** (three-letter ISO codes with script). For more details on the supported languages and performance, as well as significant changes from previous versions, please refer to [https://github.com/cisnlp/GlotLID/blob/main/languages-v3.md](https://github.com/cisnlp/GlotLID/blob/main/languages-v3.md).
 - **Demo:** [huggingface](https://huggingface.co/spaces/cis-lmu/glotlid-space)
 - **Repository:** [github](https://github.com/cisnlp/GlotLID)
 - **Paper:** [paper](https://arxiv.org/abs/2310.16248) (EMNLP 2023)
@@ -1867,6 +1871,7 @@ Here is how to use this model to detect the language of a given text:
 >>> import fasttext
 >>> from huggingface_hub import hf_hub_download
+# model.bin is the latest version always
 >>> model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
 >>> model = fasttext.load_model(model_path)
 >>> model.predict("Hello, world!")
@@ -1898,8 +1903,10 @@ To access a specific version, simply append the version number to the `filename`
 - For v1: `model_v1.bin` (introduced in the GlotLID [paper](https://arxiv.org/abs/2310.16248) and used in all experiments).
 - For v2: `model_v2.bin` (an edited version of v1, featuring more languages, and cleaned from noisy corpora based on the analysis of v1).
-`model.bin` always refers to the latest version (v2).
+- For v3: `model_v3.bin` (an edited version of v2, featuring more languages, excluding macro languages, further cleaned from noisy corpora and incorrect metadata labels based on the analysis of v2, supporting "zxx" and "und" series labels)
+`model.bin` always refers to the latest version (v3).
 ## References
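
Because `model.bin` now resolves to v3, code that depends on a particular label set may want to pin a version explicitly. The following is a minimal sketch of that idea, not part of the committed README. The helper names `glotlid_filename`, `split_label`, and `load_glotlid` are hypothetical, and the label shape `__label__eng_Latn` is an assumption based on fastText's default label prefix and the "three-letter ISO codes with script" description above.

```python
from typing import Optional, Tuple

# Versions named in the README after this commit.
KNOWN_VERSIONS = (1, 2, 3)


def glotlid_filename(version: Optional[int] = None) -> str:
    """Return the file name for a pinned version, or model.bin for the latest."""
    if version is None:
        return "model.bin"
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"unknown GlotLID version: {version!r}")
    return f"model_v{version}.bin"


def split_label(label: str) -> Tuple[str, str]:
    """Split a label such as '__label__eng_Latn' into (language, script).

    Assumption: labels use fastText's '__label__' prefix followed by
    '<ISO 639-3 code>_<script>'.
    """
    prefix = "__label__"
    code = label[len(prefix):] if label.startswith(prefix) else label
    lang, _, script = code.partition("_")
    return lang, script


def load_glotlid(version: Optional[int] = None):
    """Download and load the requested model (needs network access)."""
    # Imported lazily so the pure helpers above work without these packages.
    import fasttext
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id="cis-lmu/glotlid", filename=glotlid_filename(version))
    return fasttext.load_model(path)
```

With the packages installed, `load_glotlid(2).predict("Hello, world!")` would keep using the v2 label set, while `load_glotlid()` follows whatever `model.bin` currently points to.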