metadata
license: openrail
Nordic language identification
This repo contains models for the identification of language in text. It is based on Fasttext and designed with the Nordic languages in mind, including several Sámi languages. It comes in two flavours, a model that identifies between the 13 most common languages in the Nordic countries, and a model that extends that 159 languages in the world.
nordic-lid.bin
Trained on sentences from the GiellaT's Tranlation Memories and Wortschatz's corpora.
ISO-639-3 | Language | Precision | Recall | F1-Score | Support |
---|---|---|---|---|---|
dan | Danish | 0.9720 | 0.9838 | 0.9779 | 494 |
eng | English | 0.9980 | 0.9940 | 0.9960 | 502 |
fao | Faroese | 0.9920 | 0.9940 | 0.9930 | 499 |
fin | Finnish | 1.0000 | 1.0000 | 1.0000 | 500 |
isl | Icelandic | 0.9900 | 0.9920 | 0.9910 | 499 |
nno | Norwegian Nynorsk | 0.9920 | 0.9861 | 0.9890 | 503 |
nob | Norwegian Bokmål | 0.9840 | 0.9743 | 0.9791 | 505 |
sma | Southern Sami | 0.9800 | 0.9703 | 0.9751 | 101 |
sme | Northern Sami | 1.0000 | 0.9921 | 0.9960 | 504 |
smj | Lule Sami | 0.9920 | 0.9960 | 0.9940 | 498 |
smn | Inari Sami | 0.9950 | 1.0000 | 0.9975 | 199 |
sms | Skolt Sami | 0.9900 | 0.9950 | 0.9925 | 199 |
swe | Swedish | 0.9860 | 0.9920 | 0.9890 | 497 |
Accuracy | 0.9905 | 5500 | |||
Weighted avg | 0.9906 | 0.9905 | 0.9905 | 5500 | |
Macro avg | 0.9901 | 0.9900 | 0.9900 | 5500 |
nordic-lid_all.bin
Additionally trained on sentences from Taoteba.
ISO-639-3 | Language | Precision | Recall | F1-Score | Support |
---|---|---|---|---|---|
afr | Afrikaans | 0.9476 | 0.9476 | 0.9476 | 191 |
ara | Arabic | 0.9708 | 0.9472 | 0.9588 | 492 |
arq | Algerian Arabic | 0.9478 | 0.9237 | 0.9356 | 118 |
arz | Egyptian Arabic | 0.6316 | 0.7660 | 0.6923 | 47 |
asm | Assamese | 0.9828 | 0.9884 | 0.9856 | 173 |
avk | Kotava | 0.9791 | 0.9894 | 0.9842 | 189 |
aze | Azerbaijani | 0.9707 | 0.9789 | 0.9748 | 237 |
bel | Belarusian | 0.9892 | 0.9733 | 0.9812 | 375 |
ben | Bengali | 0.9872 | 0.9872 | 0.9872 | 235 |
ber | Berber | 0.8881 | 0.8388 | 0.8627 | 577 |
bos | Bosnian | 0.1310 | 0.3333 | 0.1880 | 33 |
bre | Breton | 0.9648 | 0.9786 | 0.9716 | 280 |
bua | Buryat | 0.9111 | 0.9111 | 0.9111 | 45 |
bul | Bulgarian | 0.9597 | 0.9662 | 0.9630 | 444 |
cat | Catalan | 0.9538 | 0.9475 | 0.9507 | 305 |
cbk | Chavacano | 0.9627 | 0.9773 | 0.9699 | 132 |
ceb | Cebuano | 0.8205 | 0.8533 | 0.8366 | 75 |
ces | Czech | 0.9606 | 0.9740 | 0.9672 | 500 |
chv | Chuvash | 0.9756 | 0.9877 | 0.9816 | 81 |
ckb | Central Kurdish (Soranî) | 0.9751 | 0.9915 | 0.9832 | 355 |
ckt | Chukchi | 0.9615 | 1.0000 | 0.9804 | 25 |
cmn | Mandarin Chinese | 0.9530 | 0.8743 | 0.9120 | 557 |
cor | Cornish | 0.9945 | 0.9628 | 0.9784 | 188 |
csb | Kashubian | 0.9574 | 1.0000 | 0.9783 | 45 |
cym | Welsh | 0.9375 | 0.9615 | 0.9494 | 78 |
dan | Danish | 0.9401 | 0.9363 | 0.9382 | 1005 |
deu | German | 0.9853 | 0.9781 | 0.9817 | 549 |
dsb | Lower Sorbian | 0.8704 | 0.8246 | 0.8468 | 57 |
dtp | Central Dusun | 0.8881 | 0.9549 | 0.9203 | 133 |
ell | Greek | 0.9979 | 0.9979 | 0.9979 | 475 |
eng | English | 0.9895 | 0.9839 | 0.9867 | 1055 |
epo | Esperanto | 0.9817 | 0.9926 | 0.9871 | 540 |
est | Estonian | 0.9545 | 0.9711 | 0.9628 | 173 |
eus | Basque | 0.9844 | 0.9583 | 0.9712 | 264 |
fao | Faroese | 0.9820 | 0.9859 | 0.9840 | 498 |
fin | Finnish | 0.9932 | 0.9780 | 0.9855 | 1045 |
fkv | Kven Finnish | 0.6154 | 0.8889 | 0.7273 | 18 |
fra | French | 0.9871 | 0.9908 | 0.9890 | 542 |
frr | North Frisian | 0.9640 | 0.9710 | 0.9675 | 138 |
fry | Frisian | 0.6774 | 0.9545 | 0.7925 | 22 |
gcf | Guadeloupean Creole French | 0.9619 | 1.0000 | 0.9806 | 101 |
gla | Scottish Gaelic | 0.9412 | 0.9796 | 0.9600 | 49 |
gle | Irish | 0.9635 | 0.9778 | 0.9706 | 135 |
glg | Galician | 0.9104 | 0.9369 | 0.9234 | 206 |
gos | Gronings | 0.9549 | 0.9588 | 0.9569 | 243 |
grc | Ancient Greek | 0.9828 | 0.9828 | 0.9828 | 58 |
grn | Guarani | 0.9684 | 0.9935 | 0.9808 | 154 |
guc | Wayuu | 0.9111 | 0.9762 | 0.9425 | 42 |
hau | Hausa | 0.9814 | 0.9953 | 0.9883 | 425 |
heb | Hebrew | 1.0000 | 1.0000 | 1.0000 | 536 |
hin | Hindi | 1.0000 | 0.9974 | 0.9987 | 391 |
hoc | Ho | 0.9429 | 0.9167 | 0.9296 | 36 |
hrv | Croatian | 0.7447 | 0.6119 | 0.6718 | 286 |
hrx | Hunsrik | 0.8727 | 0.9231 | 0.8972 | 52 |
hsb | Upper Sorbian | 0.8400 | 0.8289 | 0.8344 | 76 |
hun | Hungarian | 0.9853 | 0.9926 | 0.9889 | 539 |
hye | Armenian | 1.0000 | 1.0000 | 1.0000 | 225 |
ido | Ido | 0.9791 | 0.9563 | 0.9676 | 343 |
ile | Interlingue | 0.9352 | 0.9416 | 0.9384 | 291 |
ilo | Ilocano | 0.9917 | 0.9600 | 0.9756 | 125 |
ina | Interlingua | 0.9558 | 0.9621 | 0.9589 | 449 |
ind | Indonesian | 0.8526 | 0.8203 | 0.8361 | 423 |
isl | Icelandic | 0.9863 | 0.9897 | 0.9880 | 871 |
ita | Italian | 0.9817 | 0.9711 | 0.9764 | 553 |
jav | Javanese | 0.9600 | 0.9600 | 0.9600 | 50 |
jbo | Lojban | 1.0000 | 0.9926 | 0.9963 | 405 |
jpn | Japanese | 0.9851 | 1.0000 | 0.9925 | 530 |
kab | Kabyle | 0.8382 | 0.8959 | 0.8661 | 509 |
kat | Georgian | 1.0000 | 0.9885 | 0.9942 | 260 |
kaz | Kazakh | 0.9896 | 0.9845 | 0.9870 | 193 |
kha | Khasi | 0.9038 | 0.9400 | 0.9216 | 100 |
khm | Khmer | 1.0000 | 1.0000 | 1.0000 | 75 |
kmr | Northern Kurdish (Kurmancî) | 0.9851 | 0.9763 | 0.9807 | 338 |
knc | Central Kanuri | 0.9719 | 0.9886 | 0.9802 | 175 |
kor | Korean | 0.9972 | 0.9832 | 0.9902 | 358 |
kzj | Coastal Kadazan | 0.9615 | 0.9336 | 0.9474 | 241 |
lad | Ladino | 0.7846 | 0.7969 | 0.7907 | 64 |
lat | Latin | 0.9756 | 0.9639 | 0.9697 | 498 |
lfn | Lingua Franca Nova | 0.9745 | 0.9700 | 0.9723 | 434 |
lij | Ligurian | 0.9333 | 0.9333 | 0.9333 | 90 |
lin | Lingala | 0.9765 | 0.9765 | 0.9765 | 213 |
lit | Lithuanian | 0.9864 | 0.9922 | 0.9893 | 512 |
ltz | Luxembourgish | 0.9773 | 0.9348 | 0.9556 | 46 |
lvs | Latvian | 0.9597 | 0.9795 | 0.9695 | 146 |
lzh | Literary Chinese | 0.7692 | 0.8046 | 0.7865 | 87 |
mal | Malayalam | 1.0000 | 1.0000 | 1.0000 | 44 |
mar | Marathi | 0.9961 | 1.0000 | 0.9980 | 509 |
mhr | Meadow Mari | 0.9849 | 0.9751 | 0.9800 | 201 |
mkd | Macedonian | 0.9572 | 0.9480 | 0.9526 | 519 |
mon | Mongolian | 0.9708 | 0.9779 | 0.9744 | 136 |
mus | Muskogee (Creek) | 0.9000 | 0.9643 | 0.9310 | 28 |
mya | Burmese | 1.0000 | 0.9643 | 0.9818 | 28 |
nds | Low German (Low Saxon) | 0.9829 | 0.9710 | 0.9769 | 414 |
nld | Dutch | 0.9662 | 0.9772 | 0.9717 | 527 |
nnb | Nande | 0.9870 | 0.9870 | 0.9870 | 385 |
nno | Norwegian Nynorsk | 0.9585 | 0.9652 | 0.9619 | 575 |
nob | Norwegian Bokmål | 0.9247 | 0.9156 | 0.9201 | 912 |
nst | Naga (Tangshang) | 1.0000 | 1.0000 | 1.0000 | 39 |
nus | Nuer | 0.9903 | 0.9903 | 0.9903 | 103 |
oci | Occitan | 0.9672 | 0.9555 | 0.9613 | 247 |
orv | Old East Slavic | 0.9692 | 0.9692 | 0.9692 | 65 |
oss | Ossetian | 0.9818 | 0.9926 | 0.9872 | 271 |
ota | Ottoman Turkish | 0.9204 | 0.9905 | 0.9541 | 105 |
pam | Kapampangan | 0.9865 | 0.9865 | 0.9865 | 74 |
pcd | Picard | 0.9552 | 0.9846 | 0.9697 | 65 |
pes | Persian | 0.9890 | 0.9890 | 0.9890 | 455 |
pms | Piedmontese | 0.8780 | 0.9000 | 0.8889 | 40 |
pol | Polish | 0.9848 | 0.9829 | 0.9838 | 526 |
por | Portuguese | 0.9687 | 0.9616 | 0.9651 | 547 |
prg | Old Prussian | 0.9800 | 0.9800 | 0.9800 | 50 |
rhg | Rohingya | 0.9780 | 0.9944 | 0.9861 | 179 |
rom | Romani | 0.9302 | 0.8889 | 0.9091 | 45 |
ron | Romanian | 0.9826 | 0.9912 | 0.9869 | 457 |
run | Kirundi | 0.9914 | 0.9665 | 0.9788 | 239 |
rus | Russian | 0.9634 | 0.9814 | 0.9723 | 537 |
sah | Yakut | 1.0000 | 0.9600 | 0.9796 | 50 |
sat | Santali | 0.9942 | 0.9942 | 0.9942 | 171 |
sdh | Southern Kurdish | 0.9423 | 0.9074 | 0.9245 | 54 |
shi | Tashelhit | 0.9706 | 0.8980 | 0.9329 | 147 |
slk | Slovak | 0.9333 | 0.9380 | 0.9356 | 403 |
slv | Slovenian | 0.7018 | 0.8889 | 0.7843 | 45 |
sma | Southern Sami | 0.9600 | 0.9600 | 0.9600 | 100 |
sme | Northern Sami | 0.9980 | 0.9901 | 0.9940 | 504 |
smj | Lule Sami | 0.9820 | 0.9959 | 0.9889 | 493 |
smn | Inari Sami | 0.9950 | 0.9900 | 0.9925 | 201 |
sms | Skolt Sami | 0.9750 | 0.9848 | 0.9799 | 198 |
spa | Spanish | 0.9760 | 0.9601 | 0.9680 | 551 |
sqi | Albanian | 0.9762 | 0.9762 | 0.9762 | 126 |
srp | Serbian | 0.8367 | 0.8216 | 0.8291 | 499 |
swc | Congo Swahili | 0.8727 | 0.8458 | 0.8591 | 454 |
swe | Swedish | 0.9819 | 0.9819 | 0.9819 | 994 |
swg | Swabian | 0.9694 | 0.9406 | 0.9548 | 101 |
swh | Swahili | 0.6798 | 0.7225 | 0.7005 | 191 |
tat | Tatar | 0.9791 | 0.9843 | 0.9817 | 381 |
tgl | Tagalog | 0.9757 | 0.9710 | 0.9734 | 414 |
tha | Thai | 1.0000 | 0.9910 | 0.9955 | 222 |
thv | Tahaggart Tamahaq | 0.6552 | 0.7037 | 0.6786 | 27 |
tig | Tigre | 1.0000 | 1.0000 | 1.0000 | 181 |
tlh | Klingon | 1.0000 | 0.9932 | 0.9966 | 442 |
tok | Toki Pona | 1.0000 | 1.0000 | 1.0000 | 495 |
tpw | Old Tupi | 0.8929 | 0.9259 | 0.9091 | 27 |
tuk | Turkmen | 0.9779 | 0.9603 | 0.9690 | 277 |
tur | Turkish | 0.9908 | 0.9541 | 0.9721 | 567 |
uig | Uyghur | 0.9966 | 0.9900 | 0.9933 | 300 |
ukr | Ukrainian | 0.9831 | 0.9831 | 0.9831 | 534 |
urd | Urdu | 1.0000 | 0.9914 | 0.9957 | 116 |
uzb | Uzbek | 0.8200 | 0.9318 | 0.8723 | 44 |
vie | Vietnamese | 0.9977 | 0.9953 | 0.9965 | 427 |
vol | Volapük | 0.9908 | 0.9908 | 0.9908 | 218 |
war | Waray | 0.9307 | 0.9691 | 0.9495 | 97 |
wuu | Shanghainese | 0.8318 | 0.9036 | 0.8662 | 197 |
xal | Kalmyk | 0.9302 | 0.9524 | 0.9412 | 42 |
xmf | Mingrelian | 0.7419 | 0.8519 | 0.7931 | 27 |
yid | Yiddish | 0.9971 | 1.0000 | 0.9986 | 348 |
yue | Cantonese | 0.9004 | 0.9711 | 0.9344 | 242 |
zgh | Standard Moroccan Tamazight | 0.9873 | 0.9873 | 0.9873 | 158 |
zlm | Malay (Vernacular) | 0.8488 | 0.8902 | 0.8690 | 82 |
zsm | Malay | 0.7606 | 0.7883 | 0.7742 | 274 |
zza | Zaza | 0.9294 | 0.9634 | 0.9461 | 82 |
Accuracy | 0.9591 | 44049 | |||
Weighted avg | 0.9604 | 0.9591 | 0.9595 | 44049 | |
Macro avg | 0.9371 | 0.9474 | 0.9413 | 44049 |