license: openrail

Nordic language identification

This repo contains models for identifying the language of a text. They are based on fastText and designed with the Nordic languages in mind, including several Sámi languages. There are two flavours: a model that distinguishes between the 13 most common languages in the Nordic countries, and a model that extends that set to 159 languages from around the world.
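
Since the models are fastText binaries, they can presumably be queried with the `fasttext` Python package. The following is a minimal sketch, not the authors' documented usage: it assumes the `.bin` file has been downloaded locally and that the labels follow fastText's default `__label__<code>` convention with ISO 639-3 codes, as the tables in this README suggest. The helper names `load_model` and `identify` are hypothetical.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix (assumed for these models)


def load_model(path: str = "nordic-lid.bin"):
    """Load a fastText model from a local .bin file (the path is an assumption)."""
    import fasttext  # imported lazily so identify() can be used without it

    return fasttext.load_model(path)


def identify(model, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k (language code, probability) pairs for a single text."""
    # fastText's predict() processes one line at a time and rejects newlines,
    # so collapse all whitespace into single spaces first.
    line = " ".join(text.split())
    labels, probs = model.predict(line, k=k)
    results = []
    for label, prob in zip(labels, probs):
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        results.append((code, float(prob)))
    return results


# Example (requires the fasttext package and a downloaded model file):
#   model = load_model("nordic-lid.bin")
#   print(identify(model, "Dette er en setning på norsk."))
```

Asking for the top few candidates rather than only the best one can help with closely related languages, where the second-ranked label is often a plausible alternative.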

nordic-lid.bin

Trained on sentences from GiellaLT's Translation Memories and Wortschatz's corpora.

| ISO-639-3 | Language | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|
| dan | Danish | 0.9720 | 0.9838 | 0.9779 | 494 |
| eng | English | 0.9980 | 0.9940 | 0.9960 | 502 |
| fao | Faroese | 0.9920 | 0.9940 | 0.9930 | 499 |
| fin | Finnish | 1.0000 | 1.0000 | 1.0000 | 500 |
| isl | Icelandic | 0.9900 | 0.9920 | 0.9910 | 499 |
| nno | Norwegian Nynorsk | 0.9920 | 0.9861 | 0.9890 | 503 |
| nob | Norwegian Bokmål | 0.9840 | 0.9743 | 0.9791 | 505 |
| sma | Southern Sami | 0.9800 | 0.9703 | 0.9751 | 101 |
| sme | Northern Sami | 1.0000 | 0.9921 | 0.9960 | 504 |
| smj | Lule Sami | 0.9920 | 0.9960 | 0.9940 | 498 |
| smn | Inari Sami | 0.9950 | 1.0000 | 0.9975 | 199 |
| sms | Skolt Sami | 0.9900 | 0.9950 | 0.9925 | 199 |
| swe | Swedish | 0.9860 | 0.9920 | 0.9890 | 497 |
| | Accuracy | | | 0.9905 | 5500 |
| | Weighted avg | 0.9906 | 0.9905 | 0.9905 | 5500 |
| | Macro avg | 0.9901 | 0.9900 | 0.9900 | 5500 |
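
The README does not say how these figures were computed. The sketch below assumes the standard per-class definitions (the same ones used by scikit-learn's `classification_report`) and re-derives one F1 score and the two precision averages from the rows above. Because the inputs are already rounded to four decimals, the results only match the table after rounding.

```python
# (code, precision, recall, support) copied from the nordic-lid.bin table
rows = [
    ("dan", 0.9720, 0.9838, 494),
    ("eng", 0.9980, 0.9940, 502),
    ("fao", 0.9920, 0.9940, 499),
    ("fin", 1.0000, 1.0000, 500),
    ("isl", 0.9900, 0.9920, 499),
    ("nno", 0.9920, 0.9861, 503),
    ("nob", 0.9840, 0.9743, 505),
    ("sma", 0.9800, 0.9703, 101),
    ("sme", 1.0000, 0.9921, 504),
    ("smj", 0.9920, 0.9960, 498),
    ("smn", 0.9950, 1.0000, 199),
    ("sms", 0.9900, 0.9950, 199),
    ("swe", 0.9860, 0.9920, 497),
]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total_support = sum(n for _, _, _, n in rows)

# Macro average: every language counts equally.
macro_precision = sum(p for _, p, _, _ in rows) / len(rows)

# Weighted average: every language counts in proportion to its support.
weighted_precision = sum(p * n for _, p, _, n in rows) / total_support

print(f"F1 (dan):           {f1(0.9720, 0.9838):.4f}")
print(f"Macro precision:    {macro_precision:.4f}")
print(f"Weighted precision: {weighted_precision:.4f}")
```

The gap between the two averages shows how much small test sets such as Southern Sami (101 sentences) pull the macro figure relative to the weighted one.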

nordic-lid_all.bin

Additionally trained on sentences from Tatoeba.

| ISO-639-3 | Language | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|
| afr | Afrikaans | 0.9476 | 0.9476 | 0.9476 | 191 |
| ara | Arabic | 0.9708 | 0.9472 | 0.9588 | 492 |
| arq | Algerian Arabic | 0.9478 | 0.9237 | 0.9356 | 118 |
| arz | Egyptian Arabic | 0.6316 | 0.7660 | 0.6923 | 47 |
| asm | Assamese | 0.9828 | 0.9884 | 0.9856 | 173 |
| avk | Kotava | 0.9791 | 0.9894 | 0.9842 | 189 |
| aze | Azerbaijani | 0.9707 | 0.9789 | 0.9748 | 237 |
| bel | Belarusian | 0.9892 | 0.9733 | 0.9812 | 375 |
| ben | Bengali | 0.9872 | 0.9872 | 0.9872 | 235 |
| ber | Berber | 0.8881 | 0.8388 | 0.8627 | 577 |
| bos | Bosnian | 0.1310 | 0.3333 | 0.1880 | 33 |
| bre | Breton | 0.9648 | 0.9786 | 0.9716 | 280 |
| bua | Buryat | 0.9111 | 0.9111 | 0.9111 | 45 |
| bul | Bulgarian | 0.9597 | 0.9662 | 0.9630 | 444 |
| cat | Catalan | 0.9538 | 0.9475 | 0.9507 | 305 |
| cbk | Chavacano | 0.9627 | 0.9773 | 0.9699 | 132 |
| ceb | Cebuano | 0.8205 | 0.8533 | 0.8366 | 75 |
| ces | Czech | 0.9606 | 0.9740 | 0.9672 | 500 |
| chv | Chuvash | 0.9756 | 0.9877 | 0.9816 | 81 |
| ckb | Central Kurdish (Soranî) | 0.9751 | 0.9915 | 0.9832 | 355 |
| ckt | Chukchi | 0.9615 | 1.0000 | 0.9804 | 25 |
| cmn | Mandarin Chinese | 0.9530 | 0.8743 | 0.9120 | 557 |
| cor | Cornish | 0.9945 | 0.9628 | 0.9784 | 188 |
| csb | Kashubian | 0.9574 | 1.0000 | 0.9783 | 45 |
| cym | Welsh | 0.9375 | 0.9615 | 0.9494 | 78 |
| dan | Danish | 0.9401 | 0.9363 | 0.9382 | 1005 |
| deu | German | 0.9853 | 0.9781 | 0.9817 | 549 |
| dsb | Lower Sorbian | 0.8704 | 0.8246 | 0.8468 | 57 |
| dtp | Central Dusun | 0.8881 | 0.9549 | 0.9203 | 133 |
| ell | Greek | 0.9979 | 0.9979 | 0.9979 | 475 |
| eng | English | 0.9895 | 0.9839 | 0.9867 | 1055 |
| epo | Esperanto | 0.9817 | 0.9926 | 0.9871 | 540 |
| est | Estonian | 0.9545 | 0.9711 | 0.9628 | 173 |
| eus | Basque | 0.9844 | 0.9583 | 0.9712 | 264 |
| fao | Faroese | 0.9820 | 0.9859 | 0.9840 | 498 |
| fin | Finnish | 0.9932 | 0.9780 | 0.9855 | 1045 |
| fkv | Kven Finnish | 0.6154 | 0.8889 | 0.7273 | 18 |
| fra | French | 0.9871 | 0.9908 | 0.9890 | 542 |
| frr | North Frisian | 0.9640 | 0.9710 | 0.9675 | 138 |
| fry | Frisian | 0.6774 | 0.9545 | 0.7925 | 22 |
| gcf | Guadeloupean Creole French | 0.9619 | 1.0000 | 0.9806 | 101 |
| gla | Scottish Gaelic | 0.9412 | 0.9796 | 0.9600 | 49 |
| gle | Irish | 0.9635 | 0.9778 | 0.9706 | 135 |
| glg | Galician | 0.9104 | 0.9369 | 0.9234 | 206 |
| gos | Gronings | 0.9549 | 0.9588 | 0.9569 | 243 |
| grc | Ancient Greek | 0.9828 | 0.9828 | 0.9828 | 58 |
| grn | Guarani | 0.9684 | 0.9935 | 0.9808 | 154 |
| guc | Wayuu | 0.9111 | 0.9762 | 0.9425 | 42 |
| hau | Hausa | 0.9814 | 0.9953 | 0.9883 | 425 |
| heb | Hebrew | 1.0000 | 1.0000 | 1.0000 | 536 |
| hin | Hindi | 1.0000 | 0.9974 | 0.9987 | 391 |
| hoc | Ho | 0.9429 | 0.9167 | 0.9296 | 36 |
| hrv | Croatian | 0.7447 | 0.6119 | 0.6718 | 286 |
| hrx | Hunsrik | 0.8727 | 0.9231 | 0.8972 | 52 |
| hsb | Upper Sorbian | 0.8400 | 0.8289 | 0.8344 | 76 |
| hun | Hungarian | 0.9853 | 0.9926 | 0.9889 | 539 |
| hye | Armenian | 1.0000 | 1.0000 | 1.0000 | 225 |
| ido | Ido | 0.9791 | 0.9563 | 0.9676 | 343 |
| ile | Interlingue | 0.9352 | 0.9416 | 0.9384 | 291 |
| ilo | Ilocano | 0.9917 | 0.9600 | 0.9756 | 125 |
| ina | Interlingua | 0.9558 | 0.9621 | 0.9589 | 449 |
| ind | Indonesian | 0.8526 | 0.8203 | 0.8361 | 423 |
| isl | Icelandic | 0.9863 | 0.9897 | 0.9880 | 871 |
| ita | Italian | 0.9817 | 0.9711 | 0.9764 | 553 |
| jav | Javanese | 0.9600 | 0.9600 | 0.9600 | 50 |
| jbo | Lojban | 1.0000 | 0.9926 | 0.9963 | 405 |
| jpn | Japanese | 0.9851 | 1.0000 | 0.9925 | 530 |
| kab | Kabyle | 0.8382 | 0.8959 | 0.8661 | 509 |
| kat | Georgian | 1.0000 | 0.9885 | 0.9942 | 260 |
| kaz | Kazakh | 0.9896 | 0.9845 | 0.9870 | 193 |
| kha | Khasi | 0.9038 | 0.9400 | 0.9216 | 100 |
| khm | Khmer | 1.0000 | 1.0000 | 1.0000 | 75 |
| kmr | Northern Kurdish (Kurmancî) | 0.9851 | 0.9763 | 0.9807 | 338 |
| knc | Central Kanuri | 0.9719 | 0.9886 | 0.9802 | 175 |
| kor | Korean | 0.9972 | 0.9832 | 0.9902 | 358 |
| kzj | Coastal Kadazan | 0.9615 | 0.9336 | 0.9474 | 241 |
| lad | Ladino | 0.7846 | 0.7969 | 0.7907 | 64 |
| lat | Latin | 0.9756 | 0.9639 | 0.9697 | 498 |
| lfn | Lingua Franca Nova | 0.9745 | 0.9700 | 0.9723 | 434 |
| lij | Ligurian | 0.9333 | 0.9333 | 0.9333 | 90 |
| lin | Lingala | 0.9765 | 0.9765 | 0.9765 | 213 |
| lit | Lithuanian | 0.9864 | 0.9922 | 0.9893 | 512 |
| ltz | Luxembourgish | 0.9773 | 0.9348 | 0.9556 | 46 |
| lvs | Latvian | 0.9597 | 0.9795 | 0.9695 | 146 |
| lzh | Literary Chinese | 0.7692 | 0.8046 | 0.7865 | 87 |
| mal | Malayalam | 1.0000 | 1.0000 | 1.0000 | 44 |
| mar | Marathi | 0.9961 | 1.0000 | 0.9980 | 509 |
| mhr | Meadow Mari | 0.9849 | 0.9751 | 0.9800 | 201 |
| mkd | Macedonian | 0.9572 | 0.9480 | 0.9526 | 519 |
| mon | Mongolian | 0.9708 | 0.9779 | 0.9744 | 136 |
| mus | Muskogee (Creek) | 0.9000 | 0.9643 | 0.9310 | 28 |
| mya | Burmese | 1.0000 | 0.9643 | 0.9818 | 28 |
| nds | Low German (Low Saxon) | 0.9829 | 0.9710 | 0.9769 | 414 |
| nld | Dutch | 0.9662 | 0.9772 | 0.9717 | 527 |
| nnb | Nande | 0.9870 | 0.9870 | 0.9870 | 385 |
| nno | Norwegian Nynorsk | 0.9585 | 0.9652 | 0.9619 | 575 |
| nob | Norwegian Bokmål | 0.9247 | 0.9156 | 0.9201 | 912 |
| nst | Naga (Tangshang) | 1.0000 | 1.0000 | 1.0000 | 39 |
| nus | Nuer | 0.9903 | 0.9903 | 0.9903 | 103 |
| oci | Occitan | 0.9672 | 0.9555 | 0.9613 | 247 |
| orv | Old East Slavic | 0.9692 | 0.9692 | 0.9692 | 65 |
| oss | Ossetian | 0.9818 | 0.9926 | 0.9872 | 271 |
| ota | Ottoman Turkish | 0.9204 | 0.9905 | 0.9541 | 105 |
| pam | Kapampangan | 0.9865 | 0.9865 | 0.9865 | 74 |
| pcd | Picard | 0.9552 | 0.9846 | 0.9697 | 65 |
| pes | Persian | 0.9890 | 0.9890 | 0.9890 | 455 |
| pms | Piedmontese | 0.8780 | 0.9000 | 0.8889 | 40 |
| pol | Polish | 0.9848 | 0.9829 | 0.9838 | 526 |
| por | Portuguese | 0.9687 | 0.9616 | 0.9651 | 547 |
| prg | Old Prussian | 0.9800 | 0.9800 | 0.9800 | 50 |
| rhg | Rohingya | 0.9780 | 0.9944 | 0.9861 | 179 |
| rom | Romani | 0.9302 | 0.8889 | 0.9091 | 45 |
| ron | Romanian | 0.9826 | 0.9912 | 0.9869 | 457 |
| run | Kirundi | 0.9914 | 0.9665 | 0.9788 | 239 |
| rus | Russian | 0.9634 | 0.9814 | 0.9723 | 537 |
| sah | Yakut | 1.0000 | 0.9600 | 0.9796 | 50 |
| sat | Santali | 0.9942 | 0.9942 | 0.9942 | 171 |
| sdh | Southern Kurdish | 0.9423 | 0.9074 | 0.9245 | 54 |
| shi | Tashelhit | 0.9706 | 0.8980 | 0.9329 | 147 |
| slk | Slovak | 0.9333 | 0.9380 | 0.9356 | 403 |
| slv | Slovenian | 0.7018 | 0.8889 | 0.7843 | 45 |
| sma | Southern Sami | 0.9600 | 0.9600 | 0.9600 | 100 |
| sme | Northern Sami | 0.9980 | 0.9901 | 0.9940 | 504 |
| smj | Lule Sami | 0.9820 | 0.9959 | 0.9889 | 493 |
| smn | Inari Sami | 0.9950 | 0.9900 | 0.9925 | 201 |
| sms | Skolt Sami | 0.9750 | 0.9848 | 0.9799 | 198 |
| spa | Spanish | 0.9760 | 0.9601 | 0.9680 | 551 |
| sqi | Albanian | 0.9762 | 0.9762 | 0.9762 | 126 |
| srp | Serbian | 0.8367 | 0.8216 | 0.8291 | 499 |
| swc | Congo Swahili | 0.8727 | 0.8458 | 0.8591 | 454 |
| swe | Swedish | 0.9819 | 0.9819 | 0.9819 | 994 |
| swg | Swabian | 0.9694 | 0.9406 | 0.9548 | 101 |
| swh | Swahili | 0.6798 | 0.7225 | 0.7005 | 191 |
| tat | Tatar | 0.9791 | 0.9843 | 0.9817 | 381 |
| tgl | Tagalog | 0.9757 | 0.9710 | 0.9734 | 414 |
| tha | Thai | 1.0000 | 0.9910 | 0.9955 | 222 |
| thv | Tahaggart Tamahaq | 0.6552 | 0.7037 | 0.6786 | 27 |
| tig | Tigre | 1.0000 | 1.0000 | 1.0000 | 181 |
| tlh | Klingon | 1.0000 | 0.9932 | 0.9966 | 442 |
| tok | Toki Pona | 1.0000 | 1.0000 | 1.0000 | 495 |
| tpw | Old Tupi | 0.8929 | 0.9259 | 0.9091 | 27 |
| tuk | Turkmen | 0.9779 | 0.9603 | 0.9690 | 277 |
| tur | Turkish | 0.9908 | 0.9541 | 0.9721 | 567 |
| uig | Uyghur | 0.9966 | 0.9900 | 0.9933 | 300 |
| ukr | Ukrainian | 0.9831 | 0.9831 | 0.9831 | 534 |
| urd | Urdu | 1.0000 | 0.9914 | 0.9957 | 116 |
| uzb | Uzbek | 0.8200 | 0.9318 | 0.8723 | 44 |
| vie | Vietnamese | 0.9977 | 0.9953 | 0.9965 | 427 |
| vol | Volapük | 0.9908 | 0.9908 | 0.9908 | 218 |
| war | Waray | 0.9307 | 0.9691 | 0.9495 | 97 |
| wuu | Shanghainese | 0.8318 | 0.9036 | 0.8662 | 197 |
| xal | Kalmyk | 0.9302 | 0.9524 | 0.9412 | 42 |
| xmf | Mingrelian | 0.7419 | 0.8519 | 0.7931 | 27 |
| yid | Yiddish | 0.9971 | 1.0000 | 0.9986 | 348 |
| yue | Cantonese | 0.9004 | 0.9711 | 0.9344 | 242 |
| zgh | Standard Moroccan Tamazight | 0.9873 | 0.9873 | 0.9873 | 158 |
| zlm | Malay (Vernacular) | 0.8488 | 0.8902 | 0.8690 | 82 |
| zsm | Malay | 0.7606 | 0.7883 | 0.7742 | 274 |
| zza | Zaza | 0.9294 | 0.9634 | 0.9461 | 82 |
| | Accuracy | | | 0.9591 | 44049 |
| | Weighted avg | 0.9604 | 0.9591 | 0.9595 | 44049 |
| | Macro avg | 0.9371 | 0.9474 | 0.9413 | 44049 |