metadata
license: cc-by-4.0
language:
- kk
metrics:
- seqeval
pipeline_tag: token-classification
tags:
- Named Entity Recognition
- NER
widget:
- text: >-
Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан
мемлекет.
example_title: Example 1
- text: Ахмет Байтұрсынұлы — қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым.
example_title: Example 2
- text: >-
Қазақстан мен ЕуроОдақ арасындағы тауар айналым былтыр 38% өсіп, 40
миллиард долларға жетті. Екі тарап серіктестікті одан әрі нығайтуға
мүдделі. Атап айтсақ, Қазақстан Еуропаға құны 2 млрд доллардан асатын 175
тауар экспорттын ұлғайтуға дайын.
example_title: Example 3
datasets:
- yeshpanovrustem/kaznerd_cleaned
A Named Entity Recognition Model for Kazakh
- The model was inspired by the LREC 2022 paper KazNERD: Kazakh Named Entity Recognition Dataset.
- The original repository for the paper can be found at https://github.com/IS2AI/KazNERD.
KazNERD (cleaned)
While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset from which such tokens and duplicates were removed. As a result, the cleaned dataset contains different numbers of sentences, tokens, and named entities (NEs) than the original.
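The filtering described above can be sketched roughly as follows. This is only an illustration under stated assumptions: each sentence is a list of (token, tag) pairs, disfluency markers are single tokens wrapped in parentheses, noise markers are single tokens wrapped in square brackets, and "duplicates" means sentences with identical token/tag sequences. The name `clean_sentences`, the regular expression, and the sample tokens are hypothetical and do not come from the original cleaning procedure.

```python
import re

# Assumption: marker tokens look like "(...)" or "[...]" as whole tokens.
MARKER_TOKEN = re.compile(r"\(.*\)|\[.*\]")


def clean_sentences(sentences):
    """Drop marker tokens, then drop empty and duplicate sentences."""
    seen = set()
    cleaned = []
    for sentence in sentences:
        # Keep only (token, tag) pairs whose token is not a marker.
        kept = [(tok, tag) for tok, tag in sentence
                if not MARKER_TOKEN.fullmatch(tok)]
        key = tuple(kept)
        # Skip sentences that became empty or repeat an earlier one.
        if kept and key not in seen:
            seen.add(key)
            cleaned.append(kept)
    return cleaned


# Hypothetical sample: the first two sentences collapse into one after cleaning.
sample = [
    [("Қазақстан", "B-GPE"), ("(uh)", "O"), ("мемлекет", "O")],
    [("Қазақстан", "B-GPE"), ("[noise]", "O"), ("мемлекет", "O")],
    [("Ахмет", "B-PERSON"), ("Байтұрсынұлы", "I-PERSON")],
]
cleaned = clean_sentences(sample)
print(len(cleaned))  # 2
```

Removing a sentence or token in this way changes all three counts reported below (sentences, tokens, and NEs), which is why the cleaned statistics differ from those of the original dataset.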
Statistics for training (Train), validation (Valid), and test (Test) sets
Unit | Train | Valid | Test | Total |
---|---|---|---|---|
Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
80 / 10 / 10 split
Representation | Train | Valid | Test | Total |
---|---|---|---|---|
AID | 67,582 (79.99%) | 8,439 (9.99%) | 8,467 (10.02%) | 84,488 (100%) |
BID | 19,006 (80.11%) | 2,380 (10.03%) | 2,338 (9.85%) | 23,724 (100%) |
CID | 1,050 (78.89%) | 138 (10.37%) | 143 (10.74%) | 1,331 (100%) |
DID | 633 (79.22%) | 82 (10.26%) | 84 (10.51%) | 799 (100%) |
EID | 260 (81.00%) | 27 (8.41%) | 34 (10.59%) | 321 (100%) |
FID | 9 (75.00%) | 1 (8.33%) | 2 (16.67%) | 12 (100%) |
Total | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
Distribution of representations across sets
Representation | Train | Valid | Test | Total |
---|---|---|---|---|
AID | 67,582 (76.33%) | 8,439 (76.25%) | 8,467 (76.50%) | 84,488 (76.34%) |
BID | 19,006 (21.47%) | 2,380 (21.51%) | 2,338 (21.12%) | 23,724 (21.44%) |
CID | 1,050 (1.19%) | 138 (1.25%) | 143 (1.29%) | 1,331 (1.20%) |
DID | 633 (0.71%) | 82 (0.74%) | 84 (0.76%) | 799 (0.72%) |
EID | 260 (0.29%) | 27 (0.24%) | 34 (0.31%) | 321 (0.29%) |
FID | 9 (0.01%) | 1 (0.01%) | 2 (0.02%) | 12 (0.01%) |
Total | 88,540 (100.00%) | 11,067 (100.00%) | 11,068 (100.00%) | 110,675 (100%) |
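The two representation tables above use the same sentence counts but normalise them differently: the first divides each count by its row total (share of a representation that falls into each set), the second by its column total (share of a set made up by each representation). A small sketch of that arithmetic, with the counts copied from the tables; the helper names `row_shares` and `column_shares` are illustrative only:

```python
# Sentence counts per representation: (train, valid, test).
COUNTS = {
    "AID": (67_582, 8_439, 8_467),
    "BID": (19_006, 2_380, 2_338),
    "CID": (1_050, 138, 143),
    "DID": (633, 82, 84),
    "EID": (260, 27, 34),
    "FID": (9, 1, 2),
}


def row_shares(name):
    """Percentage of one representation's sentences in each set."""
    row = COUNTS[name]
    total = sum(row)
    return tuple(round(100 * n / total, 2) for n in row)


def column_shares(split_index):
    """Percentage of one set's sentences per representation (0=train, 1=valid, 2=test)."""
    total = sum(row[split_index] for row in COUNTS.values())
    return {name: round(100 * row[split_index] / total, 2)
            for name, row in COUNTS.items()}


print(row_shares("CID"))        # (78.89, 10.37, 10.74)
print(column_shares(1)["AID"])  # 76.25
```

Each column of the second table therefore sums to 100%, up to rounding.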
Distribution of NEs across sets
NE Class | Train | Valid | Test | Total |
---|---|---|---|---|
ADAGE | 153 (0.14%) | 19 (0.14%) | 17 (0.13%) | 189 (0.14%) |
ART | 1,533 (1.44%) | 155 (1.18%) | 161 (1.23%) | 1,849 (1.40%) |
CARDINAL | 23,135 (21.80%) | 2,878 (21.82%) | 2,789 (21.34%) | 28,802 (21.75%) |
CONTACT | 159 (0.15%) | 18 (0.14%) | 20 (0.15%) | 197 (0.15%) |
DATE | 20,006 (18.85%) | 2,603 (19.74%) | 2,584 (19.77%) | 25,193 (19.03%) |
DISEASE | 1,022 (0.96%) | 121 (0.92%) | 119 (0.91%) | 1,262 (0.95%) |
EVENT | 1,331 (1.25%) | 154 (1.17%) | 154 (1.18%) | 1,639 (1.24%) |
FACILITY | 1,723 (1.62%) | 178 (1.35%) | 197 (1.51%) | 2,098 (1.58%) |
GPE | 13,625 (12.84%) | 1,656 (12.56%) | 1,691 (12.94%) | 16,972 (12.82%) |
LANGUAGE | 350 (0.33%) | 47 (0.36%) | 41 (0.31%) | 438 (0.33%) |
LAW | 419 (0.39%) | 56 (0.42%) | 55 (0.42%) | 530 (0.40%) |
LOCATION | 1,736 (1.64%) | 210 (1.59%) | 208 (1.59%) | 2,154 (1.63%) |
MISCELLANEOUS | 191 (0.18%) | 26 (0.20%) | 26 (0.20%) | 243 (0.18%) |
MONEY | 3,652 (3.44%) | 455 (3.45%) | 427 (3.27%) | 4,534 (3.42%) |
NON_HUMAN | 6 (0.01%) | 1 (0.01%) | 1 (0.01%) | 8 (0.01%) |
NORP | 2,929 (2.76%) | 374 (2.84%) | 368 (2.82%) | 3,671 (2.77%) |
ORDINAL | 3,054 (2.88%) | 385 (2.92%) | 382 (2.92%) | 3,821 (2.89%) |
ORGANISATION | 5,956 (5.61%) | 753 (5.71%) | 718 (5.49%) | 7,427 (5.61%) |
PERCENTAGE | 3,357 (3.16%) | 437 (3.31%) | 462 (3.53%) | 4,256 (3.21%) |
PERSON | 9,817 (9.25%) | 1,175 (8.91%) | 1,151 (8.81%) | 12,143 (9.17%) |
POSITION | 4,844 (4.56%) | 587 (4.45%) | 597 (4.57%) | 6,028 (4.55%) |
PRODUCT | 586 (0.55%) | 73 (0.55%) | 75 (0.57%) | 734 (0.55%) |
PROJECT | 1,681 (1.58%) | 209 (1.58%) | 206 (1.58%) | 2,096 (1.58%) |
QUANTITY | 3,063 (2.89%) | 411 (3.12%) | 403 (3.08%) | 3,877 (2.93%) |
TIME | 1,820 (1.71%) | 208 (1.58%) | 220 (1.68%) | 2,248 (1.70%) |
Total | 106,148 (100%) | 13,189 (100%) | 13,072 (100%) | 132,409 (100%) |
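Per-class NE counts like those in the table above can be derived from token-level tags. The sketch below assumes IOB2-style tags (for example `B-GPE`, `I-GPE`, `O`), in which every entity begins with exactly one `B-` tag; that tagging scheme, the function name `count_entities`, and the sample sequences are assumptions made for illustration, not details stated in this card.

```python
from collections import Counter


def count_entities(tag_sequences):
    """Count entities per class, assuming each entity starts with a B- tag."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1  # strip the "B-" prefix to get the class
    return counts


# Hypothetical tag sequences for two sentences.
sample_tags = [
    ["B-GPE", "I-GPE", "O", "B-DATE"],
    ["B-PERSON", "I-PERSON", "O", "B-GPE"],
]
entity_counts = count_entities(sample_tags)
print(entity_counts["GPE"])  # 2
```

Dividing each class count by the total for its set gives the percentages shown in the corresponding column.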