license: cc-by-4.0
language:
  - kk
metrics:
  - seqeval
pipeline_tag: token-classification
tags:
  - Named Entity Recognition
  - NER
widget:
  - text: >-
      Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан
      мемлекет.
    example_title: Example 1
  - text: Ахмет Байтұрсынұлы  қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым.
    example_title: Example 2
  - text: >-
      Қазақстан мен ЕуроОдақ арасындағы тауар айналым былтыр 38% өсіп, 40
      миллиард долларға жетті. Екі тарап серіктестікті одан әрі нығайтуға
      мүдделі. Атап айтсақ, Қазақстан Еуропаға құны 2 млрд доллардан асатын 175
      тауар экспорттын ұлғайтуға дайын.
    example_title: Example 3
datasets:
  - yeshpanovrustem/kaznerd_cleaned

A Named Entity Recognition Model for Kazakh

KazNERD (cleaned)

The original dataset contains tokens that mark speech disfluencies and hesitations (in parentheses) and background noise [in square brackets]. This model was trained on a cleaned version of the dataset from which such tokens and duplicates were removed. As a result, the numbers of sentences, tokens, and named entities (NEs) in the cleaned dataset differ from those of the original.
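The cleaning described above could look roughly like the following. This is a minimal sketch, not the authors' actual preprocessing script: it assumes each sentence is a pair of parallel token and tag lists, that a noise token is wholly enclosed in `(...)` or `[...]`, and that "duplicates" means repeated sentences. The function names and the sample tokens are hypothetical.

```python
import re

# Assumed token conventions, taken from the description above:
# "(...)" marks a disfluency or hesitation, "[...]" marks background noise.
_NOISE = re.compile(r"^(\(.*\)|\[.*\])$")


def clean_sentence(tokens, tags):
    """Drop noise tokens together with their tags."""
    kept = [(tok, tag) for tok, tag in zip(tokens, tags) if not _NOISE.match(tok)]
    return [tok for tok, _ in kept], [tag for _, tag in kept]


def clean_corpus(sentences):
    """Clean every (tokens, tags) pair, then drop empty and duplicate sentences."""
    seen = set()
    cleaned = []
    for tokens, tags in sentences:
        tokens, tags = clean_sentence(tokens, tags)
        key = (tuple(tokens), tuple(tags))
        if not tokens or key in seen:
            continue
        seen.add(key)
        cleaned.append((tokens, tags))
    return cleaned


# Invented sample data, for illustration only.
sample = [
    (["Қазақстан", "(ээ)", "мемлекет"], ["B-GPE", "O", "O"]),
    (["[шу]", "Қазақстан", "мемлекет"], ["O", "B-GPE", "O"]),
]
print(clean_corpus(sample))
```

Both sample sentences reduce to the same token sequence once the noise tokens are gone, so only one of them survives. This illustrates why removing such tokens can change the sentence count as well as the token count.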

Statistics for training (Train), validation (Valid), and test (Test) sets

| Unit | Train | Valid | Test | Total |
|---|---|---|---|---|
| Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
| Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
| NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |

80 / 10 / 10 split

| Representation | Train | Valid | Test | Total |
|---|---|---|---|---|
| AID | 67,582 (79.99%) | 8,439 (9.99%) | 8,467 (10.02%) | 84,488 (100%) |
| BID | 19,006 (80.11%) | 2,380 (10.03%) | 2,338 (9.85%) | 23,724 (100%) |
| CID | 1,050 (78.89%) | 138 (10.37%) | 143 (10.74%) | 1,331 (100%) |
| DID | 633 (79.22%) | 82 (10.26%) | 84 (10.51%) | 799 (100%) |
| EID | 260 (81.00%) | 27 (8.41%) | 34 (10.59%) | 321 (100%) |
| FID | 9 (75.00%) | 1 (8.33%) | 2 (16.67%) | 12 (100%) |
| Total | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |

Distribution of representations across sets

| Representation | Train | Valid | Test | Total |
|---|---|---|---|---|
| AID | 67,582 (76.33%) | 8,439 (76.25%) | 8,467 (76.50%) | 84,488 (76.34%) |
| BID | 19,006 (21.47%) | 2,380 (21.51%) | 2,338 (21.12%) | 23,724 (21.44%) |
| CID | 1,050 (1.19%) | 138 (1.25%) | 143 (1.29%) | 1,331 (1.20%) |
| DID | 633 (0.71%) | 82 (0.74%) | 84 (0.76%) | 799 (0.72%) |
| EID | 260 (0.29%) | 27 (0.24%) | 34 (0.31%) | 321 (0.29%) |
| FID | 9 (0.01%) | 1 (0.01%) | 2 (0.02%) | 12 (0.01%) |
| Total | 88,540 (100%) | 11,067 (100%) | 11,068 (100%) | 110,675 (100%) |
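The two representation tables above report the same sentence counts against different denominators. The first divides each count by the representation's total across all sets, and the second divides it by the total of the set it belongs to. A small sketch of that arithmetic for the AID row, with the counts copied from the tables and the variable and function names chosen here for illustration:

```python
# Sentence counts for the AID representation, copied from the tables above.
aid = {"Train": 67_582, "Valid": 8_439, "Test": 8_467}
# Sentence totals per set, copied from the Total row.
set_totals = {"Train": 88_540, "Valid": 11_067, "Test": 11_068}


def percent(part, whole):
    """Return part / whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


aid_total = sum(aid.values())  # 84,488 sentences

# Row-wise: how the AID sentences are divided among the three sets.
split_share = {name: percent(n, aid_total) for name, n in aid.items()}

# Column-wise: what fraction of each set consists of AID sentences.
set_share = {name: percent(n, set_totals[name]) for name, n in aid.items()}

print(split_share)
print(set_share)
```

The row-wise shares stay close to the 80 / 10 / 10 split, while the column-wise shares show that AID accounts for roughly three quarters of the sentences in every set.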

Distribution of NEs across sets

| NE Class | Train | Valid | Test | Total |
|---|---|---|---|---|
| ADAGE | 153 (0.14%) | 19 (0.14%) | 17 (0.13%) | 189 (0.14%) |
| ART | 1,533 (1.44%) | 155 (1.18%) | 161 (1.23%) | 1,849 (1.40%) |
| CARDINAL | 23,135 (21.80%) | 2,878 (21.82%) | 2,789 (21.34%) | 28,802 (21.75%) |
| CONTACT | 159 (0.15%) | 18 (0.14%) | 20 (0.15%) | 197 (0.15%) |
| DATE | 20,006 (18.85%) | 2,603 (19.74%) | 2,584 (19.77%) | 25,193 (19.03%) |
| DISEASE | 1,022 (0.96%) | 121 (0.92%) | 119 (0.91%) | 1,262 (0.95%) |
| EVENT | 1,331 (1.25%) | 154 (1.17%) | 154 (1.18%) | 1,639 (1.24%) |
| FACILITY | 1,723 (1.62%) | 178 (1.35%) | 197 (1.51%) | 2,098 (1.58%) |
| GPE | 13,625 (12.84%) | 1,656 (12.56%) | 1,691 (12.94%) | 16,972 (12.82%) |
| LANGUAGE | 350 (0.33%) | 47 (0.36%) | 41 (0.31%) | 438 (0.33%) |
| LAW | 419 (0.39%) | 56 (0.42%) | 55 (0.42%) | 530 (0.40%) |
| LOCATION | 1,736 (1.64%) | 210 (1.59%) | 208 (1.59%) | 2,154 (1.63%) |
| MISCELLANEOUS | 191 (0.18%) | 26 (0.20%) | 26 (0.20%) | 243 (0.18%) |
| MONEY | 3,652 (3.44%) | 455 (3.45%) | 427 (3.27%) | 4,534 (3.42%) |
| NON_HUMAN | 6 (0.01%) | 1 (0.01%) | 1 (0.01%) | 8 (0.01%) |
| NORP | 2,929 (2.76%) | 374 (2.84%) | 368 (2.82%) | 3,671 (2.77%) |
| ORDINAL | 3,054 (2.88%) | 385 (2.92%) | 382 (2.92%) | 3,821 (2.89%) |
| ORGANISATION | 5,956 (5.61%) | 753 (5.71%) | 718 (5.49%) | 7,427 (5.61%) |
| PERCENTAGE | 3,357 (3.16%) | 437 (3.31%) | 462 (3.53%) | 4,256 (3.21%) |
| PERSON | 9,817 (9.25%) | 1,175 (8.91%) | 1,151 (8.81%) | 12,143 (9.17%) |
| POSITION | 4,844 (4.56%) | 587 (4.45%) | 597 (4.57%) | 6,028 (4.55%) |
| PRODUCT | 586 (0.55%) | 73 (0.55%) | 75 (0.57%) | 734 (0.55%) |
| PROJECT | 1,681 (1.58%) | 209 (1.58%) | 206 (1.58%) | 2,096 (1.58%) |
| QUANTITY | 3,063 (2.89%) | 411 (3.12%) | 403 (3.08%) | 3,877 (2.93%) |
| TIME | 1,820 (1.71%) | 208 (1.58%) | 220 (1.68%) | 2,248 (1.70%) |
| Total | 106,148 (100%) | 13,189 (100%) | 13,072 (100%) | 132,409 (100%) |
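Per-class NE counts like those in the table above can be derived from tag sequences. The sketch below assumes IOB2-style tags (`B-CLASS` opens a mention, `I-CLASS` continues it, `O` is outside), which this card does not state explicitly. The function names and the sample tags are invented for illustration.

```python
from collections import Counter


def count_entities(tag_sequences):
    """Count entity mentions per class; under IOB2 each B- tag opens one mention."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


def class_shares(counts):
    """Express each class count as a percentage of all mentions."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Invented tag sequences, one list per sentence.
sample_tags = [
    ["B-GPE", "I-GPE", "O", "B-DATE"],
    ["B-PERSON", "I-PERSON", "O"],
    ["B-GPE", "O"],
]
counts = count_entities(sample_tags)
print(counts)
print(class_shares(counts))
```

Counting only `B-` tags treats a multi-token mention as a single entity. If the data used a different scheme (for example plain IO tags), the counting rule would need to change.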