license: cc-by-4.0
language:
  - kk
metrics:
  - seqeval
pipeline_tag: token-classification
tags:
  - NER
  - Named Entity Recognition
widget:
  - text: >-
      Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан
      мемлекет.
    example_title: Example 1
  - text: Ахмет Байтұрсынұлы қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым.
    example_title: Example 2

A Named Entity Recognition Model for Kazakh

Differences

The original dataset contained tokens denoting speech disfluencies and hesitations (in parentheses) and background noise [in square brackets]. This model was trained on a version of the dataset from which such tokens and duplicates were removed, so the numbers of sentences, tokens, and named entities (NEs) in the cleaned dataset differ from those of the original. The token counts reported for the original dataset also appear to have been calculated incorrectly; they should likely have been 1,120,387 (Train), 136,983 (Valid), 134,540 (Test), and 1,391,910 (Total).
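The cleaning step described above might look roughly like the following. This is a minimal sketch, not the procedure actually used for this model: the function name `clean_sentences`, the `(token, tag)` sentence layout, the example tags, and the choice to drop duplicates after noise tokens are removed are all assumptions made for illustration.

```python
import re

# Assumption: a noise token is a whole token wrapped in (...) or [...].
NOISE_TOKEN = re.compile(r"\(.*\)|\[.*\]")


def clean_sentences(sentences):
    """Drop disfluency/noise tokens, then drop empty and duplicate sentences.

    `sentences` is a list of sentences, each a list of (token, tag) pairs.
    This layout is hypothetical; adapt it to the real file format.
    """
    seen = set()
    cleaned = []
    for sentence in sentences:
        kept = [(tok, tag) for tok, tag in sentence
                if not NOISE_TOKEN.fullmatch(tok)]
        key = tuple(kept)
        if not kept or key in seen:
            continue  # nothing left after cleaning, or already seen
        seen.add(key)
        cleaned.append(kept)
    return cleaned


# Illustrative data only; the tags and noise tokens are made up.
sample = [
    [("Ахмет", "B-PERSON"), ("(ee)", "O"), ("Байтұрсынұлы", "I-PERSON")],
    [("[music]", "O")],
    [("Ахмет", "B-PERSON"), ("Байтұрсынұлы", "I-PERSON")],
]
print(clean_sentences(sample))
```

In this sketch the third sentence is treated as a duplicate because it matches the first one once the hesitation token is removed. Deduplicating before cleaning instead would keep both, which is one reason sentence counts depend on the exact order of the steps.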

| Dataset | Unit | Train | Valid | Test | Total |
|---|---|---|---|---|---|
| KazNERD (Original) | Sentence | 90,228 (80.06%) | 11,167 (9.91%) | 11,307 (10.03%) | 112,702 (100%) |
| KazNERD (Cleaned) | Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
| KazNERD (Original) | Token | 1,043,305 (80.11%) | 129,223 (9.92%) | 129,824 (9.97%) | 1,302,352 (100%) |
| KazNERD (Cleaned) | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
| KazNERD (Original) | NE | 109,342 (80.20%) | 13,483 (9.89%) | 13,508 (9.91%) | 136,333 (100%) |
| KazNERD (Cleaned) | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
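
The totals and split percentages in the table can be recomputed from the per-split counts. The snippet below is a small sanity-check sketch; the dictionary `COUNTS` and the helper `split_shares` are names introduced here for illustration, and the numbers are copied from the table above.

```python
# (train, valid, test) counts copied from the table, keyed by version and unit.
COUNTS = {
    ("Original", "Sentence"): (90_228, 11_167, 11_307),
    ("Cleaned", "Sentence"): (88_540, 11_067, 11_068),
    ("Original", "Token"): (1_043_305, 129_223, 129_824),
    ("Cleaned", "Token"): (1_088_461, 136_021, 135_426),
    ("Original", "NE"): (109_342, 13_483, 13_508),
    ("Cleaned", "NE"): (106_148, 13_189, 13_072),
}


def split_shares(train, valid, test):
    """Return the total and each split's share of it, in percent."""
    total = train + valid + test
    shares = tuple(round(100 * n / total, 2) for n in (train, valid, test))
    return total, shares


for (version, unit), splits in COUNTS.items():
    total, shares = split_shares(*splits)
    print(f"{version:8} {unit:8} total={total:>9,} shares={shares}")
```

The same helper can be applied to the suggested corrected token counts for the original dataset, `split_shares(1_120_387, 136_983, 134_540)`, whose total is 1,391,910.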