---
license: cc-by-4.0
language:
- kk
metrics:
- seqeval
pipeline_tag: token-classification
tags:
- NER
- Named Entity Recognition
widget:
- text: "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."
  example_title: "Example 1"
- text: "Ахмет Байтұрсынұлы — қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым."
  example_title: "Example 2"
---
# A Named Entity Recognition Model for Kazakh
- The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
- The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.
## Differences
While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens and duplicates were removed.
As a result, the number of sentences, tokens, and named entities (NEs) in the cleaned dataset differs from the original. The token counts reported for the original dataset also appear to have been miscalculated; they should likely have been 1,120,387 (Train), 136,983 (Valid), 134,540 (Test), and 1,391,910 (Total).
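
The suggested token counts can be checked for internal consistency with a few lines of arithmetic. This is only a sanity check on the figures quoted above: it confirms that the three splits add up to the stated total and derives each split's share, and it says nothing about how the counts themselves were obtained.

```python
# Suggested token counts for the original dataset, as quoted above.
proposed = {"Train": 1_120_387, "Valid": 136_983, "Test": 134_540}
stated_total = 1_391_910

# The three splits should add up to the stated total.
total = sum(proposed.values())
print(total == stated_total)

# Share of each split, in percent, rounded to two decimals.
shares = {split: round(100 * n / total, 2) for split, n in proposed.items()}
print(shares)
```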

| Dataset | Unit | Train | Valid | Test | Total |
| :---: | :---: | :---: | :---: | :---: | :---: |
| KazNERD (Original) | Sentence | 90,228 (80.06%) | 11,167 (9.91%) | 11,307 (10.03%) | 112,702 (100%) |
| KazNERD (Cleaned) | Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
| KazNERD (Original) | Token | 1,043,305 (80.11%) | 129,223 (9.92%) | 129,824 (9.97%) | 1,302,352 (100%) |
| KazNERD (Cleaned) | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
| KazNERD (Original) | NE | 109,342 (80.20%) | 13,483 (9.89%) | 13,508 (9.91%) | 136,333 (100%) |
| KazNERD (Cleaned) | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
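
The cleaning step described under Differences could look roughly like the sketch below. It is not the script used to build the cleaned dataset; it assumes that each sentence is a pair of parallel token and tag lists, that disfluency and noise tokens are wholly wrapped in parentheses or square brackets, and that "duplicates" means repeated sentences. The function names, sample tokens, and tag labels are made up for illustration.

```python
import re

# Assumption: a token is a disfluency/hesitation if fully wrapped in (...)
# and background noise if fully wrapped in [...].
NOISE_TOKEN = re.compile(r"\(.*\)|\[.*\]")


def clean_sentence(tokens, tags):
    """Drop parenthesised and bracketed tokens together with their tags."""
    kept = [(tok, tag) for tok, tag in zip(tokens, tags)
            if not NOISE_TOKEN.fullmatch(tok)]
    return [tok for tok, _ in kept], [tag for _, tag in kept]


def clean_dataset(sentences):
    """Clean every sentence, then drop empty and duplicate sentences."""
    seen = set()
    cleaned = []
    for tokens, tags in sentences:
        tokens, tags = clean_sentence(tokens, tags)
        key = (tuple(tokens), tuple(tags))
        if not tokens or key in seen:
            continue
        seen.add(key)
        cleaned.append((tokens, tags))
    return cleaned


# Hypothetical sample: "(ee)" and "[noise]" stand in for real annotations.
sample = [
    (["(ee)", "Ахмет", "Байтұрсынұлы", "[noise]", "ғалым"],
     ["O", "B-PERSON", "I-PERSON", "O", "O"]),
    (["Ахмет", "Байтұрсынұлы", "ғалым"],
     ["O", "B-PERSON", "I-PERSON"][1:] + ["O"]),
]
cleaned = clean_dataset(sample)
print(len(cleaned), cleaned[0][0])
```

After cleaning, the first sample sentence becomes identical to the second, so only one of them is kept. A real pipeline would also need to check that removing a token never breaks a tag sequence, for example by deleting the first token of an entity.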