license: cc-by-4.0

A Named Entity Recognition Model for Kazakh

The model was inspired by the LREC 2022 paper "KazNERD: Kazakh Named Entity Recognition Dataset".
The original repository for the paper can be found at https://github.com/IS2AI/KazNERD.

Differences

The original dataset contains tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed]. This model was trained on a version of the dataset with those tokens removed, which changed the number of sentences, tokens, and named entities (NEs) in each split:

Dataset	Unit	Train	Valid	Test	Total
KazNERD (Original)	Sentence	90,228 (80.06%)	11,167 (9.91%)	11,307 (10.03%)	112,702 (100%)
KazNERD (Cleaned)	Sentence	88,540 (80.00%)	11,067 (10.00%)	11,068 (10.00%)	110,675 (100%)
KazNERD (Original)	Token	1,088,461 (80.04%)	136,021 (10.00%)	135,426 (9.96%)	1,359,908 (100%)
KazNERD (Cleaned)	Token	1,043,305 (80.11%)	129,223 (9.92%)	129,824 (9.97%)	1,302,352 (100%)
KazNERD (Original)	NE	109,342 (80.20%)	13,483 (9.89%)	13,508 (9.91%)	136,333 (100%)
KazNERD (Cleaned)	NE	106,148 (80.17%)	13,189 (9.96%)	13,072 (9.87%)	132,409 (100%)
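
The cleaning step described above could look roughly like the following. This is a minimal sketch, not the preprocessing actually used for this model: it assumes each sentence is a pair of parallel token and tag lists, and that a noise token is one wrapped entirely in parentheses or square brackets. The function names and the example tokens are hypothetical.

```python
import re

# Assumption: a disfluency or hesitation token is fully wrapped in parentheses,
# and a background-noise token is fully wrapped in square brackets.
NOISE_TOKEN = re.compile(r"\(.*\)|\[.*\]")


def clean_sentence(tokens, tags):
    """Drop noise tokens together with their tags.

    Note: this sketch does not repair IOB tags if a removed token
    happened to start an entity.
    """
    kept = [
        (token, tag)
        for token, tag in zip(tokens, tags)
        if not NOISE_TOKEN.fullmatch(token)
    ]
    return [token for token, _ in kept], [tag for _, tag in kept]


def clean_dataset(sentences):
    """Clean every sentence and drop those left with no tokens."""
    cleaned = []
    for tokens, tags in sentences:
        new_tokens, new_tags = clean_sentence(tokens, tags)
        if new_tokens:  # a sentence made only of noise tokens disappears
            cleaned.append((new_tokens, new_tags))
    return cleaned


if __name__ == "__main__":
    # Made-up example data, not taken from KazNERD.
    sample = [
        (["(ээ)", "Алматы", "қаласы", "[шу]"], ["O", "B-GPE", "O", "O"]),
        (["[шу]"], ["O"]),
    ]
    for tokens, tags in clean_dataset(sample):
        print(list(zip(tokens, tags)))
```

Dropping sentences that become empty is one possible reason the sentence count can fall along with the token count, as it does in the table above.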