license: cc-by-4.0

A Named Entity Recognition Model for Kazakh

The model was inspired by the LREC 2022 paper KazNERD: Kazakh Named Entity Recognition Dataset.
The original repository for the paper can be found at https://github.com/IS2AI/KazNERD.

Differences

The original dataset contains tokens that mark speech disfluencies and hesitations (in parentheses) and background noise [in brackets]. This model was trained on a cleaned version of the dataset from which those tokens were removed.

Dataset	Unit	Train	Valid	Test	Total
KazNERD (Original)	Sentences	90,228 (80.06%)	11,167 (9.91%)	11,307 (10.03%)	112,702 (100%)
KazNERD (Cleaned)	Sentences	88,540 (80.00%)	11,067 (10.00%)	11,068 (10.00%)	110,675 (100%)
KazNERD (Original)	Tokens	1,088,461 (80.04%)	136,021 (10.00%)	135,426 (9.96%)	1,359,908 (100%)
KazNERD (Cleaned)	Tokens	1,043,305 (80.11%)	129,223 (9.92%)	129,824 (9.97%)	1,302,352 (100%)
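
The token removal described above could be approximated with a filter like the one below. This is only a minimal sketch, not the script used to build the cleaned dataset. It assumes that each sentence is a list of (token, tag) pairs, that a disfluency or noise marker is a single token fully wrapped in parentheses or square brackets, and that sentences left empty after filtering are dropped. The names `NOISE_TOKEN`, `clean_sentence` and `clean_dataset` and the sample tokens are hypothetical.

```python
import re

# Assumption: a marker is one whole token wrapped in (...) or [...].
NOISE_TOKEN = re.compile(r"^(\(.*\)|\[.*\])$")


def clean_sentence(pairs):
    """Return the (token, tag) pairs whose token is not a marker."""
    return [(tok, tag) for tok, tag in pairs if not NOISE_TOKEN.match(tok)]


def clean_dataset(sentences):
    """Clean every sentence and drop those that end up empty (an assumption)."""
    cleaned = (clean_sentence(sentence) for sentence in sentences)
    return [sentence for sentence in cleaned if sentence]


# Hypothetical sample: the tokens and tags are made up for illustration.
sample = [
    [("(ээ)", "O"), ("Алматы", "B-GPE")],
    [("[noise]", "O")],
]
print(clean_dataset(sample))
```

Under these assumptions, a filter of this kind lowers the token count and also lowers the sentence count, because sentences made up only of markers disappear.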