license: cc-by-4.0

A Named Entity Recognition Model for Kazakh

Differences

While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset from which such tokens were removed. Removing them changed the number of sentences, tokens, and named entities (NEs), as summarised in the table below.

| Dataset | Unit | Train | Valid | Test | Total |
|---|---|---|---|---|---|
| KazNERD (Original) | Sentence | 90,228 (80.06%) | 11,167 (9.91%) | 11,307 (10.03%) | 112,702 (100%) |
| KazNERD (Cleaned) | Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
| KazNERD (Original) | Token | 1,043,305 (80.11%) | 129,223 (9.92%) | 129,824 (9.97%) | 1,302,352 (100%) |
| KazNERD (Cleaned) | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
| KazNERD (Original) | NE | 109,342 (80.20%) | 13,483 (9.89%) | 13,508 (9.91%) | 136,333 (100%) |
| KazNERD (Cleaned) | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
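
The cleaning step described above could look roughly like the sketch below. It is only an illustration, not the script used to produce the cleaned dataset: it assumes each sentence is a pair of parallel token and tag lists, that disfluency tokens are fully wrapped in parentheses and noise tokens fully wrapped in square brackets, and that sentences left empty after filtering are dropped. The function names, the tag labels, and the sample tokens are hypothetical.

```python
import re

# Assumption: a disfluency/hesitation token looks like "(mm)" and a
# background-noise token looks like "[noise]". The actual dataset may differ.
DISFLUENCY = re.compile(r"^\(.*\)$")
NOISE = re.compile(r"^\[.*\]$")


def clean_sentence(tokens, tags):
    """Drop parenthesised and bracketed tokens together with their tags."""
    kept = [
        (tok, tag)
        for tok, tag in zip(tokens, tags)
        if not (DISFLUENCY.match(tok) or NOISE.match(tok))
    ]
    return [tok for tok, _ in kept], [tag for _, tag in kept]


def clean_dataset(sentences):
    """Clean every sentence and skip those that end up with no tokens."""
    cleaned = []
    for tokens, tags in sentences:
        new_tokens, new_tags = clean_sentence(tokens, tags)
        if new_tokens:  # a sentence made up only of removed tokens is dropped
            cleaned.append((new_tokens, new_tags))
    return cleaned


# Hypothetical sample data, for illustration only.
sample = [
    (["(mm)", "Astana", "[noise]", "city"], ["O", "B-GPE", "O", "O"]),
    (["[noise]"], ["O"]),
]
result = clean_dataset(sample)
```

Dropping sentences that become empty is one way the sentence count could shrink, and removing tagged tokens is one way the NE count could change; the sketch does not repair tag sequences that lose their first token.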