yeshpanovrustem's picture
Update README.md
b72b769
|
raw
history blame
1.1 kB
metadata
license: cc-by-4.0

A Named Entity Recognition Model for Kazakh

Differences

While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens were removed.

Dataset Unit Train Valid Test Total
KazNERD (Original) Sentences 90,228 (80.06%) 11,167 (9.91%) 11,307 (10.03%) 112,702 (100%)
KazNERD (Cleaned) Sentences 88,540 (80.00%) 11,067 (10.00%) 11,068 (10.00%) 110,675 (100%)
KazNERD (Original) Tokens 1,043,305 (80.11%) 129,223 (9.92%) 129,824 (9.97%) 1,302,352 (100%)
KazNERD (Cleaned) Tokens 1,088,461 (80.04%) 136,021 (10.00%) 135,426 (9.96%) 1,359,908 (100%)