---
license: cc-by-4.0
language:
- kk
metrics:
- seqeval
pipeline_tag: token-classification
tags:
- NER
- Named Entity Recognition
widget:
- text: "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."
  example_title: "Example 1"
- text: "Ахмет Байтұрсынұлы — қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым."
  example_title: "Example 2"
- text: "Қазақстан мен ЕуроОдақ арасындағы тауар айналым былтыр 38% өсіп, 40 миллиард долларға жетті. Екі тарап серіктестікті одан әрі нығайтуға мүдделі. Атап айтсақ, Қазақстан Еуропаға құны 2 млрд доллардан асатын 175 тауар экспорттын ұлғайтуға дайын."
  example_title: "Example 3"
---

# A Named Entity Recognition Model for Kazakh

- The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
- The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.

## KazNERD (cleaned)

While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens and duplicates were removed. As a result, the number of sentences, tokens, and named entities (NEs) in the cleaned dataset changed.
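The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the cleaned dataset: the function names, the regular expressions, the sample tag names, and the reading of "duplicates" as repeated sentences are all assumptions.

```python
import re

# Assumption: a disfluency or hesitation token is wrapped in parentheses,
# and a background-noise token is wrapped in square brackets.
DISFLUENCY = re.compile(r"^\(.*\)$")
NOISE = re.compile(r"^\[.*\]$")


def clean_sentence(tokens, tags):
    """Drop disfluency and noise tokens together with their tags."""
    kept = [
        (tok, tag)
        for tok, tag in zip(tokens, tags)
        if not (DISFLUENCY.match(tok) or NOISE.match(tok))
    ]
    return [tok for tok, _ in kept], [tag for _, tag in kept]


def clean_dataset(sentences):
    """Clean every (tokens, tags) pair, then drop empty and duplicate sentences."""
    seen = set()
    cleaned = []
    for tokens, tags in sentences:
        tokens, tags = clean_sentence(tokens, tags)
        key = (tuple(tokens), tuple(tags))
        if not tokens or key in seen:
            continue  # nothing left after cleaning, or already kept once
        seen.add(key)
        cleaned.append((tokens, tags))
    return cleaned
```

The sketch assumes the removed tokens carry the `O` tag; if one of them started an entity, the remaining tags would need to be repaired as well.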
### Statistics for training (Train), validation (Valid), and test (Test) sets

| Unit | Train | Valid | Test | Total |
| :---: | :---: | :---: | :---: | :---: |
| Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
| Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
| NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |

### 80 / 10 / 10 split

| Representation | Train | Valid | Test | Total |
| :---: | :---: | :---: | :---: | :---: |
| **AID** | 67,582 (79.99%) | 8,439 (9.99%) | 8,467 (10.02%) | 84,488 (100%) |
| **BID** | 19,006 (80.11%) | 2,380 (10.03%) | 2,338 (9.85%) | 23,724 (100%) |
| **CID** | 1,050 (78.89%) | 138 (10.37%) | 143 (10.74%) | 1,331 (100%) |
| **DID** | 633 (79.22%) | 82 (10.26%) | 84 (10.51%) | 799 (100%) |
| **EID** | 260 (81.00%) | 27 (8.41%) | 34 (10.59%) | 321 (100%) |
| **FID** | 9 (75.00%) | 1 (8.33%) | 2 (16.67%) | 12 (100%) |
| **Total** | **88,540 (80.00%)** | **11,067 (10.00%)** | **11,068 (10.00%)** | **110,675 (100%)** |

### Distribution of representations across sets

| Representation | Train | Valid | Test | Total |
| :---: | :---: | :---: | :---: | :---: |
| **AID** | 67,582 (76.33%) | 8,439 (76.25%) | 8,467 (76.50%) | 84,488 (76.34%) |
| **BID** | 19,006 (21.47%) | 2,380 (21.51%) | 2,338 (21.12%) | 23,724 (21.44%) |
| **CID** | 1,050 (1.19%) | 138 (1.25%) | 143 (1.29%) | 1,331 (1.20%) |
| **DID** | 633 (0.71%) | 82 (0.74%) | 84 (0.76%) | 799 (0.72%) |
| **EID** | 260 (0.29%) | 27 (0.24%) | 34 (0.31%) | 321 (0.29%) |
| **FID** | 9 (0.01%) | 1 (0.01%) | 2 (0.02%) | 12 (0.01%) |
| **Total** | **88,540 (100.00%)** | **11,067 (100.00%)** | **11,068 (100.00%)** | **110,675 (100.00%)** |
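
The last two tables report the same sentence counts with different denominators: the first divides each count by its representation's total (the row), the second by its set's total (the column). As a rough sketch of that arithmetic, with function and variable names chosen here for illustration only:

```python
# Sentence counts per representation, copied from the tables above.
COUNTS = {
    "AID": {"Train": 67582, "Valid": 8439, "Test": 8467},
    "BID": {"Train": 19006, "Valid": 2380, "Test": 2338},
    "CID": {"Train": 1050, "Valid": 138, "Test": 143},
    "DID": {"Train": 633, "Valid": 82, "Test": 84},
    "EID": {"Train": 260, "Valid": 27, "Test": 34},
    "FID": {"Train": 9, "Valid": 1, "Test": 2},
}
SETS = ("Train", "Valid", "Test")


def split_shares(counts):
    """Percentage of each representation's sentences that falls in each set."""
    return {
        rep: {s: 100 * row[s] / sum(row.values()) for s in SETS}
        for rep, row in counts.items()
    }


def set_shares(counts):
    """Percentage of each set's sentences that belongs to each representation."""
    totals = {s: sum(row[s] for row in counts.values()) for s in SETS}
    return {
        rep: {s: 100 * row[s] / totals[s] for s in SETS}
        for rep, row in counts.items()
    }


print(f"{split_shares(COUNTS)['AID']['Train']:.2f}")  # 79.99
print(f"{set_shares(COUNTS)['AID']['Train']:.2f}")    # 76.33
```

Within each set, the shares returned by `set_shares` sum to 100%, which is why every cell of the Total row in the distribution table is 100%.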