yeshpanovrustem committed on
Commit
6687578
1 Parent(s): 05c21b6

Update README.md

Files changed (1)
  1. README.md +7 -10
README.md CHANGED
@@ -21,15 +21,12 @@ widget:
21
  # A Named Entity Recognition Model for Kazakh
22
  - The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
23
  - The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.
24
- ## Differences
25
  While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens and duplicates were removed.
26
- As a result, the number of sentences, tokens, and named entities (NEs) in the cleaned dataset changed. It is also likely that token numbers were calculated incorrectly in the original dataset and should have been given as 1,120,387 (Train), 136,983 (Valid), 134,540 (Test), and 1,391,910 (Total).
27
 
28
- Dataset | Unit | Train | Valid | Test | Total |
29
- | :---: | :---: | :---: | :---: | :---: | :---: |
30
- KazNERD (Original)| Sentence | 90,228 (80.06%) | 11,167 (9.91%)| 11,307 (10.03%) | 112,702 (100%) |
31
- KazNERD (Cleaned) | Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
32
- KazNERD (Original)| Token | 1,043,305 (80.11%) | 129,223 (9.92%)| 129,824 (9.97%) | 1,302,352 (100%) |
33
- KazNERD (Cleaned) | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
34
- KazNERD (Original)| NE | 109,342 (80.20%) | 13,483 (9.89%)| 13,508 (9.91%) | 136,333 (100%) |
35
- KazNERD (Cleaned) | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
 
21
  # A Named Entity Recognition Model for Kazakh
22
  - The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
23
  - The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.
24
+ ## KazNERD (cleaned)
25
  While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens and duplicates were removed.
26
+ As a result, the number of sentences, tokens, and named entities (NEs) in the cleaned dataset changed.
27
 
28
+ | Unit | Train | Valid | Test | Total |
29
+ | :---: | :---: | :---: | :---: | :---: |
30
+ | Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
31
+ | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
32
+ | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
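
The percentages in the cleaned-dataset table can be re-derived from the raw counts. The snippet below is a minimal sketch of that check, not part of the model repository: the names `CLEANED_COUNTS` and `split_shares` are hypothetical, and the counts are copied by hand from the table above.

```python
# Split sizes copied from the cleaned KazNERD table: (train, valid, test).
# CLEANED_COUNTS and split_shares are illustrative names, not repository code.
CLEANED_COUNTS = {
    "Sentence": (88_540, 11_067, 11_068),
    "Token": (1_088_461, 136_021, 135_426),
    "NE": (106_148, 13_189, 13_072),
}


def split_shares(counts):
    """Return the total and each split's share as a two-decimal percentage string."""
    total = sum(counts)
    return total, [f"{100 * c / total:.2f}%" for c in counts]


if __name__ == "__main__":
    for unit, counts in CLEANED_COUNTS.items():
        total, shares = split_shares(counts)
        print(f"{unit}: total={total:,} shares={shares}")
```

Running it should reproduce the totals and the roughly 80/10/10 split reported for each unit.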