yeshpanovrustem/xlm-roberta-large-ner-kazakh

Token Classification

Named Entity Recognition


yeshpanovrustem committed on May 20, 2023

Commit b72b769 • 1 Parent(s): 22fb2c9

Update README.md

Files changed (1)

README.md +8 -1


@@ -5,4 +5,11 @@ license: cc-by-4.0
 - The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
 - The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.
 ## Differences
-While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens were removed.

+While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens were removed.
+Dataset | Unit | Train | Valid | Test | Total |
+| :---: | :---: | :---: | :---: | :---: | :---: |
+KazNERD (Original)| Sentences | 90,228 (80.06%) | 11,167 (9.91%)| 11,307 (10.03%) | 112,702 (100%) |
+KazNERD (Cleaned) | Sentences | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
+KazNERD (Original)| Tokens | 1,043,305 (80.11%) | 129,223 (9.92%)| 129,824 (9.97%) | 1,302,352 (100%) |
+KazNERD (Cleaned) | Tokens | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
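
The cleaning step described under *Differences* can be sketched roughly as follows. This is a minimal illustration, not the author's actual preprocessing script: it assumes each sentence is a list of `(token, tag)` pairs, that disfluency tokens are fully wrapped in parentheses and noise tokens in square brackets, and that sentences left empty after filtering are dropped. The function names and the sample tokens are hypothetical.

```python
import re

# Assumption: a marker token is entirely wrapped in (...) or [...].
MARKER = re.compile(r"^(\(.*\)|\[.*\])$")


def clean_sentence(pairs):
    """Drop (token, tag) pairs whose token is a parenthesised or bracketed marker."""
    return [(tok, tag) for tok, tag in pairs if not MARKER.match(tok)]


def clean_dataset(sentences):
    """Clean every sentence; discard sentences that end up empty (an assumption)."""
    cleaned = (clean_sentence(sentence) for sentence in sentences)
    return [sentence for sentence in cleaned if sentence]


# Illustrative input only; these are not rows from KazNERD.
sample = [
    [("(ээ)", "O"), ("Алматы", "B-GPE"), ("[noise]", "O")],
    [("[noise]", "O")],
]
result = clean_dataset(sample)
```

Here `result` keeps only the `("Алматы", "B-GPE")` pair from the first sentence, and the second sentence is discarded because nothing remains after filtering.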