yeshpanovrustem/xlm-roberta-large-ner-kazakh

---
license: cc-by-4.0
language:
- kk
metrics:
- seqeval
pipeline_tag: token-classification
tags:
- NER
- Named Entity Recognition
widget:
- text: "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."
  example_title: "Example 1"
- text: "Ахмет Байтұрсынұлы — қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым."
  example_title: "Example 2"
---
# A Named Entity Recognition Model for Kazakh
- The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [KazNERD: Kazakh Named Entity Recognition Dataset](https://aclanthology.org/2022.lrec-1.44).
- The original repository for the paper can be found at https://github.com/IS2AI/KazNERD.
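
A minimal usage sketch follows, since the card declares `pipeline_tag: token-classification`. It assumes the Hub id is `yeshpanovrustem/xlm-roberta-large-ner-kazakh` (taken from the repository name) and that the `transformers` library is installed. The helper names `load_ner` and `to_spans` are hypothetical and only serve this illustration.

```python
from typing import Dict, List, Tuple

# Assumption: the Hub id matches the repository name of this model card.
MODEL_ID = "yeshpanovrustem/xlm-roberta-large-ner-kazakh"


def load_ner(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the weights on first use)."""
    # Imported lazily so that to_spans() can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word pieces into whole entity spans.
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def to_spans(entities: List[Dict]) -> List[Tuple[str, str]]:
    """Reduce aggregated pipeline output to (surface text, entity label) pairs."""
    return [(e["word"].strip(), e["entity_group"]) for e in entities]


if __name__ == "__main__":
    ner = load_ner()
    # First widget example from the front matter of this card.
    text = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."
    for word, label in to_spans(ner(text)):
        print(f"{label}\t{word}")
```

The label set printed depends on the model's configuration, so no expected output is shown here.
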
## Differences
While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset from which such tokens and duplicates were removed.
As a result, the numbers of sentences, tokens, and named entities (NEs) in the cleaned dataset differ from those of the original. The token counts reported for the original dataset also appear to have been miscalculated and should likely have been given as 1,120,387 (Train), 136,983 (Valid), 134,540 (Test), and 1,391,910 (Total), which would explain why the cleaned dataset shows more tokens than the original in the table.
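
The cleaning step described above could look roughly like the sketch below. This is not the author's actual script. It assumes each sentence is a list of `(token, tag)` pairs, that a disfluency token is fully wrapped in `(...)` and a noise token in `[...]`, and that "duplicates" means sentences that are identical after filtering. The function name `clean` and the sample tokens are hypothetical.

```python
import re
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs

# Assumption: annotation tokens are wrapped entirely in (...) or [...].
_ANNOTATION = re.compile(r"^(\(.*\)|\[.*\])$")


def clean(sentences: List[Sentence]) -> List[Sentence]:
    """Drop annotation tokens, then drop empty and duplicate sentences."""
    seen = set()
    cleaned = []
    for sent in sentences:
        # Keep only tokens that are not disfluency or noise markers.
        kept = [(tok, tag) for tok, tag in sent if not _ANNOTATION.match(tok)]
        key = tuple(kept)
        if not kept or key in seen:
            continue  # nothing left, or an exact repeat of an earlier sentence
        seen.add(key)
        cleaned.append(kept)
    return cleaned


# Hypothetical sample data, for illustration only.
sample = [
    [("(ээ)", "O"), ("Алматы", "B-GPE")],
    [("Алматы", "B-GPE"), ("[шу]", "O")],
    [("[шу]", "O")],
]
print(clean(sample))  # [[('Алматы', 'B-GPE')]]
```

The sketch assumes that removed tokens carry the `O` tag. If an annotation token could start an entity, the remaining tags would need to be repaired after filtering.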

| Dataset | Unit | Train | Valid | Test | Total |
| :---: | :---: | :---: | :---: | :---: | :---: |
| KazNERD (Original) | Sentence | 90,228 (80.06%) | 11,167 (9.91%) | 11,307 (10.03%) | 112,702 (100%) |
| KazNERD (Cleaned) | Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
| KazNERD (Original) | Token | 1,043,305 (80.11%) | 129,223 (9.92%) | 129,824 (9.97%) | 1,302,352 (100%) |
| KazNERD (Cleaned) | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
| KazNERD (Original) | NE | 109,342 (80.20%) | 13,483 (9.89%) | 13,508 (9.91%) | 136,333 (100%) |
| KazNERD (Cleaned) | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
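
The totals and percentage shares in the table can be recomputed from the per-split counts. The short sketch below does this for the cleaned sentence counts and for the suggested corrected token counts of the original dataset. The helper name `split_shares` is hypothetical.

```python
from typing import Dict


def split_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total, in percent, rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts copied from the table and the paragraph above.
cleaned_sentences = {"Train": 88_540, "Valid": 11_067, "Test": 11_068}
suggested_original_tokens = {"Train": 1_120_387, "Valid": 136_983, "Test": 134_540}

print(sum(cleaned_sentences.values()))          # 110675
print(split_shares(cleaned_sentences))          # {'Train': 80.0, 'Valid': 10.0, 'Test': 10.0}
print(sum(suggested_original_tokens.values()))  # 1391910
```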