---
license: apache-2.0
datasets:
- kz-transformers/multidomain-kazakh-dataset
language:
- kk
pipeline_tag: fill-mask
library_name: transformers
widget:
- text: "Әжібай Найманбайұлы — батыр. Албан тайпасының қызылбөрік руынан <mask>."
- text: "<mask> — Қазақстан Республикасының астанасы."
---

# Kaz-RoBERTa (base-sized model)

## Model description

Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...
#  ...]
```
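
If you prefer to work with the model and tokenizer directly instead of through the pipeline, the sketch below shows one way to score the `<mask>` position yourself. It reuses one of the widget sentences from this card; the choice of 5 candidate tokens is illustrative, not something prescribed by the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
model = AutoModelForMaskedLM.from_pretrained('kz-transformers/kaz-roberta-conversational')

# One of the widget examples from this card; <mask> marks the token to predict
text = "<mask> — Қазақстан Республикасының астанасы."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take its 5 highest-scoring candidate tokens
mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print([tokenizer.decode(int(i)).strip() for i in top_ids])
```
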

## Training data

The Kaz-RoBERTa model was pretrained on the combination of two datasets:

- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset): the Multi-Domain Bilingual Kazakh Dataset, a Kazakh-language dataset containing just over 24,883,808 unique texts from multiple domains.
- [Conversational data](https://beeline.kz/): preprocessed dialogs between the customer support team and clients of Beeline KZ (Veon Group).

Together these datasets weigh 25 GB of text.
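
The MDBKD portion is hosted on the Hub and can be inspected directly. The sketch below streams a few examples with the datasets library; the default configuration and the "train" split name are assumptions, so check the dataset card if they differ.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading all ~25 GB up front.
# The "train" split name and default configuration are assumptions.
mdbkd = load_dataset("kz-transformers/multidomain-kazakh-dataset", split="train", streaming=True)

for example in mdbkd.take(3):
    print(example)
```
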

## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with `<s>` and the end of one with `</s>`.
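
As a quick sanity check of the tokenization described above, you can load the tokenizer and print its vocabulary size, the document boundary tokens, and the subword pieces it produces; the tokenized sentence below is reused from the usage example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')

# Vocabulary size (reported above as 52,000) and the document boundary tokens
print(len(tokenizer))
print(tokenizer.bos_token, tokenizer.eos_token)   # <s> </s>

# Byte-level BPE splits raw Kazakh text into subword pieces
print(tokenizer.tokenize("Мәтел тура, ауыспалы, астарлы мағынада қолданылады"))
```
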

### Pretraining

The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512. The MLM masking probability was 15%, with num_attention_heads=12 and num_hidden_layers=6.
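
For reference, the architecture and masking settings above correspond roughly to the configuration sketched below. Anything not stated in this card (hidden size, intermediate size, optimizer, and so on) is left at the library defaults and should be treated as an assumption, not a record of the actual training script.

```python
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          RobertaConfig, RobertaForMaskedLM)

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')

# Values stated in this card: vocab size 52,000, 512-token sequences,
# 12 attention heads, 6 hidden layers. Everything else is a library default.
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,   # 512 tokens plus 2 offset positions, the usual RoBERTa convention
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = RobertaForMaskedLM(config)

# 15% masking probability for the MLM objective, as stated above
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```
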

### Contributions

Thanks to [@BeksultanSagyndyk](https://github.com/BeksultanSagyndyk) and [@SanzharMrz](https://github.com/SanzharMrz) for adding this model.

**Point of Contact:** [Sanzhar Murzakhmetov](mailto:sanzharmrz@gmail.com), [Beksultan Sagyndyk](mailto:nuxyjlbka@gmail.com)

---