---
license: apache-2.0
datasets:
- kz-transformers/multidomain-kazakh-dataset
language:
- kk
pipeline_tag: fill-mask
library_name: transformers
widget:
- text: "Әжібай Найманбайұлы — батыр. Албан тайпасының қызылбөрік руынан <mask>."
- text: "<mask> — Қазақстан Республикасының астанасы."
---

# Kaz-RoBERTa (base-sized model)

## Model description

Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...
#  ...]
```
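
If you prefer to work with the model and tokenizer directly instead of through the pipeline, the sketch below shows one way to score the `<mask>` position yourself. It reuses one of the widget sentences from this card; the choice of 5 candidate tokens is illustrative, not something prescribed by the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
model = AutoModelForMaskedLM.from_pretrained('kz-transformers/kaz-roberta-conversational')

# One of the widget examples from this card; <mask> marks the token to predict
text = "<mask> — Қазақстан Республикасының астанасы."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take its 5 highest-scoring candidate tokens
mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print([tokenizer.decode(int(i)).strip() for i in top_ids])
```
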

## Training data

The Kaz-RoBERTa model was pretrained on the combination of two datasets:

- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset): the Multi-Domain Bilingual Kazakh Dataset, a Kazakh-language dataset containing just over 24,883,808 unique texts from multiple domains.
- [Conversational data](https://beeline.kz/): preprocessed dialogs between the customer support team and clients of Beeline KZ (Veon Group).

Together these datasets weigh 25 GB of text.
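
The MDBKD portion is hosted on the Hub and can be inspected directly. The sketch below streams a few examples with the datasets library; the default configuration and the "train" split name are assumptions, so check the dataset card if they differ.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading all ~25 GB up front.
# The "train" split name and default configuration are assumptions.
mdbkd = load_dataset("kz-transformers/multidomain-kazakh-dataset", split="train", streaming=True)

for example in mdbkd.take(3):
    print(example)
```
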

## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with `<s>` and the end of one with `</s>`.
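
As a quick sanity check of the tokenization described above, you can load the tokenizer and print its vocabulary size, the document boundary tokens, and the subword pieces it produces; the tokenized sentence below is reused from the usage example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')

# Vocabulary size (reported above as 52,000) and the document boundary tokens
print(len(tokenizer))
print(tokenizer.bos_token, tokenizer.eos_token)   # <s> </s>

# Byte-level BPE splits raw Kazakh text into subword pieces
print(tokenizer.tokenize("Мәтел тура, ауыспалы, астарлы мағынада қолданылады"))
```
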

### Pretraining

The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512. The MLM masking probability was 15%, with num_attention_heads=12 and num_hidden_layers=6.
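
For reference, the architecture and masking settings above correspond roughly to the configuration sketched below. Anything not stated in this card (hidden size, intermediate size, optimizer, and so on) is left at the library defaults and should be treated as an assumption, not a record of the actual training script.

```python
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          RobertaConfig, RobertaForMaskedLM)

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')

# Values stated in this card: vocab size 52,000, 512-token sequences,
# 12 attention heads, 6 hidden layers. Everything else is a library default.
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,   # 512 tokens plus 2 offset positions, the usual RoBERTa convention
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = RobertaForMaskedLM(config)

# 15% masking probability for the MLM objective, as stated above
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```
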

### Contributions

Thanks to [@BeksultanSagyndyk](https://github.com/BeksultanSagyndyk) and [@SanzharMrz](https://github.com/SanzharMrz) for adding this model.

**Point of Contact:** [Sanzhar Murzakhmetov](mailto:sanzharmrz@gmail.com), [Beksultan Sagyndyk](mailto:nuxyjlbka@gmail.com)

---