---
license: apache-2.0
datasets:
- kz-transformers/multidomain-kazakh-dataset
language:
- kk
pipeline_tag: fill-mask
library_name: transformers
widget:
- text: "Әжібай Найманбайұлы  — батыр.Албан тайпасының қызылбөрік руынан <mask>."
- text: "<mask> — Қазақстан Республикасының астанасы."
---
# Kaz-RoBERTa (base-sized model) 

## Model description

Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...
# ]
```
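
If you prefer to work with the model outputs directly, the following is a minimal PyTorch sketch (using the same public checkpoint) that scores candidate tokens for the masked position:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
model = AutoModelForMaskedLM.from_pretrained('kz-transformers/kaz-roberta-conversational')

text = "Мәтел тура, ауыспалы, астарлы <mask> қолданылады"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the 5 highest-scoring token ids
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5).indices[0].tolist()
print([tokenizer.decode([i]) for i in top_ids])
```
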
## Training data

The Kaz-RoBERTa model was pretrained on the combination of two datasets:
- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset) (Multi-Domain Bilingual Kazakh Dataset), a Kazakh-language dataset containing just over 24,883,808 unique texts from multiple domains.
- [Conversational data](https://beeline.kz/): preprocessed dialogs between the customer support team and clients of Beeline KZ (Veon Group).

Together, these datasets amount to 25 GB of text.
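
The public part of the training data (MDBKD) can be loaded with the `datasets` library; this is a minimal sketch, and the default configuration and split names on the Hub may differ:

```python
from datasets import load_dataset

# Multi-domain Kazakh corpus used for pretraining; the Beeline conversational
# dialogs are not part of this public dataset.
mdbkd = load_dataset("kz-transformers/multidomain-kazakh-dataset")
print(mdbkd)
```
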
## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The inputs of
the model take pieces of 512 contiguous tokens, which may span multiple documents. The beginning of a new document is marked
with `<s>` and the end of one with `</s>`.
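
The published tokenizer reflects this setup; a quick sketch to inspect it:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')

print(tokenizer.vocab_size)                      # 52,000 vocabulary entries (per this card)
print(tokenizer.bos_token, tokenizer.eos_token)  # '<s>' and '</s>' document boundary markers
print(tokenizer.mask_token)                      # '<mask>' token used for the MLM objective
```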

### Pretraining

The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512. The MLM masking probability was 15%, with `num_attention_heads=12` and `num_hidden_layers=6`.
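
For reference, the hyperparameters above correspond roughly to the configuration and masking collator sketched below (an illustration, not the exact training script):

```python
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          RobertaConfig, RobertaForMaskedLM)

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')

# Architecture as described in the card: 6 layers, 12 heads, 52K vocabulary,
# 512-token sequences (RoBERTa reserves 2 extra position embeddings).
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_hidden_layers=6,
    num_attention_heads=12,
)
model = RobertaForMaskedLM(config)

# Dynamic masking with the 15% MLM probability used during pretraining
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```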


### Contributions

Thanks to [@BeksultanSagyndyk](https://github.com/BeksultanSagyndyk) and [@SanzharMrz](https://github.com/SanzharMrz) for adding this model.

**Point of Contact:** [Sanzhar Murzakhmetov](mailto:sanzharmrz@gmail.com), [Beksultan Sagyndyk](mailto:nuxyjlbka@gmail.com)
---