---
inference: false
language:
- bg
license: mit
datasets:
- oscar
- chitanka
- wikipedia
tags:
- torch
---

# BERT BASE (cased)

Pretrained model on the Bulgarian language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is cased: it makes a difference between bulgarian and Bulgarian.

The training data is Bulgarian text from [OSCAR](https://oscar-corpus.com/post/oscar-2019/), [Chitanka](https://chitanka.info/) and [Wikipedia](https://bg.wikipedia.org/).

The model was compressed via [progressive module replacing](https://arxiv.org/abs/2002.02925).

### How to use

Here is how to use this model in PyTorch:

```python
>>> from transformers import pipeline
>>>
>>> model = pipeline(
>>>     'fill-mask',
>>>     model='rmihaylov/bert-base-theseus-bg',
>>>     tokenizer='rmihaylov/bert-base-theseus-bg',
>>>     device=0,
>>>     revision=None)
>>> output = model("София е [MASK] на България.")
>>> print(output)
[{'score': 0.1586454212665558,
  'sequence': 'София е столица на България.',
  'token': 76074,
  'token_str': 'столица'},
 {'score': 0.12992817163467407,
  'sequence': 'София е столица на България.',
  'token': 2659,
  'token_str': 'столица'},
 {'score': 0.06064048036932945,
  'sequence': 'София е Перлата на България.',
  'token': 102146,
  'token_str': 'Перлата'},
 {'score': 0.034687548875808716,
  'sequence': 'София е представителката на България.',
  'token': 105456,
  'token_str': 'представителката'},
 {'score': 0.03053216263651848,
  'sequence': 'София е присъединяването на България.',
  'token': 18749,
  'token_str': 'присъединяването'}]
```
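
If you prefer working with the tokenizer and model objects directly rather than through a `pipeline`, here is a minimal sketch using the standard `transformers` auto classes. It assumes the checkpoint loads with `AutoModelForMaskedLM` (which the `fill-mask` pipeline above implies) and simply takes the single most likely token for the masked position:

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>>
>>> tokenizer = AutoTokenizer.from_pretrained('rmihaylov/bert-base-theseus-bg')
>>> model = AutoModelForMaskedLM.from_pretrained('rmihaylov/bert-base-theseus-bg')
>>>
>>> # Tokenize a sentence containing the [MASK] placeholder
>>> inputs = tokenizer("София е [MASK] на България.", return_tensors='pt')
>>> with torch.no_grad():
>>>     logits = model(**inputs).logits
>>>
>>> # Locate the [MASK] position and take its highest-scoring prediction
>>> mask_idx = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
>>> top_token = logits[0, mask_idx].argmax(dim=-1)
>>> print(tokenizer.decode(top_token))
```

This mirrors what the `fill-mask` pipeline does internally, but gives you access to the raw logits, e.g. for taking the top-k candidates instead of only the best one.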