---
inference: false
language:
- bg
license: mit
datasets:
- oscar
- chitanka
- wikipedia
tags:
- torch
---

# BERT BASE (cased) finetuned on Bulgarian part-of-speech data

This is a BERT model pretrained on Bulgarian text using a masked language modeling (MLM) objective. BERT was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is cased: it makes a difference between bulgarian and Bulgarian.

The training data is Bulgarian text from [OSCAR](https://oscar-corpus.com/post/oscar-2019/), [Chitanka](https://chitanka.info/) and [Wikipedia](https://bg.wikipedia.org/).

The model was then fine-tuned on publicly available Bulgarian part-of-speech data and compressed via [progressive module replacing](https://arxiv.org/abs/2002.02925).

### How to use

Here is how to use this model in PyTorch:

```python
>>> from transformers import pipeline
>>>
>>> model = pipeline(
>>>     'token-classification',
>>>     model='rmihaylov/bert-base-pos-theseus-bg',
>>>     tokenizer='rmihaylov/bert-base-pos-theseus-bg',
>>>     device=0,
>>>     revision=None)
>>> output = model('Здравей, аз се казвам Иван.')
>>> print(output)

[{'end': 7, 'entity': 'INTJ', 'index': 1, 'score': 0.9640711, 'start': 0, 'word': '▁Здравей'},
 {'end': 8, 'entity': 'PUNCT', 'index': 2, 'score': 0.9998927, 'start': 7, 'word': ','},
 {'end': 11, 'entity': 'PRON', 'index': 3, 'score': 0.9998872, 'start': 8, 'word': '▁аз'},
 {'end': 14, 'entity': 'PRON', 'index': 4, 'score': 0.99990034, 'start': 11, 'word': '▁се'},
 {'end': 21, 'entity': 'VERB', 'index': 5, 'score': 0.99989736, 'start': 14, 'word': '▁казвам'},
 {'end': 26, 'entity': 'PROPN', 'index': 6, 'score': 0.99990785, 'start': 21, 'word': '▁Иван'},
 {'end': 27, 'entity': 'PUNCT', 'index': 7, 'score': 0.9999685, 'start': 26, 'word': '.'}]
```
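If you prefer to work with the tokenizer and model directly rather than through `pipeline`, here is a minimal sketch. It assumes the checkpoint loads with the standard 🤗 Transformers `AutoTokenizer` / `AutoModelForTokenClassification` classes (which the pipeline usage above suggests); it is not taken from the original card.

```python
>>> import torch
>>> from transformers import AutoModelForTokenClassification, AutoTokenizer
>>>
>>> # Load the fine-tuned POS checkpoint (assumed to work with the AutoClass API)
>>> tokenizer = AutoTokenizer.from_pretrained('rmihaylov/bert-base-pos-theseus-bg')
>>> model = AutoModelForTokenClassification.from_pretrained('rmihaylov/bert-base-pos-theseus-bg')
>>>
>>> # Tokenize the input sentence and run a forward pass without gradients
>>> inputs = tokenizer('Здравей, аз се казвам Иван.', return_tensors='pt')
>>> with torch.no_grad():
>>>     logits = model(**inputs).logits
>>>
>>> # Pick the highest-scoring tag per token and map ids back to label names
>>> predictions = logits.argmax(dim=-1)[0].tolist()
>>> tags = [model.config.id2label[p] for p in predictions]
>>> tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
>>> print(list(zip(tokens, tags)))
```

Note that this prints one tag per subword token (including special tokens), whereas the pipeline output above groups and scores the predictions for you.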