balazik committed on
Commit
cf339ae
1 Parent(s): 3e7ac6e

Initial commit

Browse files
Files changed (8)
  1. README.md +111 -0
  2. config.json +23 -0
  3. merges.txt +0 -0
  4. pytorch_model.bin +3 -0
  5. special_tokens_map.json +1 -0
  6. tf_model.h5 +3 -0
  7. tokenizer_config.json +1 -0
  8. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,111 @@
---
language: sk
tags:
- SlovakBERT
license: mit
datasets:
- wikipedia
- opensubtitles
- oscar
- gerulatawebcrawl
- gerulatamonitoring
- blbec.online
---

# SlovakBERT (base-sized model)
SlovakBERT is a model pretrained on the Slovak language with a masked language modeling (MLM) objective. The model is case-sensitive: it makes a difference between slovensko and Slovensko.

## Intended uses & limitations
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.
**IMPORTANT**: The model was not trained on the “ and ” (typographic quote) characters, so before tokenizing a text it is advised to replace all “ and ” marks with a plain " (double quote mark).

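For instance, a minimal sketch of this replacement in Python (the helper name is purely illustrative):

```python
# Minimal sketch: replace typographic quotes with a plain double quote
# before tokenization, as advised above. The helper name is illustrative.
def normalize_quotes(text: str) -> str:
    return text.replace("“", '"').replace("”", '"')

print(normalize_quotes("Povedal “ahoj” a odišiel."))  # -> Povedal "ahoj" a odišiel.
```
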
### How to use
You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
unmasker("Deti sa <mask> na ihrisku.")

[{'sequence': 'Deti sa hrali na ihrisku.',
  'score': 0.6355380415916443,
  'token': 5949,
  'token_str': ' hrali'},
 {'sequence': 'Deti sa hrajú na ihrisku.',
  'score': 0.14731724560260773,
  'token': 9081,
  'token_str': ' hrajú'},
 {'sequence': 'Deti sa zahrali na ihrisku.',
  'score': 0.05016357824206352,
  'token': 32553,
  'token_str': ' zahrali'},
 {'sequence': 'Deti sa stretli na ihrisku.',
  'score': 0.041727423667907715,
  'token': 5964,
  'token_str': ' stretli'},
 {'sequence': 'Deti sa učia na ihrisku.',
  'score': 0.01886524073779583,
  'token': 18099,
  'token_str': ' učia'}]
```

Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = RobertaModel.from_pretrained('gerulata/slovakbert')
text = "Text ktorý sa má embedovať."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = TFRobertaModel.from_pretrained('gerulata/slovakbert')
text = "Text ktorý sa má embedovať."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
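The `output` above holds per-token hidden states; if a single vector per sentence is needed, one common approach, shown here only as an illustrative sketch continuing the PyTorch snippet, is mean pooling over the attention mask:

```python
# Illustrative sketch, continuing the PyTorch example above:
# average the token embeddings, ignoring padding positions.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```
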
Or extract information from the model like this:
```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
unmasker("Slovenské národne povstanie sa uskutočnilo v roku <mask>.")

[{'sequence': 'Slovenske narodne povstanie sa uskutočnilo v roku 1944.',
  'score': 0.7383289933204651,
  'token': 16621,
  'token_str': ' 1944'},...]
```

# Training data
The SlovakBERT model was pretrained on these datasets:

- Wikipedia (326MB of text),
- OpenSubtitles (415MB of text),
- Oscar (4.6GB of text),
- Gerulata WebCrawl (12.7GB of text),
- Gerulata Monitoring (214MB of text),
- blbec.online (4.5GB of text)

The text was then processed with the following steps:
- URL and email addresses were replaced with special tokens ("url", "email").
- Elongated punctuation was reduced (e.g. -- to -).
- Markdown syntax was deleted.
- All text content in braces, e.g. {...}, was eliminated to reduce the amount of markup and programming language text.

We segmented the resulting corpus into sentences and removed duplicates to get 181.6M unique sentences. In total, the final corpus has 19.35GB of text.
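The exact cleaning scripts are not published in this card; purely as an illustration of the kinds of replacements listed above, a regex-based sketch might look as follows (the patterns and replacement tokens are assumptions):

```python
import re

# Rough illustrative sketch of the preprocessing steps described above.
# The actual pipeline used for SlovakBERT is not published here; the
# patterns and replacement tokens below are assumptions.
def normalize(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", "url", text)       # URLs -> "url" token
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "email", text)  # e-mail addresses -> "email" token
    text = re.sub(r"([!?.,:;-])\1+", r"\1", text)               # collapse elongated punctuation (-- to -)
    text = re.sub(r"\{[^{}]*\}", "", text)                      # drop content in braces
    return text

print(normalize("Pozri https://example.sk -- alebo napíš na info@example.sk"))
# -> 'Pozri url - alebo napíš na email'
```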

# Pretraining
The model was trained in **fairseq** on 4 x Nvidia A100 GPUs for 300K steps with a batch size of 512 and a sequence length of 512. The optimizer used was Adam with a learning rate of 5e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), a weight decay of 0.01, a dropout rate of 0.1, learning-rate warmup for 10k steps and linear decay of the learning rate afterwards. We used 16-bit float precision.

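A small worked illustration of this schedule, assuming the warmup starts from zero and the linear decay reaches zero exactly at step 300k (both assumptions, not stated above):

```python
# Illustrative sketch of the learning-rate schedule described above.
# Assumptions: warmup starts at 0 and the linear decay reaches 0 at step 300k.
PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 5e-4, 10_000, 300_000

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at(5_000))    # 0.00025 (halfway through warmup)
print(lr_at(10_000))   # 0.0005  (peak)
print(lr_at(300_000))  # 0.0     (end of training)
```
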
## About us
<a href="https://www.gerulata.com/">
<img width="300px" src="https://www.gerulata.com/images/gerulata-logo-blue.png">
</a>

Gerulata uses near real-time monitoring, advanced analytics and machine learning to help create a safer, more productive and enjoyable online environment for everyone.

### BibTeX entry and citation info
- to be completed
config.json ADDED
@@ -0,0 +1,23 @@
{
  "_name_or_path": "gerulata/slovakbert",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50264
}
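As a quick illustrative cross-check, the configuration above can be loaded through transformers and a few fields inspected:

```python
from transformers import RobertaConfig

# Illustrative: load the configuration above from the Hub and inspect a few fields.
config = RobertaConfig.from_pretrained('gerulata/slovakbert')
print(config.model_type, config.hidden_size, config.num_hidden_layers, config.vocab_size)
# per config.json above: roberta 768 12 50264
```
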
merges.txt ADDED
The diff for this file is too large to render. See raw diff
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:71bf910b56cca82b2b9bf79b4ed7212cfba711fb3b90cfb79181e97f495ab130
size 499040675
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8c5a18b0c0c0e42251e20f3d5ccfd7ccd87752ee560d326ff0faa31eb4546474
size 657427592
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"errors": "replace", "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": false, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "special_tokens_map_file": null, "tokenizer_file": null, "model_max_length": 512, "name_or_path": "sk-roberta-base-300k-voc50264-20gb"}
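For reference, here is an illustrative way to confirm the special tokens declared in special_tokens_map.json and tokenizer_config.json on the loaded tokenizer:

```python
from transformers import RobertaTokenizer

# Illustrative: the special tokens and maximum length defined in the two JSON
# files above are available as attributes of the loaded tokenizer.
tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.mask_token, tokenizer.model_max_length)
# per the files above: <s> </s> <mask> 512
```
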
vocab.json ADDED
The diff for this file is too large to render. See raw diff