Add model

Files changed (7)

LICENSE.md +20 -0
README.md +62 -0
config.json +24 -0
pytorch_model.bin +3 -0
tokenizer.json +0 -0
tokenizer_config.json +1 -0
vocab.txt +0 -0

LICENSE.md ADDED

	@@ -0,0 +1,20 @@

+Copyright 2021 CopperCityLabs
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md ADDED

	@@ -0,0 +1,62 @@

+---
+language: uz (cyrl)
+tags:
+- uzbert
+- uzbek
+- bert
+license: MIT
+datasets:
+- webcrawl corpus (~142M words)
+---
+# UzBERT base model (uncased)
+A model pretrained on Uzbek text (Cyrillic script) using masked
+language modeling (MLM) and next sentence prediction (NSP) objectives.
+## How to use
+You can use this model directly with a pipeline for masked language modeling:
+```python
+>>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='coppercitylabs/uzbert-base-uncased')
+>>> unmasker("Алишер Навоий – улуғ ўзбек ва бошқа туркий халқларнинг [MASK], мутафаккири ва давлат арбоби бўлган.")
+[
+    {
+        'token_str': 'шоири',
+        'token': 13587,
+        'score': 0.7974384427070618,
+        'sequence': 'алишер навоий – улуғ ўзбек ва бошқа туркий халқларнинг шоири, мутафаккир ##и ва давлат арбоби бўлган.'
+    },
+    {
+        'token_str': 'олими',
+        'token': 18500,
+        'score': 0.09166576713323593,
+        'sequence': 'алишер навоий – улуғ ўзбек ва бошқа туркий халқларнинг олими, мутафаккир ##и ва давлат арбоби бўлган.'
+    },
+    {
+        'token_str': 'асосчиси',
+        'token': 7469,
+        'score': 0.02451123297214508,
+        'sequence': 'алишер навоий – улуғ ўзбек ва бошқа туркий халқларнинг асосчиси, мутафаккир ##и ва давлат арбоби бўлган.'
+    },
+    {
+        'token_str': 'ёзувчиси',
+        'token': 22439,
+        'score': 0.017601722851395607,
+        'sequence': 'алишер навоий – улуғ ўзбек ва бошқа туркий халқларнинг ёзувчиси, мутафаккир ##и ва давлат арбоби бўлган.'
+    },
+    {
+        'token_str': 'устози',
+        'token': 11494,
+        'score': 0.010115668177604675,
+        'sequence': 'алишер навоий – улуғ ўзбек ва бошқа туркий халқларнинг устози, мутафаккир ##и ва давлат арбоби бўлган.'
+    }
+]
+```
+## Training data
+The UzBERT model was pretrained on ~625K news articles.

config.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "_name_or_path": "/home/b/uzl/src/bert/../..//out/bert/model-1/",
+  "architectures": [
+    "BertForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.8.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30000
+}
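
The values in this config are enough to estimate the size of the model. The following is a minimal sketch, not code from this commit: it assumes the standard BERT layout (Q/K/V/output projections, a two-layer feed-forward block, LayerNorm after each sub-block), an MLM head whose decoder weight is tied to the word embeddings, and no pooler. The function name `estimate_bert_mlm_params` is hypothetical.

```python
# Subset of the config.json values shown above that determine parameter count.
CONFIG = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30000,
}


def estimate_bert_mlm_params(cfg):
    """Estimate trainable parameters for a BERT masked-LM model.

    Assumptions (not stated in the commit): tied decoder weights,
    no pooler, biases on every linear layer.
    """
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # scale + bias

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (
        cfg["vocab_size"] * h
        + cfg["max_position_embeddings"] * h
        + cfg["type_vocab_size"] * h
        + layer_norm
    )

    # Self-attention: four h x h projections with biases, then LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Feed-forward: h -> i -> h with biases, then LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)

    # MLM head: dense transform, LayerNorm, and a per-token output bias.
    mlm_head = (h * h + h) + layer_norm + cfg["vocab_size"]

    return embeddings + encoder + mlm_head


if __name__ == "__main__":
    n = estimate_bert_mlm_params(CONFIG)
    print(f"{n:,} parameters, about {n * 4:,} bytes at 4 bytes each")
```

Under these assumptions the estimate is 109,112,880 parameters. If the weights are stored as 32-bit floats (also an assumption), that is 436,451,520 bytes, which is close to the 436,536,363 bytes recorded for `pytorch_model.bin` in this commit; the small remainder would be consistent with serialization overhead.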

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b3d626d7b124b38437492d465aaab46841d1a876427d4be2cac427e7060b9ac7
+size 436536363
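
The three lines above are a Git LFS pointer, not the weights themselves: they record the hash algorithm, the digest, and the byte size of the real file. As a hedged sketch (the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of any library), a downloaded copy of `pytorch_model.bin` could be checked against the pointer using only the Python standard library:

```python
import hashlib
from pathlib import Path

# Pointer contents as added in this commit (diff "+" markers removed).
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b3d626d7b124b38437492d465aaab46841d1a876427d4be2cac427e7060b9ac7
size 436536363
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of an LFS pointer into a dict."""
    fields = dict(
        line.split(" ", 1) for line in text.splitlines() if line.strip()
    )
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and digest in `pointer`."""
    path = Path(path)
    # Cheap check first: a size mismatch means there is no need to hash.
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as f:
        # Read in chunks so a ~436 MB file is never fully held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    local = Path("pytorch_model.bin")  # assumed local download location
    if local.exists():
        print("matches pointer:", matches_pointer(local, pointer))
```

Comparing the size before hashing keeps the common failure case (a truncated download) fast.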

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1 @@

+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "/home/b/uzl/src/bert/../..//out/bert/", "tokenizer_class": "BertTokenizer"}

vocab.txt ADDED

The diff for this file is too large to render.