Add model

Files changed (8)

LICENSE.md +20 -0
README.md +44 -0
config.json +52 -0
pytorch_model.bin +3 -0
special_tokens_map.json +1 -0
tokenizer.json +0 -0
tokenizer_config.json +1 -0
vocab.txt +0 -0

LICENSE.md ADDED

	@@ -0,0 +1,20 @@

+Copyright 2021 CopperCityLabs
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md ADDED

	@@ -0,0 +1,44 @@

+---
+language: uz
+tags:
+- uzbek
+- cyrillic
+- news category classifier
+license: mit
+datasets:
+- webcrawl
+---
+# Uzbek news category classifier (based on UzBERT)
+UzBERT fine-tuned to classify news articles into one of the following
+categories:
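+- дунё (world)
+- жамият (society)
+- жиноят (crime)
+- иқтисодиёт (economy)
+- маданият (culture)
+- реклама (advertising)
+- саломатлик (health)
+- сиёсат (politics)
+- спорт (sports)
+- фан ва техника (science and technology)
+- шоу-бизнес (show business)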
+## How to use
+```python
+>>> from transformers import pipeline
+>>> classifier = pipeline('text-classification', model='coppercitylabs/uzbek-news-category-classifier')
+>>> text = """Маҳоратли пара-енгил атлетикачимиз Ҳусниддин Норбеков Токио-2020 Паралимпия ўйинларида ғалаба қозониб, делегациямиз ҳисобига навбатдаги олтин медални келтирди. Бу ҳақда МОҚ хабар берди.
+Норбеков ҳозиргина ядро улоқтириш дастурида ўз ғалабасини тантана қилди. Ушбу машқда вакилимиз 16:13 метр натижа билан энг яхши кўрсаткични қайд этди.
+Шу тариқа, делегациямиз ҳисобидаги медаллар сони 16 (6 та олтин, 4 та кумуш ва 6 та бронза) тага етди. Кейинги кун дастурларида иштирок этадиган ҳамюртларимизга омад тилаб қоламиз!"""
+>>> classifier(text)
+[{'label': 'спорт', 'score': 0.9865401983261108}]
+```
+## Fine-tuning data
+Fine-tuned on ~60K news articles for 3 epochs.
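
Note on the README example: the `score` returned by the pipeline is a probability for the top category. The following is a minimal sketch of how such a result can be derived from raw classifier logits, assuming a softmax over the 11 classes (the usual behaviour for `single_label_classification`). The `ID2LABEL` table is copied from `config.json` in this commit; the logits and the helper names `softmax` and `top_label` are made up for illustration and are not part of the model or of `transformers`.

```python
import math

# id2label as listed in this commit's config.json
ID2LABEL = {
    0: "саломатлик", 1: "жиноят", 2: "сиёсат", 3: "маданият",
    4: "фан ва техника", 5: "дунё", 6: "спорт", 7: "жамият",
    8: "иқтисодиёт", 9: "реклама", 10: "шоу-бизнес",
}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return the most probable label in the same shape the pipeline prints."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Hypothetical logits, invented for illustration only (index 6 is largest)
example_logits = [0.1, -0.3, 0.2, 0.0, -0.5, 0.4, 6.0, 0.3, -0.1, -1.0, 0.2]
result = top_label(example_logits)
print(result["label"], round(result["score"], 4))
```

With real logits from the model, the same lookup is what turns class index 6 into the `спорт` label shown in the README output.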

config.json ADDED

	@@ -0,0 +1,52 @@

+{
+  "_name_or_path": "/home/b/workspace/nlp/nlp-showcase/out/02-news-cateogry-classifier//checkpoint-2886",
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "\u0441\u0430\u043b\u043e\u043c\u0430\u0442\u043b\u0438\u043a",
+    "1": "\u0436\u0438\u043d\u043e\u044f\u0442",
+    "2": "\u0441\u0438\u0451\u0441\u0430\u0442",
+    "3": "\u043c\u0430\u0434\u0430\u043d\u0438\u044f\u0442",
+    "4": "\u0444\u0430\u043d \u0432\u0430 \u0442\u0435\u0445\u043d\u0438\u043a\u0430",
+    "5": "\u0434\u0443\u043d\u0451",
+    "6": "\u0441\u043f\u043e\u0440\u0442",
+    "7": "\u0436\u0430\u043c\u0438\u044f\u0442",
+    "8": "\u0438\u049b\u0442\u0438\u0441\u043e\u0434\u0438\u0451\u0442",
+    "9": "\u0440\u0435\u043a\u043b\u0430\u043c\u0430",
+    "10": "\u0448\u043e\u0443-\u0431\u0438\u0437\u043d\u0435\u0441"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "\u0434\u0443\u043d\u0451": 5,
+    "\u0436\u0430\u043c\u0438\u044f\u0442": 7,
+    "\u0436\u0438\u043d\u043e\u044f\u0442": 1,
+    "\u0438\u049b\u0442\u0438\u0441\u043e\u0434\u0438\u0451\u0442": 8,
+    "\u043c\u0430\u0434\u0430\u043d\u0438\u044f\u0442": 3,
+    "\u0440\u0435\u043a\u043b\u0430\u043c\u0430": 9,
+    "\u0441\u0430\u043b\u043e\u043c\u0430\u0442\u043b\u0438\u043a": 0,
+    "\u0441\u0438\u0451\u0441\u0430\u0442": 2,
+    "\u0441\u043f\u043e\u0440\u0442": 6,
+    "\u0444\u0430\u043d \u0432\u0430 \u0442\u0435\u0445\u043d\u0438\u043a\u0430": 4,
+    "\u0448\u043e\u0443-\u0431\u0438\u0437\u043d\u0435\u0441": 10
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.9.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30000
+}
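
As a sanity check on this config, the parameter count it implies can be compared with the `size` recorded in the `pytorch_model.bin` LFS pointer in this commit (436440493 bytes). The sketch below assumes the standard BERT layout (embeddings, 12 encoder layers, pooler, and one linear classification head) and float32 storage; the helper `linear` and the variable names are illustrative, not taken from the repository.

```python
# Values copied from config.json in this commit
hidden = 768          # hidden_size
intermediate = 3072   # intermediate_size
layers = 12           # num_hidden_layers
vocab = 30000         # vocab_size
max_pos = 512         # max_position_embeddings
type_vocab = 2        # type_vocab_size
num_labels = 11       # number of entries in id2label


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Assumed standard BERT structure
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # Q, K, V, output projection
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total_params = embeddings + encoder + pooler + classifier
approx_bytes = total_params * 4  # torch_dtype is float32, 4 bytes per value
lfs_size = 436440493             # size field of the pytorch_model.bin pointer

print(f"parameters: {total_params:,}")
print(f"estimated bytes: {approx_bytes:,} vs. LFS size: {lfs_size:,}")
```

The estimate lands slightly below the recorded size; the remainder is plausibly serialization overhead and non-parameter buffers, though that cannot be confirmed from the diff alone.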

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eadda71568f57657e656dadf9db0429a7dcb75232d7225a019282c3438223c81
+size 436440493
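
The three lines above are a Git LFS pointer, not the weights themselves: `oid` is the SHA-256 of the real file and `size` is its length in bytes. A minimal sketch of checking a downloaded file against such a pointer follows; the function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this note, and the local file path is whatever the reader downloaded.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in chunks so a ~436 MB file is never held in memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer text exactly as it appears in this commit
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:eadda71568f57657e656dadf9db0429a7dcb75232d7225a019282c3438223c81
size 436440493"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("pytorch_model.bin", pointer)` on a local copy would then confirm the download is intact.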

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1 @@


+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "coppercitylabs/uzbert-base-uncased", "tokenizer_class": "BertTokenizer"}
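
One consequence of `"do_lower_case": true` combined with `"strip_accents": null` is worth spelling out. In the slow `BertTokenizer`, an unset `strip_accents` follows `do_lower_case`, so text is lowercased and combining marks are removed after Unicode NFD decomposition. The sketch below imitates that normalization with the standard library only; `basic_normalize` is a hypothetical stand-in rather than a `transformers` function, and whether the fast tokenizer serialized in `tokenizer.json` applies the same steps cannot be confirmed from this diff.

```python
import unicodedata


def basic_normalize(text):
    """Lowercase, then drop combining marks (category Mn) after NFD."""
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Letters with a canonical decomposition lose their diacritic (ў, ё, й);
# letters without one (қ, ҳ, ғ) pass through unchanged.
for word in ["Ўзбекистон", "иқтисодиёт", "Ҳусниддин"]:
    print(word, "->", basic_normalize(word))
```

Under this assumption, Uzbek Cyrillic letters such as `ў` and `ё` would be folded into `у` and `е` before WordPiece lookup, which is why the vocabulary is described as uncased.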

vocab.txt ADDED

The diff for this file is too large to render.