ValentinB committed
Commit
bc70586
1 Parent(s): 52ce5e9

Add model files

README.md CHANGED
@@ -1,3 +1,29 @@
---
license: mit
---
+ This model was generated in the Enrich4All project.<br>
+ We evaluated the perplexity of the MLM task after fine-tuning on a corpus related to construction permits.<br>
+ Baseline model: https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1 <br>
+ Scripts and corpus used for training: https://github.com/racai-ai/e4all-models
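
As a quick way to try the fine-tuned masked-language model, here is a minimal fill-mask sketch using the `transformers` library. The repository id `racai/e4all-construction-permits-bert` and the example sentence are illustrative assumptions, not values taken from this commit; substitute the actual Hub id of this model.

```python
from transformers import pipeline

# Hypothetical repository id -- replace with the actual Hub id of this model.
fill_mask = pipeline("fill-mask", model="racai/e4all-construction-permits-bert")

# Illustrative Romanian sentence containing a single [MASK] token.
for prediction in fill_mask("Pentru construirea unei case este necesară o [MASK] de construire."):
    print(prediction["token_str"], round(prediction["score"], 3))
```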
+
+ Corpus
+ ---------------
+
+ The construction authorization corpus is meant to make it easier for interested people to inform themselves about the legal framework around activities such as building, repairing, extending, and modifying their living environment, or setting up economic activities such as establishing commercial or industrial centers. It also aims to ease and reduce the workload of official representatives of regional administrative centers. The corpus is built to comply with Romanian legislation in this domain and is structured as sets of labeled questions, each with a single answer, covering various categories of issues:
+ * construction activities and operations, including industrial structures, that do or do not require authorization,
+ * the necessary steps and documents to be obtained according to Romanian regulations,
+ * validity terms,
+ * involved costs.
+
+ The data is acquired from two main sources:
+ * the Internet: official sites and frequently asked questions,
+ * personal experience: building permanent or temporary structures, replacing roofs, building fences, installing photovoltaic panels, etc.
+
+ The construction permits corpus contains 500,351 words in 110 UTF-8 encoded files.
+
+ Results
+ -----------------
+ | Model                            | MLM perplexity |
+ | -------------------------------- | -------------- |
+ | Baseline                         | 6.71           |
+ | Construction permits fine-tuning | 3.81           |
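
The card does not include the evaluation script behind these numbers; below is a minimal sketch of how masked-LM perplexity is commonly computed (the exponential of the average cross-entropy loss over randomly masked tokens), assuming the standard 15% masking probability. The repository id and the example sentences are hypothetical placeholders; the actual corpus and scripts are in the GitHub repository linked above.

```python
import math
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Hypothetical repository id -- replace with the actual Hub id of this model.
MODEL_ID = "racai/e4all-construction-permits-bert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# Standard BERT-style random masking; the 15% rate is assumed, not stated in the card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Illustrative sentences; the real evaluation uses the construction permits corpus.
texts = [
    "Autorizația de construire se eliberează de către primărie.",
    "Documentația tehnică trebuie depusă împreună cu cererea.",
]
features = [tokenizer(t, truncation=True, max_length=512) for t in texts]
batch = collator(features)

with torch.no_grad():
    loss = model(**batch).loss  # mean cross-entropy over the masked positions

print("perplexity:", math.exp(loss.item()))
```

Because masking is random, the measured value fluctuates slightly between runs; averaging over several passes or over a larger corpus gives a more stable figure.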
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "dumitrescustefan/bert-base-romanian-cased-v1",
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.15.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 50000
+ }
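
This is a standard BERT-base layout (12 layers, 12 attention heads, hidden size 768) with the 50,000-token vocabulary of the Romanian baseline. Below is a small sketch of instantiating the architecture from such a config file; the local path is an assumption, and `from_config` builds randomly initialised weights rather than loading the fine-tuned checkpoint.

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Assumes config.json has been downloaded to the current directory.
config = AutoConfig.from_pretrained("./config.json")

# Builds the BertForMaskedLM architecture with random weights (no checkpoint loaded).
model = AutoModelForMaskedLM.from_config(config)

print(config.num_hidden_layers, config.num_attention_heads,
      config.hidden_size, config.vocab_size)  # 12 12 768 50000
```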
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:61e4cb6ee420da7895385b71a5b74b57490785937d028c6a03c54cdb9e748402
+ size 498056363
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "dumitrescustefan/bert-base-romanian-cased-v1", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c0fc9e43636370fa25e2c07fb2b186d1f239410fc188d19b79d50aacd8a87d06
+ size 2927
vocab.txt ADDED
The diff for this file is too large to render. See raw diff