ValentinB committed on
Commit
1532aeb
1 Parent(s): e87b142

Add model files

README.md CHANGED
@@ -1,3 +1,30 @@
  ---
  license: mit
  ---
+ Model produced in the Enrich4All project.<br>
+ A masked language modeling (MLM) model fine-tuned on a corpus related to construction permits and evaluated by perplexity.<br>
+ Baseline model: https://huggingface.co/racai/distilbert-base-romanian-cased <br>
+ Scripts and corpus used for training: https://github.com/racai-ai/e4all-models
+
+ Corpus
+ ---------------
+
+ The construction authorization corpus is meant to help interested people learn about the legal framework for activities such as building, repairing, extending, and modifying their living environment, or setting up economic activities such as establishing commercial or industrial centers. It also aims to reduce the workload of official representatives of regional administrative centers. The corpus is built to comply with Romanian legislation in this domain and is structured as sets of labeled questions, each with a single answer, covering several categories of issues:
+ * Construction activities and operations, including industrial structures, that do or do not require authorization,
+ * The necessary steps and the documents to be obtained under Romanian regulations,
+ * Validity terms,
+ * Costs involved.
+
+ The data was collected from two main sources:
+ * Internet: official sites, frequently asked questions
+ * Personal experiences of people: building permanent or temporary structures, replacing roofs and fences, installing photovoltaic panels, etc.
+
+ <br><br>
+ The construction permits corpus contains 500,351 words in 110 UTF-8 encoded files.
+
+ Results
+ -----------------
+ | MLM Task                         | Perplexity |
+ | -------------------------------- | ---------- |
+ | Baseline                         | 62.79      |
+ | Construction Permits Fine-tuning | 7.13       |
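To make the Perplexity column of the Results table easier to interpret, here is a minimal sketch. It assumes the scores were obtained the usual way, as the exponential of the mean cross-entropy loss (natural log) over masked tokens. The README does not state the exact evaluation procedure, and the function names below are hypothetical.

```python
import math


def perplexity_from_loss(mean_loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss), with the loss in nats."""
    return math.exp(mean_loss)


def mean_loss_from_perplexity(perplexity: float) -> float:
    """Inverse mapping: recover the mean loss implied by a perplexity."""
    return math.log(perplexity)


def perplexity_from_token_losses(token_losses):
    """Average the per-token losses first, then exponentiate."""
    if not token_losses:
        raise ValueError("need at least one masked-token loss")
    return math.exp(sum(token_losses) / len(token_losses))


if __name__ == "__main__":
    # Values copied from the Results table in the README.
    reported = [
        ("Baseline", 62.79),
        ("Construction Permits Fine-tuning", 7.13),
    ]
    for name, ppl in reported:
        loss = mean_loss_from_perplexity(ppl)
        print(f"{name}: perplexity {ppl} <-> mean loss {loss:.3f} nats")
```

Under that assumption, the drop in perplexity reflects a lower average loss on masked tokens from the in-domain evaluation text. It does not by itself say anything about performance on other domains.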
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "racai/distilbert-base-romanian-cased",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForMaskedLM"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "output_hidden_states": true,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": true,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.15.0",
+   "vocab_size": 50000
+ }
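As a sanity check on the configuration above, the following sketch estimates the parameter count from the config values. It assumes the standard DistilBERT layout: four attention projections and a two-layer feed-forward block per layer, two layer norms per layer, and an MLM head whose output projection shares its weight matrix with the token embeddings, as `tie_weights_` suggests. It is an estimate, not a count read from the checkpoint, and `estimate_parameters` is a hypothetical helper.

```python
import json

# Subset of config.json from this commit that determines the tensor shapes.
CONFIG_JSON = """
{
  "dim": 768,
  "hidden_dim": 3072,
  "n_layers": 6,
  "max_position_embeddings": 512,
  "vocab_size": 50000
}
"""


def estimate_parameters(cfg: dict) -> int:
    """Estimate the number of stored parameters (assumed DistilBERT layout)."""
    dim = cfg["dim"]
    hidden = cfg["hidden_dim"]
    vocab = cfg["vocab_size"]
    layer_norm = 2 * dim  # weight + bias

    # Token embeddings, position embeddings, embedding layer norm.
    embeddings = vocab * dim + cfg["max_position_embeddings"] * dim + layer_norm

    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (dim * dim + dim)
    # Two linear layers of the feed-forward block, each with a bias.
    ffn = (dim * hidden + hidden) + (hidden * dim + dim)
    per_layer = attention + ffn + 2 * layer_norm

    # MLM head: dense transform, layer norm, and only the bias of the
    # output projection, since its weight is assumed tied to the embeddings.
    mlm_head = (dim * dim + dim) + layer_norm + vocab

    return embeddings + cfg["n_layers"] * per_layer + mlm_head


if __name__ == "__main__":
    n_params = estimate_parameters(json.loads(CONFIG_JSON))
    print(f"estimated parameters: {n_params:,}")
    print(f"estimated float32 size: {4 * n_params:,} bytes")
```

With these values the estimate comes to roughly 82 million parameters, or about 328 MB at 4 bytes each. That is close to the size recorded in the `pytorch_model.bin` pointer in this commit, with the small remainder plausibly being serialization overhead.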
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f45918872a1fc2a9f9f1d2adacba6b30076499ed4be8472f3ffd63807cf81b47
+ size 327897007
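The three lines above are a Git LFS pointer, not the weights themselves. The sketch below shows one way to parse such a pointer and check a downloaded file against it. The helper names are hypothetical and the local file path is an assumption. The code handles only the `version`, `oid`, and `size` keys seen in this commit.

```python
import hashlib

# Pointer content for pytorch_model.bin, copied from this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:f45918872a1fc2a9f9f1d2adacba6b30076499ed4be8472f3ffd63807cf81b47
size 327897007
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line and decode the oid and size fields."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and SHA-256."""
    if pointer["hash_algo"] != "sha256":
        raise ValueError(f"unsupported hash: {pointer['hash_algo']}")
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large checkpoints are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    print(pointer["hash_algo"], pointer["size"])
    # Assumed local path; uncomment after downloading the real file.
    # print(matches_pointer("pytorch_model.bin", pointer))
```

The same check would apply to `training_args.bin`, which is stored as an LFS pointer in this commit as well.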
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "racai/distilbert-base-romanian-cased", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "DistilBertTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:977de6d54eef6238af5b7d474a3bd6ffadb46c386e8bd8dd8519aeaac8d4ddaa
+ size 2927
vocab.txt ADDED
The diff for this file is too large to render.