ValentinB committed
Commit 1532aeb · Parent(s): e87b142

Add model files
Files changed:
- README.md +27 -0
- config.json +25 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- training_args.bin +3 -0
- vocab.txt +0 -0
README.md
CHANGED
@@ -1,3 +1,30 @@
---
license: mit
---
The model was generated in the Enrich4All project.<br>
We evaluated the perplexity on the MLM task after fine-tuning on a corpus related to construction permits.<br>
Baseline model: https://huggingface.co/racai/distilbert-base-romanian-cased <br>
Scripts and corpus used for training: https://github.com/racai-ai/e4all-models

Corpus
---------------

The construction authorization corpus is meant to help interested people inform themselves about the legal framework for activities such as building, repairing, extending, and modifying their living environment, or setting up economic activities such as commercial or industrial centers. It also aims to ease and reduce the workload of official representatives of regional administrative centers. The corpus is built to comply with the Romanian legislation in this domain and is structured in sets of labeled questions, each with a single answer, covering various categories of issues:
* construction activities and operations, including industrial structures, which do or do not require authorization,
* the necessary steps and documents to be acquired according to the Romanian regulations,
* validity terms,
* involved costs.

The data is acquired from two main sources:
* the Internet: official sites, frequently asked questions,
* personal experience: building permanent or temporary structures, replacing roofs or fences, installing photovoltaic panels, etc.

The construction permits corpus contains 500,351 words in 110 UTF-8 encoded files.

Results
-----------------

| MLM Task                          | Perplexity    |
| --------------------------------- | ------------- |
| Baseline                          | 62.79         |
| Construction Permits Fine-tuning  | 7.13          |
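The perplexity figures above are the exponential of the mean per-token cross-entropy (negative log-likelihood) over the masked positions. A minimal sketch of that computation — the per-token losses here are made-up illustrative values, not numbers from the actual evaluation:

```python
import math

def perplexity(token_nlls):
    """Perplexity is exp of the mean negative log-likelihood
    (cross-entropy in nats) over the evaluated masked tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs collected from an MLM evaluation run:
nlls = [4.2, 3.9, 4.5, 4.1]
print(round(perplexity(nlls), 2))  # → 65.04 (baseline-level perplexity)
```

A model that assigned probability 1 to every masked token would reach the minimum perplexity of 1, so the drop from 62.79 to 7.13 reflects much higher confidence on in-domain text.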
config.json
ADDED
@@ -0,0 +1,25 @@

{
  "_name_or_path": "racai/distilbert-base-romanian-cased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_hidden_states": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": true,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.15.0",
  "vocab_size": 50000
}
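As a sanity check, the parameter count implied by this config (dim=768, hidden_dim=3072, n_layers=6, vocab_size=50000) can be estimated with a back-of-envelope sketch. The exact state-dict layout may differ slightly (e.g. around tied weights), but the result should be consistent with the 327,897,007-byte float32 checkpoint below (÷4 bytes ≈ 82M parameters):

```python
# Rough parameter-count estimate for a DistilBERT-shaped config.
# This is a sketch of the standard architecture, not an exact
# enumeration of the checkpoint's tensors.
def estimate_params(vocab_size=50000, dim=768, hidden_dim=3072,
                    n_layers=6, max_pos=512):
    emb = vocab_size * dim + max_pos * dim + 2 * dim   # token + position + LayerNorm
    attn = 4 * (dim * dim + dim)                       # q, k, v, out projections
    ffn = dim * hidden_dim + hidden_dim + hidden_dim * dim + dim
    norms = 2 * 2 * dim                                # two LayerNorms per block
    layer = attn + ffn + norms
    head = dim * dim + dim + 2 * dim + vocab_size      # MLM transform + LN + tied-projection bias
    return emb + n_layers * layer + head

print(f"{estimate_params():,}")  # → 81,964,112 (≈ 328 MB at 4 bytes/param)
```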
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:f45918872a1fc2a9f9f1d2adacba6b30076499ed4be8472f3ffd63807cf81b47
size 327897007
special_tokens_map.json
ADDED
@@ -0,0 +1 @@

{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
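These are the standard BERT-style special tokens. For MLM scoring, an input is wrapped in the cls/sep tokens and the position to score is replaced with the mask token. A sketch using the map above (the sentence is a hypothetical placeholder, not taken from the corpus):

```python
import json

# The special-token map added in this commit.
tokens = json.loads('{"unk_token": "[UNK]", "sep_token": "[SEP]", '
                    '"pad_token": "[PAD]", "cls_token": "[CLS]", '
                    '"mask_token": "[MASK]"}')

def mask_word(sentence, word, tok=tokens):
    """Wrap a sentence for MLM scoring and replace one word with the mask token."""
    body = sentence.replace(word, tok["mask_token"], 1)
    return f"{tok['cls_token']} {body} {tok['sep_token']}"

print(mask_word("The building permit is valid for 12 months.", "12"))
# → [CLS] The building permit is valid for [MASK] months. [SEP]
```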
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@

{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "racai/distilbert-base-romanian-cased", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "DistilBertTokenizer"}
training_args.bin
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:977de6d54eef6238af5b7d474a3bd6ffadb46c386e8bd8dd8519aeaac8d4ddaa
size 2927
vocab.txt
ADDED
The diff for this file is too large to render.