---
license: mit
---
This model was generated in the Enrich4All project.<br>
It was fine-tuned on the masked language modeling (MLM) task using a corpus related to construction permits, and evaluated by perplexity on that task.<br>
Baseline model: https://huggingface.co/racai/distilbert-base-romanian-cased<br>
Scripts and corpus used for training: https://github.com/racai-ai/e4all-models

Corpus
---------------

The construction authorization corpus is meant to help interested people learn about the legal framework for activities such as building, repairing, extending, and modifying their living environment, or for setting up economic activities such as establishing commercial or industrial centers. It is also intended to reduce the workload of official representatives of regional administrative centers. The corpus is built to comply with Romanian legislation in this domain and is structured as sets of labeled questions, each with a single answer, covering several categories of issues:
 * Construction activities and operations, including industrial structures, that do or do not require authorization,
 * The necessary steps and the documents to be obtained under Romanian regulations,
 * Validity terms,
 * Costs involved.
 
The data is acquired from two main sources:
 * Internet: official sites, frequently asked questions
 * Personal experiences of people: building permanent or temporary structures, replacing roofs and fences, installing photovoltaic panels, etc.

<br><br>
The construction permits corpus contains 500,351 words in 110 UTF-8 encoded files.

Results
-----------------
| MLM Task                          | Perplexity    |
| --------------------------------- | ------------- |
| Baseline                          | 62.79         |
| Construction Permits Fine-tuning  | 7.13          |
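
The card does not state how perplexity was computed. A minimal sketch, assuming the usual definition (the exponential of the mean cross-entropy over masked tokens, in nats), is shown below. The `perplexity` helper and the `example_losses` values are hypothetical and only illustrate the arithmetic; the two reported perplexities are taken from the table above.

```python
import math


def perplexity(token_losses):
    """Return exp(mean loss) for per-token cross-entropy losses in nats.

    Assumption: this is the common MLM perplexity definition, not
    necessarily the exact procedure used in the project's scripts.
    """
    if not token_losses:
        raise ValueError("need at least one masked-token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token losses, for illustration only (not project data).
example_losses = [2.1, 1.8, 2.0, 1.9]
example_ppl = perplexity(example_losses)

# Perplexities reported in the results table.
baseline_ppl = 62.79
finetuned_ppl = 7.13

# Under the assumed definition, the gap in mean cross-entropy
# (nats per masked token) is the difference of the log-perplexities.
loss_gap = math.log(baseline_ppl) - math.log(finetuned_ppl)

print(f"example perplexity: {example_ppl:.2f}")
print(f"implied mean-loss reduction: {loss_gap:.2f} nats per masked token")
```

Under this assumption, a lower perplexity means the fine-tuned model assigns higher probability to the masked tokens of the construction permits corpus than the baseline does.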