---
datasets:
- Herelles/lupan
language:
- fr
tags:
- text classification
- pytorch
- camembert
- urban planning
- natural risks
- risk management
- geography
---
# CamemBERT LUPAN (Local Urban Plans And Natural risks)
## Overview

In France, urban planning and natural risk management rely on the Local Land Plans (PLU – Plan Local d'Urbanisme) and the Natural Risk Prevention Plans (PPRn – Plan de Prévention des Risques naturels), which contain land use rules. To facilitate automatic extraction of these rules, we manually annotated a number of those documents concerning Montpellier, a rapidly evolving agglomeration exposed to natural risks, and then fine-tuned a model on the annotated corpus.

This model classifies French input text to determine whether it contains an urban planning rule. It outputs one of 4 classes: Verifiable (the rule can be verified with satellite images), Non-verifiable (the rule cannot be verified with satellite images), Informative (a non-strict rule in the form of a recommendation), and Not pertinent (none of the above). For better results, we recommend adding a title and a subtitle to each input text.

For more details, please refer to our article: https://www.nature.com/articles/s41597-023-02705-y
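
The recommendation in the Overview to add a title and a subtitle to each input can be sketched as a small helper. This is only an illustration: the function name `build_segment` is hypothetical, and separating the parts with blank lines is an assumption based on the layout of the example segment in this README, not a documented requirement of the model.

```python
def build_segment(title: str, subtitle: str, text: str) -> str:
    """Join a title, a subtitle and the rule text into one input string.

    Assumption: parts are separated by a blank line, mirroring the
    example segment shown in this README.
    """
    parts = [part.strip() for part in (title, subtitle, text)]
    # Skip empty parts so a missing subtitle does not leave a double gap.
    return "\n\n".join(part for part in parts if part)


segment = build_segment(
    "Article 1 : Occupations ou utilisations du sol interdites",
    "1) Dans l’ensemble de la zone sont interdits :",
    "Les constructions destinées à l’habitation ne dépendant pas d’une "
    "exploitation agricole autres que celles visées à l’article 2 paragraphe 1).",
)
print(segment)
```

The resulting string can then be passed to the tokenizer in place of a hand-written segment.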

## Training and evaluation data

The model is fine-tuned from CamemBERT on our corpus: https://huggingface.co/datasets/Herelles/lupan

This is the first French-language corpus in the fields of urban planning and natural risk management.

## Example of use

Note: to run this code, you need `transformers`, `torch` and `numpy` installed, plus `sentencepiece`, which `CamembertTokenizer` requires. You can install them with `pip install transformers torch numpy sentencepiece`.

Load the necessary libraries:
```
from transformers import CamembertTokenizer, CamembertForSequenceClassification
import torch
import numpy as np
```

Define the tokenizer:
```
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
```

Define the model:
```
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CamembertForSequenceClassification.from_pretrained("herelles/camembert-base-lupan")
model.to(device)
```

Define the segment to classify:
```
new_segment = '''Article 1 : Occupations ou utilisations du sol interdites

1) Dans l’ensemble de la zone sont interdits :

Les constructions destinées à l’habitation ne dépendant pas d’une exploitation agricole autres
que celles visées à l’article 2 paragraphe 1).'''
```

Get the prediction:
```
test_ids = []
test_attention_mask = []

# Apply the tokenizer
encoding = tokenizer(new_segment, padding="longest", return_tensors="pt")

# Extract IDs and Attention Mask
test_ids.append(encoding['input_ids'])
test_attention_mask.append(encoding['attention_mask'])
test_ids = torch.cat(test_ids, dim = 0)
test_attention_mask = torch.cat(test_attention_mask, dim = 0)

# Forward pass, calculate logit predictions
with torch.no_grad():
    output = model(test_ids.to(device), token_type_ids = None, attention_mask = test_attention_mask.to(device))

prediction = np.argmax(output.logits.cpu().numpy()).flatten().item()

if prediction == 0:
    pred_label = 'Not pertinent'
elif prediction == 1:
    pred_label = 'Pertinent (Soft)'
elif prediction == 2:
    pred_label = 'Pertinent (Strict, Non-verifiable)'
elif prediction == 3:
    pred_label = 'Pertinent (Strict, Verifiable)'

print('Input text: ', new_segment)
print('\n\nPredicted Class: ', pred_label)
```
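
If a score is wanted alongside the predicted class, the logits can be turned into probabilities with a softmax. The sketch below uses only the standard library so it runs without the model. The names `ID2LABEL`, `softmax` and `describe_prediction` are hypothetical helpers, the logits are made-up values, and the label strings are copied from the if/elif chain above (where `Pertinent (Soft)` presumably corresponds to the Informative class of the Overview).

```python
import math

# Index-to-label mapping copied from the if/elif chain in this README.
ID2LABEL = {
    0: "Not pertinent",
    1: "Pertinent (Soft)",
    2: "Pertinent (Strict, Non-verifiable)",
    3: "Pertinent (Strict, Verifiable)",
}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for one segment; with the model loaded, one would pass
# output.logits[0].tolist() instead.
label, prob = describe_prediction([0.1, 0.3, 2.5, -1.0])
print(f"{label} ({prob:.2f})")
```

The softmax value is a relative score over the four classes, not a calibrated confidence, so it is best used for ranking or thresholding experiments rather than read as a probability of correctness.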

## Citation

To cite the dataset, please use:
```
@article{koptelov2023manually,
  title={A manually annotated corpus in French for the study of urbanization and the natural risk prevention},
  author={Koptelov, Maksim and Holveck, Margaux and Cremilleux, Bruno and Reynaud, Justine and Roche, Mathieu and Teisseire, Maguelonne},
  journal={Scientific Data},
  volume={10},
  number={1},
  pages={818},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```

To cite the code, please use:
```
@inproceedings{koptelov2023towards,
  title={Towards a (Semi-) Automatic Urban Planning Rule Identification in the French Language},
  author={Koptelov, Maksim and Holveck, Margaux and Cremilleux, Bruno and Reynaud, Justine and Roche, Mathieu and Teisseire, Maguelonne},
  booktitle={2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA)},
  pages={1--10},
  year={2023},
  organization={IEEE}
}
```