	---
	datasets:
	- Herelles/lupan
	language:
	- fr
	tags:
	- text classification
	- pytorch
	- camembert
	- urban planning
	- natural risks
	- risk management
	- geography
	inference: false
	---
	# CamemBERT LUPAN (Local Urban Plans And Natural risks)
	## Overview

	In France, urban planning and natural risk management rely on Local Land Plans (PLU – Plan Local d'Urbanisme) and Natural Risk Prevention Plans (PPRn – Plan de Prévention des Risques naturels), which contain land use rules. To facilitate automatic extraction of these rules, we manually annotated a number of these documents for Montpellier, a rapidly evolving agglomeration exposed to natural risks, and then fine-tuned a model on the annotations.

	This model classifies French input text according to whether it contains an urban planning rule. It outputs one of 4 classes: Verifiable (the rule can be verified with satellite images), Non-verifiable (the rule cannot be verified with satellite images), Informative (non-strict rules in the form of recommendations), and Not pertinent (none of the above). For better results, we recommend adding a title and a subtitle to each input text.
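The recommendation to add a title and a subtitle can be followed with a small helper. This is a minimal sketch: `build_segment` is a hypothetical name, and joining the parts with blank lines is an assumption modeled on the example segment shown later in this card, not a documented requirement of the model.

```python
def build_segment(title, subtitle, body):
    """Join title, subtitle and body into a single input string.

    Assumption: the parts are separated by a blank line, as in the
    example segment of this model card. Empty or missing parts are skipped.
    """
    parts = [title, subtitle, body]
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return "\n\n".join(cleaned)


# Illustrative input adapted from the example segment in this card.
segment = build_segment(
    "Article 1 : Occupations ou utilisations du sol interdites",
    "1) Dans l'ensemble de la zone sont interdits :",
    "Les constructions destinées à l'habitation ne dépendant pas d'une exploitation agricole.",
)
print(segment)
```

The resulting string can be passed to the tokenizer in place of a raw paragraph.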

	For more details please refer to our article: https://www.nature.com/articles/s41597-023-02705-y

	## Training and evaluation data

	The model is fine-tuned on top of CamemBERT using our corpus:
	https://huggingface.co/datasets/Herelles/lupan

	This is the first corpus in the French language in the fields of urban planning and natural risk management.

	## Example of use

	Note: to run this code you need to have `transformers`, `torch` and `numpy` installed. You can install them with `pip install transformers torch numpy`.

	Load necessary libraries:
	```
	from transformers import CamembertTokenizer, CamembertForSequenceClassification

	import torch

	import numpy as np
	```

	Define tokenizer:
	```
	tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
	```

	Define the model:
	```
	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

	model = CamembertForSequenceClassification.from_pretrained("herelles/camembert-base-lupan")

	model.to(device)
	```

	Define segment to predict:
	```
	new_segment = '''Article 1 : Occupations ou utilisations du sol interdites

	1) Dans l’ensemble de la zone sont interdits :

	Les constructions destinées à l’habitation ne dépendant pas d’une exploitation agricole autres
	que celles visées à l’article 2 paragraphe 1).'''
	```

	Get the prediction:
	```
	test_ids = []
	test_attention_mask = []

	# Apply the tokenizer
	encoding = tokenizer(new_segment, padding="longest", return_tensors="pt")

	# Extract IDs and Attention Mask
	test_ids.append(encoding['input_ids'])
	test_attention_mask.append(encoding['attention_mask'])
	test_ids = torch.cat(test_ids, dim = 0)
	test_attention_mask = torch.cat(test_attention_mask, dim = 0)

	# Forward pass without gradient tracking, calculate logit predictions
	with torch.no_grad():
	    output = model(test_ids.to(device), token_type_ids = None, attention_mask = test_attention_mask.to(device))

	# Index of the highest logit for the single input segment
	prediction = int(np.argmax(output.logits.cpu().numpy()))

	# Map the class index to its label
	if prediction == 0:
	    pred_label = 'Not pertinent'
	elif prediction == 1:
	    pred_label = 'Pertinent (Soft)'
	elif prediction == 2:
	    pred_label = 'Pertinent (Strict, Non-verifiable)'
	elif prediction == 3:
	    pred_label = 'Pertinent (Strict, Verifiable)'

	print('Input text: ', new_segment)
	print('\n\nPredicted Class: ', pred_label)
	```
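The label lookup in the example above can also be written as a table, and a softmax over the logits gives a rough per-class score. The sketch below uses only the standard library. `ID2LABEL` and `logits_to_prediction` are hypothetical names, the label order is copied from the example above, and the logits in the usage lines are made-up values rather than real model output.

```python
import math

# Label order copied from the prediction example in this card.
ID2LABEL = {
    0: "Not pertinent",
    1: "Pertinent (Soft)",
    2: "Pertinent (Strict, Non-verifiable)",
    3: "Pertinent (Strict, Verifiable)",
}


def logits_to_prediction(logits):
    """Return (label, probabilities) for one row of class logits."""
    logits = [float(x) for x in logits]
    # Numerically stable softmax: subtract the max before exponentiating.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs


# Made-up logits for illustration only. With the real model, one could
# pass output.logits[0].tolist() from the example above instead.
label, probs = logits_to_prediction([0.1, -1.2, 0.3, 2.5])
print(label, [round(p, 3) for p in probs])
```

A low maximum probability may indicate a segment that deserves manual review, although the scores of a fine-tuned classifier are not guaranteed to be well calibrated.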

	## Online demo
	- https://huggingface.co/spaces/Herelles/segments-lupan

	## Citation

	To cite the data set please use:
	```
	@article{koptelov2023manually,
	title={A manually annotated corpus in French for the study of urbanization and the natural risk prevention},
	author={Koptelov, Maksim and Holveck, Margaux and Cremilleux, Bruno and Reynaud, Justine and Roche, Mathieu and Teisseire, Maguelonne},
	journal={Scientific Data},
	volume={10},
	number={1},
	pages={818},
	year={2023},
	publisher={Nature Publishing Group UK London}
	}
	```

	To cite the code please use:
	```
	@inproceedings{koptelov2023towards,
	title={Towards a (Semi-) Automatic Urban Planning Rule Identification in the French Language},
	author={Koptelov, Maksim and Holveck, Margaux and Cremilleux, Bruno and Reynaud, Justine and Roche, Mathieu and Teisseire, Maguelonne},
	booktitle={2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA)},
	pages={1--10},
	year={2023},
	organization={IEEE}
	}
	```