---
language: fr
license: apache-2.0
tags:
  - legal
  - feature-extraction
datasets: maastrichtlawtech/bsard
pipeline_tag: fill-mask
widget:
  - text: >-
      Chaque commune de la Région peut adopter un <mask> communal de
      développement, applicable à l'ensemble de son territoire.
library_name: transformers
---
# Legal-CamemBERT-base
This is a CamemBERT-base model further pre-trained on more than 22,000 legal articles from Belgian legislation in French.
## Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("maastrichtlawtech/legal-camembert-base")
model = AutoModel.from_pretrained("maastrichtlawtech/legal-camembert-base")
```
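Since the card's pipeline tag is `fill-mask`, the model can presumably also be queried through the Transformers `fill-mask` pipeline. The snippet below is a minimal sketch under that assumption: the `predict_mask` helper and its `top_k` default are illustrative names, not part of the original card, and the example sentence is the widget text from the metadata.

```python
MODEL_ID = "maastrichtlawtech/legal-camembert-base"

# Widget sentence from the card metadata; CamemBERT uses "<mask>" as its mask token.
EXAMPLE = (
    "Chaque commune de la Région peut adopter un <mask> communal de "
    "développement, applicable à l'ensemble de son territoire."
)


def predict_mask(text, top_k=5):
    """Return (token, score) pairs for the masked position (hypothetical helper).

    Assumes the checkpoint ships a masked-LM head, as the pipeline tag suggests.
    """
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    predictions = fill_mask(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


# Example call (downloads the model on first use):
# for token, score in predict_mask(EXAMPLE):
#     print(f"{token}\t{score:.3f}")
```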
## Training
#### Background
We start from the camembert-base checkpoint and further pre-train it with a masked language modeling (MLM) objective on French-language legislation, using the Hugging Face training script.
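To make the MLM objective concrete, here is a small pure-Python sketch of BERT-style token masking. It is an illustration, not the training script itself: the 15% masking probability and the 80/10/10 replacement split are the common BERT defaults and are assumptions here, since the card does not state them. The function name `mask_tokens` and its parameters are hypothetical, and special-token handling is omitted for brevity.

```python
import random

IGNORE_INDEX = -100  # label value conventionally skipped by the cross-entropy loss


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for one sequence under BERT-style masking."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, token in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected: no loss is computed here
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```

Only positions whose label differs from `IGNORE_INDEX` contribute to the loss, so the model learns to reconstruct the selected tokens from their context.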
#### Hyperparameters
We train the model on a single Tesla V100 GPU with 32 GB of memory for 200 epochs (i.e., ~50k steps) using a batch size of 32. We use the AdamW optimizer with an initial learning rate of 5e-05, a weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate afterwards. The sequence length is limited to 512 tokens.
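The learning rate schedule described above (linear warmup, then linear decay) can be sketched in a few lines. This is an illustration under stated assumptions: the total of 50,000 steps is taken from the approximate "~50k" figure, the decay is assumed to reach zero at the final step, and the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=50_000):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# A few sample points along the schedule.
for step in (0, 250, 500, 25_250, 50_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at 5e-05 at step 500 and falls to half that value at step 25,250, the midpoint of the decay phase.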
#### Data
We use the Belgian Statutory Article Retrieval Dataset (BSARD) to further pre-train the model. BSARD is a native French dataset for studying legal information retrieval that includes more than 22,600 statutory articles from Belgian legislation.
## Citation
title = {Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks},
author = {Louis, Antoine and van Dijck, Gijs and Spanakis, Gerasimos},
booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics},
month = may,
year = {2023},
address = {Dubrovnik, Croatia},
publisher = {Association for Computational Linguistics},
url = {},
pages = {2753–2768},