---
language: fr
license: apache-2.0
tags:
- legal
- feature-extraction
datasets: maastrichtlawtech/bsard
pipeline_tag: fill-mask
widget:
- text: >-
    Chaque commune de la Région peut adopter un <mask> communal de
    développement, applicable à l'ensemble de son territoire.
library_name: transformers
---

# Legal-DistilCamemBERT-base

This is a [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) model further pre-trained on more than 22,000 statutory articles from Belgian legislation written in French.

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("maastrichtlawtech/legal-distilcamembert")
model = AutoModel.from_pretrained("maastrichtlawtech/legal-distilcamembert")
```
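
The snippet above loads the bare encoder, which returns hidden states only. Since the model card declares `fill-mask` as its pipeline tag, a masked-token query is another natural way to try the model. The following is a minimal sketch, assuming the repository also ships the masked language modeling head; the helper names `load_fill_mask` and `top_fillers` are hypothetical and not part of any library.

```python
def load_fill_mask(model_id="maastrichtlawtech/legal-distilcamembert"):
    """Build a fill-mask pipeline (downloads the weights on first call)."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_fillers(fill_mask, text, k=5):
    """Return the k most likely (token, score) pairs for the single <mask> in text."""
    # For an input with exactly one mask, the pipeline returns a list of dicts
    # that carry, among other keys, "token_str" and "score".
    predictions = fill_mask(text, top_k=k)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]


# Usage (needs network access to fetch the model):
# fill_mask = load_fill_mask()
# text = ("Chaque commune de la Région peut adopter un <mask> communal de "
#         "développement, applicable à l'ensemble de son territoire.")
# for token, score in top_fillers(fill_mask, text):
#     print(f"{token}\t{score}")
```

The example sentence is the one used by the widget in the model card metadata. Inputs with several `<mask>` tokens return a nested list, which this helper does not handle.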

## Training

#### Background

We start from the [distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) checkpoint and further pre-train it with a masked language modeling (MLM) objective on French-language legislation, using the [`run_mlm.py` script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) from Hugging Face.
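
To illustrate what the MLM objective does to a training sequence, here is a minimal pure-Python sketch of BERT-style dynamic masking. It assumes the defaults of the Hugging Face data collator (15% of tokens selected; of those, 80% replaced by the mask token, 10% by a random token, 10% left unchanged). The card does not state which masking settings were used, so these values, as well as the function name `mask_tokens` and its parameters, are illustrative assumptions.

```python
import random

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence under BERT-style masking."""
    if rng is None:
        rng = random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Special tokens are never masked; other tokens are selected
        # independently with probability mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in place
    return inputs, labels
```

The loss is computed only at positions whose label differs from `IGNORE_INDEX`. Because the selection is redrawn on every call, the same article is masked differently each time it is seen.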

#### Hyperparameters

We train the model on a single Tesla V100 GPU with 32 GB of memory for 200 epochs (i.e., ~50k steps) with a batch size of 32. We use the AdamW optimizer with an initial learning rate of 5e-05, a weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate afterwards. The sequence length is limited to 512 tokens.
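
The warmup-then-linear-decay schedule described above can be sketched as a plain function of the optimizer step. This is an illustration only: it assumes the decay ends at zero after exactly 50,000 steps (the card gives "~50k"), and the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=500, total_steps=50_000):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr over the first warmup_steps.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr to 0 at total_steps, never below 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print the rate at a few points of the run.
for step in (0, 250, 500, 25_250, 50_000):
    print(f"step {step:>6}: lr = {learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at step 500 and is halved again by step 25,250, the midpoint of the decay phase.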

#### Data

We use the [Belgian Statutory Article Retrieval Dataset (BSARD)](https://huggingface.co/datasets/maastrichtlawtech/bsard) to further pre-train the model. BSARD is a native French dataset for studying legal information retrieval that includes more than 22,600 statutory articles from Belgian legislation.

## Citation

```bibtex
@inproceedings{louis2023finding,
  title = {Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks},
  author = {Louis, Antoine and van Dijck, Gijs and Spanakis, Gerasimos},
  booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics},
  month = may,
  year = {2023},
  address = {Dubrovnik, Croatia},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2023.eacl-main.203/},
  pages = {2753--2768},
}
```
[//]: # (https://arxiv.org/abs/2301.12847)