---
language: fr
license: cc-by-4.0
---

# Cour de Cassation semi-automatic *titrage* prediction model

Model for the semi-automatic prediction of *titrages* (keyword sequences) from *sommaires* (summaries of legal cases).

This model is similar to the automatic models described in [this paper](https://hal.inria.fr/hal-03663110/file/LREC_2022___CCass_Inria-camera-ready.pdf) and to the model available [here](https://huggingface.co/rbawden/CCASS-pred-titrages-base). If you use this semi-automatic model, please cite our research paper (see [below](#cite)).

## Model description

The model is a transformer-base model trained on parallel data (sommaires-titrages) provided by the Cour de Cassation. The model was initially trained using the Fairseq toolkit, converted to HuggingFace and then fine-tuned on the original training data to smooth out minor differences that arose during the conversion process. Tokenisation is performed using a SentencePiece model with the BPE strategy and a vocabulary size of 8000.

### Intended uses & limitations

This model is to be used to help in the production of *titrages* for those *sommaires* that do not have them or to complement existing (manually) created *titrages*. 

### How to use

Contrary to the [automatic *titrage* prediction model](https://huggingface.co/rbawden/CCASS-pred-titrages-base) (designed to predict the entire sequence), this model is designed to help in the manual production of *titrages*, by proposing the next *titre* (keyword) in the sequence given a *sommaire* and the beginning of the *titrage*.

Model input is the *matière* (matter), followed by the *titres* already decided on, followed by the text of the *sommaire*, with each element separated by the token `<t>`. Each example should be on a single line, e.g. `bail <t> résiliation <t> causes <t> La recommandation du tribunal selon l'article...` (a fictitious example for illustrative purposes, where the matter is `bail` and the beginning of the *titrage* is `résiliation <t> causes`). The maximum input length of the model is 1024 tokens (after tokenisation).

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokeniser and model (use_auth_token=True uses your stored HuggingFace login token)
tokeniser = AutoTokenizer.from_pretrained("rbawden/CCASS-semi-auto-titrages-base", use_auth_token=True)
model = AutoModelForSeq2SeqLM.from_pretrained("rbawden/CCASS-semi-auto-titrages-base", use_auth_token=True)

# Input: the matière, the titres chosen so far and the sommaire, separated by <t>
matiere_and_titrage_prefix = "matter <t> titre"
sommaire = "full text from the sommaire on a single line"
inputs = tokeniser([matiere_and_titrage_prefix + " <t> " + sommaire], return_tensors='pt')
outputs = model.generate(inputs['input_ids'])
# Decode the generated ids into text (the proposed next titre)
predictions = tokeniser.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(predictions)
```
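
The input format described above lends itself to an interactive loop in which the expert accepts or rejects each proposed *titre*. The following is a minimal sketch of such a loop, not part of the released model: the helper names `build_input` and `extend_titrage`, the `predict_next` and `accept` callbacks, and the stopping rule (stop on an empty suggestion or a rejection) are all assumptions made for illustration.

```python
from typing import Callable, List

SEP = " <t> "  # separator token described in the model card


def build_input(matiere: str, titres: List[str], sommaire: str) -> str:
    """Join the matière, the titres chosen so far and the sommaire with <t>."""
    # Each example must be on a single line, so collapse line breaks and extra spaces.
    one_line_sommaire = " ".join(sommaire.split())
    return SEP.join([matiere, *titres, one_line_sommaire])


def extend_titrage(
    matiere: str,
    sommaire: str,
    predict_next: Callable[[str], str],  # hypothetical wrapper around model.generate
    accept: Callable[[str], bool],       # hypothetical expert decision
    max_titres: int = 10,                # assumed safety limit
) -> List[str]:
    """Build a titrage one titre at a time, asking the expert at each step."""
    titres: List[str] = []
    for _ in range(max_titres):
        suggestion = predict_next(build_input(matiere, titres, sommaire))
        # Assumed stopping rule: empty suggestion or rejection by the expert.
        if not suggestion or not accept(suggestion):
            break
        titres.append(suggestion)
    return titres
```

In practice `predict_next` would wrap the tokenise, generate and decode steps shown above, and `accept` would be replaced by whatever interface the expert uses to validate or correct a suggestion.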

### Limitations and bias

The model's predictions should not be taken as ground-truth *titrages* and the final decision should be the expert's. The model is not constrained to predict *titres* that have previously been seen, so this should be taken into account when deploying the model as a *titrage* tool, in order to avoid a proliferation of different *titres*.
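
One simple way to account for unseen *titres* at deployment time is to flag any suggestion that is absent from an inventory of existing *titres*, so that the expert can decide whether a new one is really needed. The sketch below assumes such an inventory is available as a list of strings; the function names and the matching rule (case-insensitive, whitespace-insensitive) are illustrative assumptions, not part of the model.

```python
from typing import Iterable


def normalise_titre(titre: str) -> str:
    """Case-fold and collapse whitespace so that trivial variants compare equal."""
    return " ".join(titre.casefold().split())


def is_known_titre(titre: str, known_titres: Iterable[str]) -> bool:
    """Return True if the suggested titre already exists in the inventory."""
    known = {normalise_titre(t) for t in known_titres}
    return normalise_titre(titre) in known


# Hypothetical inventory of titres already in use
inventory = ["résiliation", "causes"]
for suggestion in ["  Résiliation ", "résiliation anticipée"]:
    status = "known" if is_known_titre(suggestion, inventory) else "NEW, check with expert"
    print(f"{suggestion!r}: {status}")
```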


## Training data

Training data is provided by the Cour de Cassation (the original source being Jurinet data, but with pseudo-anonymisation applied). For training, we use a total of 159,836 parallel examples (each example is a sommaire-titrage pair). Our development data consists of 1,833 held-out examples.


## Training procedure

### Preprocessing

We use SentencePiece with the BPE strategy and a joint vocabulary of 8000 tokens. The SentencePiece model was converted into the HuggingFace format and integrates a number of normalisation processes (e.g. removing double doubles, apostrophes and quotes, normalisation of different accent formats, lowercasing).
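
To give an idea of what this kind of normalisation looks like, here is a rough sketch using only the Python standard library. It is an approximation for illustration: the exact rules are those embedded in the released tokeniser, and the choices below (NFC for accents, mapping typographic quotes to ASCII, collapsing repeated whitespace) are assumptions. There is no need to apply it before calling the tokeniser.

```python
import re
import unicodedata

# Assumed mapping from typographic apostrophes and quotes to ASCII equivalents
_QUOTES = str.maketrans({
    "\u2019": "'", "\u2018": "'",
    "\u201c": '"', "\u201d": '"',
    "\u00ab": '"', "\u00bb": '"',
})


def normalise_text(text: str) -> str:
    """Approximate the normalisation steps listed above (illustrative only)."""
    # Unify accent encodings: "e" + combining acute becomes the single character "é"
    text = unicodedata.normalize("NFC", text)
    # Replace typographic apostrophes and quotes with their ASCII forms
    text = text.translate(_QUOTES)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(normalise_text("L\u2019E\u0301tat  a  statué"))
```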

### Training

The model was initially trained using Fairseq until convergence on the development set (according to our customised weighted accuracy measure - please see [the paper](https://hal.inria.fr/hal-03663110/file/LREC_2022___CCass_Inria-camera-ready.pdf) for more details). The model was then converted to HuggingFace and training was continued to smooth out inconsistencies introduced during the conversion procedure. These stem from incompatibilities in the way the SentencePiece and NMT vocabularies are defined: in HuggingFace the model vocabulary must be the same as the tokeniser vocabulary, a constraint that is not imposed in Fairseq.

### Evaluation results

Full results for the initial (automatic) Fairseq models can be found in [the paper](https://hal.inria.fr/hal-03663110/file/LREC_2022___CCass_Inria-camera-ready.pdf).

Results on this semi-automatic model coming soon!

## BibTeX entry and citation info
<a name="cite"></a>

If you use this work, please cite the following article:

Thibault Charmet, Inès Cherichi, Matthieu Allain, Urszula Czerwinska, Amaury Fouret, Benoît Sagot and Rachel Bawden, 2022. **Complex Labelling and Similarity Prediction in Legal Texts: Automatic Analysis of France’s Court of Cassation Rulings**. In Proceedings of the 13th Language Resources and Evaluation Conference, Marseille, France.

```
@inproceedings{charmet-et-al-2022-complex,
  title = {Complex Labelling and Similarity Prediction in Legal Texts: Automatic Analysis of France’s Court of Cassation Rulings},
  author = {Charmet, Thibault and Cherichi, Inès and Allain, Matthieu and Czerwinska, Urszula and Fouret, Amaury and Sagot, Benoît and Bawden, Rachel},
  booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference},
  year = {2022},
  address = {Marseille, France}
}
```