---
language:
- en
tags:
- legal
license: apache-2.0
metrics:
- precision
- recall
---

# LEGAL-ROBERTA

We introduce LEGAL-ROBERTA, a domain-specific language representation model fine-tuned on large-scale legal corpora (4.6 GB).

## Demo

'This <mask> Agreement is between General Motors and John Murray .'

| Model        | top1       | top2       | top3     | top4      | top5         |
| ------------ | ---------- | ---------- | -------- | --------- | ------------ |
| Bert         | new        | current    | proposed | marketing | joint        |
| legalBert    | settlement | letter     | dealer   | master    | supplemental |
| legalRoberta | License    | Settlement | Contract | license   | Trust        |

> LEGAL-ROBERTA captures the (letter) case of the masked token.

'The applicant submitted that her husband was subjected to treatment amounting to <mask> whilst in the custody of Adana Security Directorate'

| Model        | top1    | top2      | top3     | top4    | top5        |
| ------------ | ------- | --------- | -------- | ------- | ----------- |
| Bert         | torture | rape      | abuse    | death   | violence    |
| legalBert    | torture | detention | arrest   | rape    | death       |
| legalRoberta | torture | abuse     | insanity | cruelty | confinement |

'Establishing a system for the identification and registration of <mask> animals and regarding the labelling of beef and beef products .'

| Model        | top1     | top2      | top3  | top4     | top5       |
| ------------ | -------- | --------- | ----- | -------- | ---------- |
| Bert         | farm     | livestock | draft | domestic | wild       |
| legalBert    | live     | beef      | farm  | pet      | dairy      |
| legalRoberta | domestic | all       | beef  | wild     | registered |
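
The predictions above are standard masked-token predictions. As a minimal sketch (not part of the original card), they can be reproduced with the `transformers` fill-mask pipeline; the model identifier below is a placeholder for this repository's id on the Hub:

```
from transformers import pipeline

# Placeholder: replace with this repository's model id.
fill_mask = pipeline("fill-mask", model="<this-model-repo>")

prompt = "This <mask> Agreement is between General Motors and John Murray ."
for prediction in fill_mask(prompt, top_k=5):
    # Each prediction contains the filled-in token and its score.
    print(prediction["token_str"], round(prediction["score"], 3))
```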

## Training data

The training data comes from three sources:

1. Patent Litigations (https://www.kaggle.com/uspto/patent-litigations): this dataset covers over 74k cases across 52 years and over 5 million relevant documents. Five different files detail the litigating parties, their attorneys, results, locations, and dates.
   - raw: 1.57 GB
   - abbreviation: PL
   - cleaned: 1.1 GB

2. Caselaw Access Project (CAP) (https://case.law/): spanning 360 years of United States case law, the Caselaw Access Project (CAP) API and bulk data services include 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
   - raw: 5.6 GB
   - abbreviation: CAP
   - cleaned: 2.8 GB

3. Google Patents Public Data (https://www.kaggle.com/bigquery/patents): a collection of publicly accessible, connected database tables for empirical analysis of the international patent system.
   - accessed via BigQuery (https://www.kaggle.com/sohier/beyond-queries-exploring-the-bigquery-api)
   - abbreviation: GPPD (1.1 GB, patents-public-data.uspto_oce_litigation.documents)
   - cleaned: 1 GB

## Training procedure

We start from a pretrained ROBERTA-BASE model and fine-tune it on the legal corpus.

Fine-tuning configuration:
- lr = 5e-5 (with lr decay, ending at 4.95e-8)
- num_epoch = 3
- total steps = 446500
- total_flos = 2.7365e18

Training loss starts at 1.850 and ends at 0.880.
The perplexity after fine-tuning on the legal corpus is 2.2735.

Device: 2 × GeForce GTX TITAN X (compute capability 5.2)
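
The original training script is not included in this card. The following is only a minimal sketch, under stated assumptions, of how such masked-language-model fine-tuning could be set up with the `transformers` Trainer: the corpus file names, tokenization step, and batch size are assumptions, while the learning rate and epoch count come from the configuration above. Note that perplexity is the exponential of the evaluation loss (ln 2.2735 ≈ 0.82).

```
import math

from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Assumption: the cleaned legal corpus is stored as plain-text files.
corpus = load_dataset("text", data_files={"train": "legal_train.txt",
                                          "validation": "legal_valid.txt"})
corpus = corpus.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="legal-roberta",
    learning_rate=5e-5,              # decays linearly during training
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumption; not reported in the card
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=corpus["train"],
    eval_dataset=corpus["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()

# Perplexity is exp(eval loss).
print("perplexity:", math.exp(trainer.evaluate()["eval_loss"]))
```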

## Eval results

We benchmarked the model on two downstream tasks: Multi-Label Classification for Legal Text and Catchphrase Retrieval with Legal Case Description.

1. LMTC: Legal Multi-Label Text Classification

Dataset:
- Total labels: 4271
- Frequent labels: 739
- Few labels: 3369
- Zero labels: 163

Hyperparameters:
- lr: 1e-05
- batch_size: 4
- max_sequence_size: 512
- max_label_size: 15
- few_threshold: 50
- epochs: 10
- dropout: 0.1
- early stop: yes
- patience: 3
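
The LMTC fine-tuning code itself is not part of this card. As an illustrative sketch only, a multi-label head on top of this model could be configured as follows; the model identifier, example text, and 0.5 decision threshold are assumptions, while the label count and maximum sequence length come from the setup above:

```
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

NUM_LABELS = 4271  # size of the LMTC label set

# Placeholder: replace with this repository's model id.
tokenizer = RobertaTokenizerFast.from_pretrained("<this-model-repo>")
model = RobertaForSequenceClassification.from_pretrained(
    "<this-model-repo>",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # per-label sigmoid + BCE loss
)

texts = ["Council Regulation establishing a system for the identification of animals ..."]
batch = tokenizer(texts, truncation=True, max_length=512, padding=True, return_tensors="pt")

with torch.no_grad():
    probs = torch.sigmoid(model(**batch).logits)

# Predict every label whose probability exceeds the threshold.
predicted_label_ids = (probs > 0.5).nonzero(as_tuple=True)[1].tolist()
```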

Results:

| Model        | Precision | Recall    | F1        | R@10      | P@10      | RP@10     | NDCG@10   |
| ------------ | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| LegalBert    | **0.866** | 0.439     | 0.582     | 0.749     | 0.368     | 0.749     | 0.753     |
| LegalRoberta | 0.859     | **0.457** | **0.596** | **0.750** | **0.369** | **0.750** | **0.754** |
| Roberta      | 0.858     | 0.440     | 0.582     | 0.743     | 0.365     | 0.743     | 0.746     |

Training time per epoch (including validation):

| Model (exp_name) | Time    |
| ---------------- | ------- |
| Bert             | 1h40min |
| Roberta          | 2h20min |

## Limitations

In the hosted Masked Language Model demo, the predicted tokens appear with the prefix **Ġ**. This looks odd, but I haven't yet been able to fix it.
In RoBERTa's byte-level BPE tokenizer, the symbol Ġ encodes a preceding space, i.e. it marks the start of a new word, and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ.

For example:
```
import transformers

tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.tokenize('I love salad'))
```
Outputs:

```
['I', 'Ġlove', 'Ġsalad']
```

So I think this is not fundamentally linked to the model itself.
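
One way to confirm that the prefix is only a tokenizer artifact (a minimal check, not from the original card): joining the tokens back into a string makes the Ġ markers disappear.

```
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokens = tokenizer.tokenize('I love salad')

# convert_tokens_to_string undoes the byte-level encoding, restoring the plain text.
print(tokenizer.convert_tokens_to_string(tokens))  # -> 'I love salad'
```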

## BibTeX entry and citation info