Saibo-creator committed on
Commit
a22d181
1 Parent(s): 5ebd237

create model card

Files changed (1)
  1. README.md +153 -0
README.md ADDED
@@ -0,0 +1,153 @@
---
language:
- en
tags:
- legal
license: apache-2.0
metrics:
- precision
- recall
---


# LEGAL-ROBERTA
We introduce LEGAL-ROBERTA, a domain-specific language representation model fine-tuned on large-scale legal corpora (4.6 GB).

## Demo

Top-5 predictions for the masked token in three example sentences:

'This <mask> Agreement is between General Motors and John Murray .'

| Model        | top1       | top2       | top3     | top4      | top5         |
| ------------ | ---------- | ---------- | -------- | --------- | ------------ |
| Bert         | new        | current    | proposed | marketing | joint        |
| legalBert    | settlement | letter     | dealer   | master    | supplemental |
| legalRoberta | License    | Settlement | Contract | license   | Trust        |

> LegalRoberta captures the case (note the capitalized predictions, matching the casing of contract titles).

'The applicant submitted that her husband was subjected to treatment amounting to <mask> whilst in the custody of Adana Security Directorate'

| Model        | top1    | top2      | top3     | top4    | top5        |
| ------------ | ------- | --------- | -------- | ------- | ----------- |
| Bert         | torture | rape      | abuse    | death   | violence    |
| legalBert    | torture | detention | arrest   | rape    | death       |
| legalRoberta | torture | abuse     | insanity | cruelty | confinement |

'Establishing a system for the identification and registration of <mask> animals and regarding the labelling of beef and beef products .'

| Model        | top1     | top2      | top3  | top4     | top5       |
| ------------ | -------- | --------- | ----- | -------- | ---------- |
| Bert         | farm     | livestock | draft | domestic | wild       |
| legalBert    | live     | beef      | farm  | pet      | dairy      |
| legalRoberta | domestic | all       | beef  | wild     | registered |
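
These predictions can be reproduced with the `fill-mask` pipeline from `transformers`. A minimal sketch; the model id below is a placeholder for this repository's Hub id or a local checkpoint path:

```
from transformers import pipeline

# Placeholder id: replace with this repository's Hub id or a local path.
fill_mask = pipeline("fill-mask", model="saibo/legal-roberta-base")

# The pipeline returns the top-5 candidates for the masked token by default.
for pred in fill_mask("This <mask> Agreement is between General Motors and John Murray ."):
    print(pred["token_str"], round(pred["score"], 3))
```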

## Training data

The training data comes from three sources:

1. Patent Litigations (https://www.kaggle.com/uspto/patent-litigations): This dataset covers over 74k cases across 52 years and over 5 million relevant documents. Five different files detail the litigating parties, their attorneys, results, locations, and dates.
   - raw: 1.57 GB
   - abbreviation: PL
   - cleaned: 1.1 GB

2. Caselaw Access Project (CAP) (https://case.law/): Covering 360 years of United States caselaw, the Caselaw Access Project (CAP) API and bulk data services include 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
   - raw: 5.6 GB
   - abbreviation: CAP
   - cleaned: 2.8 GB

3. Google Patents Public Data (https://www.kaggle.com/bigquery/patents): The Google Patents Public Data contains a collection of publicly accessible, connected database tables for empirical analysis of the international patent system, accessed via BigQuery (https://www.kaggle.com/sohier/beyond-queries-exploring-the-bigquery-api); see the query sketch after this list.
   - abbreviation: GPPD (1.1 GB, patents-public-data.uspto_oce_litigation.documents)
   - cleaned: 1 GB
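
The GPPD source can be queried directly with the `google-cloud-bigquery` client. A minimal sketch; only the table name comes from this card, while credentials and the choice of columns are up to the user:

```
from google.cloud import bigquery

client = bigquery.Client()  # requires Google Cloud credentials

# Table name as listed above; pull a few rows to inspect the schema.
query = """
SELECT *
FROM `patents-public-data.uspto_oce_litigation.documents`
LIMIT 5
"""
for row in client.query(query).result():
    print(dict(row))
```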

## Training procedure

We start from a pretrained ROBERTA-BASE model and fine-tune it on the legal corpus.

Fine-tuning configuration:
- lr = 5e-5 (with lr decay, ending at 4.95e-8)
- num_epoch = 3
- Total steps = 446500
- Total_flos = 2.7365e18

The training loss starts at 1.850 and ends at 0.880.
The perplexity after fine-tuning on the legal corpus is 2.2735.

Device:
2 x GeForce GTX TITAN X (compute capability 5.2)
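
This fine-tuning stage is standard masked-language-model training. Below is a minimal sketch with the `transformers` Trainer, assuming the cleaned corpora are available as plain-text files; the file names and the batch size are illustrative, not the exact setup used here:

```
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Illustrative file names for the three cleaned corpora (PL, CAP, GPPD).
raw = load_dataset("text", data_files={"train": ["pl.txt", "cap.txt", "gppd.txt"]})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking with the usual 15% masking probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legal-roberta",
    learning_rate=5e-5,              # with linear decay, as in the configuration above
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumption; the card does not state the MLM batch size
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```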

## Eval results

We benchmarked the model on two downstream tasks: Multi-Label Classification for Legal Text and Catchphrase Retrieval with Legal Case Description.

1. LMTC, Legal Multi-Label Text Classification

Dataset:

- Labels shape: 4271
- Frequent labels: 739
- Few labels: 3369
- Zero labels: 163

Hyperparameters:
- lr: 1e-05
- batch_size: 4
- max_sequence_size: 512
- max_label_size: 15
- few_threshold: 50
- epochs: 10
- dropout: 0.1
- early stop: yes
- patience: 3
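
For this task, a multi-label head is placed on top of the encoder. A minimal sketch with a recent `transformers` version; the Hub id is a placeholder and this is not the exact training script behind the table below:

```
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

# Placeholder id: replace with this repository's Hub id or a local path.
model = RobertaForSequenceClassification.from_pretrained(
    "saibo/legal-roberta-base",
    num_labels=4271,                            # label set size reported above
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss over all labels
)
tokenizer = RobertaTokenizerFast.from_pretrained("saibo/legal-roberta-base")

enc = tokenizer("The applicant alleges a breach of contract ...",
                truncation=True, max_length=512, return_tensors="pt")
logits = model(**enc).logits   # shape (1, 4271); apply a sigmoid for per-label probabilities
```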

| model        | Precision | Recall    | F1        | R@10      | P@10      | RP@10     | NDCG@10   |
| ------------ | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| LegalBert    | **0.866** | 0.439     | 0.582     | 0.749     | 0.368     | 0.749     | 0.753     |
| LegalRoberta | 0.859     | **0.457** | **0.596** | **0.750** | **0.369** | **0.750** | **0.754** |
| Roberta      | 0.858     | 0.440     | 0.582     | 0.743     | 0.365     | 0.743     | 0.746     |
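
The ranked metrics (P@10, R@10) are computed from the per-label scores. A minimal numpy sketch of precision@k and recall@k, illustrative rather than the exact evaluation script used here:

```
import numpy as np

def precision_recall_at_k(y_true, y_score, k=10):
    """Mean precision@k and recall@k over samples.

    y_true: (n_samples, n_labels) binary relevance matrix.
    y_score: (n_samples, n_labels) real-valued label scores.
    """
    topk = np.argsort(-y_score, axis=1)[:, :k]       # indices of the k highest-scored labels
    hits = np.take_along_axis(y_true, topk, axis=1)  # 1 where a top-k label is actually relevant
    precision = hits.sum(axis=1) / k
    recall = hits.sum(axis=1) / np.maximum(y_true.sum(axis=1), 1)
    return precision.mean(), recall.mean()

# Toy example: 2 samples, 6 labels.
y_true = np.array([[1, 0, 1, 0, 0, 0],
                   [0, 1, 0, 0, 1, 1]])
y_score = np.array([[0.9, 0.1, 0.8, 0.2, 0.3, 0.1],
                    [0.2, 0.7, 0.1, 0.4, 0.6, 0.5]])
print(precision_recall_at_k(y_true, y_score, k=3))   # (0.833..., 1.0)
```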

Training time per epoch (including validation):

| model (exp_name) | time    |
| ---------------- | ------- |
| Bert             | 1h40min |
| Roberta          | 2h20min |

## Limitations:

In the masked-language-model demo, the predicted tokens carry a **Ġ** prefix. This looks weird, but I haven't yet been able to fix it.
With RoBERTa's byte-level BPE tokenizer, the symbol Ġ marks the beginning of a new word (it encodes the preceding space), and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ.

For example:
```
import transformers

# RoBERTa's byte-level BPE prefixes space-preceded tokens with Ġ.
tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.tokenize('I love salad'))
```
Outputs:

```
['I', 'Ġlove', 'Ġsalad']
```

So I think this is not fundamentally linked to the model itself.
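
The prefix disappears once tokens are converted back to a string; a small check with the same tokenizer:

```
# Ġ is dropped when tokens are joined back into text.
print(tokenizer.convert_tokens_to_string(['I', 'Ġlove', 'Ġsalad']))  # 'I love salad'
```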

## BibTeX entry and citation info