---
language:
- en
tags:
- legal
license: apache-2.0
metrics:
- precision
- recall
---
# LEGAL-ROBERTA
We introduce LEGAL-ROBERTA, a domain-specific language representation model fine-tuned on large-scale legal corpora (4.6 GB).
## Demo
'This <mask> Agreement is between General Motors and John Murray .'
Model | top1 | top2 | top3 | top4 | top5 |
---|---|---|---|---|---|
Bert | new | current | proposed | marketing | joint |
legalBert | settlement | letter | dealer | master | supplemental |
legalRoberta | License | Settlement | Contract | license | Trust |
LegalRoberta captures letter case (note 'License' and 'license' in the predictions above).
'The applicant submitted that her husband was subjected to treatment amounting to <mask> whilst in the custody of Adana Security Directorate'
Model | top1 | top2 | top3 | top4 | top5 |
---|---|---|---|---|---|
Bert | torture | rape | abuse | death | violence |
legalBert | torture | detention | arrest | rape | death |
legalRoberta | torture | abuse | insanity | cruelty | confinement |
'Establishing a system for the identification and registration of <mask> animals and regarding the labelling of beef and beef products .'
Model | top1 | top2 | top3 | top4 | top5 |
---|---|---|---|---|---|
Bert | farm | livestock | draft | domestic | wild |
legalBert | live | beef | farm | pet | dairy |
legalRoberta | domestic | all | beef | wild | registered |
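These predictions can be reproduced with the `fill-mask` pipeline from `transformers`. A minimal sketch follows; the model identifier is an assumption (substitute the id under which LEGAL-ROBERTA is actually published on the Hugging Face Hub).

```python
from transformers import pipeline

# Model id is a placeholder: replace with the published LEGAL-ROBERTA checkpoint.
fill_mask = pipeline('fill-mask', model='saibo/legal-roberta-base')

predictions = fill_mask(
    'This <mask> Agreement is between General Motors and John Murray .'
)
# Each prediction carries the filled-in token and its score.
for p in predictions:
    print(p['token_str'], round(p['score'], 4))
```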
## Training data
The training data comes from three sources:
Patent Litigations (https://www.kaggle.com/uspto/patent-litigations): This dataset covers over 74k cases across 52 years and over 5 million relevant documents. Five different files detail the litigating parties, their attorneys, results, locations, and dates.
- raw 1.57 GB
- abbrev: PL
- clean 1.1 GB
Caselaw Access Project (CAP) (https://case.law/): Covering 360 years of United States caselaw, the Caselaw Access Project (CAP) API and bulk data services include 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
- raw 5.6 GB
- abbrev: CAP
- clean 2.8 GB
Google Patents Public Data (https://www.kaggle.com/bigquery/patents): The Google Patents Public Data contains a collection of publicly accessible, connected database tables for empirical analysis of the international patent system.
- accessed via BigQuery (https://www.kaggle.com/sohier/beyond-queries-exploring-the-bigquery-api); a query sketch follows this list
- abbrev: GPPD (1.1 GB, patents-public-data.uspto_oce_litigation.documents)
- clean 1 GB
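For the GPPD portion, the documents table can be queried directly from BigQuery. The snippet below is a sketch using the `google-cloud-bigquery` Python client; it assumes a Google Cloud project with access to the public BigQuery datasets is already configured.

```python
from google.cloud import bigquery

# Sketch only: requires Google Cloud credentials with BigQuery access.
client = bigquery.Client()

query = """
    SELECT *
    FROM `patents-public-data.uspto_oce_litigation.documents`
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))
```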
## Training procedure
We start from a pretrained ROBERTA-BASE model and fine-tune it on the legal corpora described above; a minimal code sketch of this setup follows the configuration summary.
Fine-tuning configuration:
- lr = 5e-5 (with lr decay, ending at 4.95e-8)
- num_epoch = 3
- Total steps = 446500
- Total_flos = 2.7365e18
The training loss starts at 1.850 and ends at 0.880. The perplexity after fine-tuning on the legal corpus is 2.2735.
Device: 2 x GeForce GTX TITAN X (compute capability 5.2)
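Below is a minimal sketch of this masked-LM fine-tuning setup with the Hugging Face `Trainer`. The corpus path and batch size are placeholders (not stated in this card); only the learning rate, its linear decay, and the number of epochs are taken from the configuration above.

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

# Placeholder: one text file holding the cleaned legal corpus described above.
dataset = load_dataset('text', data_files={'train': 'legal_corpus.txt'})

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset['train'].map(tokenize, batched=True, remove_columns=['text'])

# Dynamic masking for the MLM objective (standard 15% masking rate).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='legal-roberta-base',
    learning_rate=5e-5,              # decays linearly towards 0, as reported above
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumption: batch size is not given in the card
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized, data_collator=collator)
trainer.train()
```

On a held-out split, perplexity can then be read off the evaluation cross-entropy loss as `math.exp(eval_loss)`.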
## Eval results
We benchmarked the model on two downstream tasks: Multi-Label Classification for Legal Text and Catchphrase Retrieval with Legal Case Description.
### 1. LMTC: Legal Multi-Label Text Classification
Dataset:
- total labels: 4271
- frequent labels: 739
- few labels: 3369
- zero labels: 163
Hyperparameters (a fine-tuning sketch follows this list):
- lr: 1e-05
- batch_size: 4
- max_sequence_size: 512
- max_label_size: 15
- few_threshold: 50
- epochs: 10
- dropout: 0.1
- early stop: yes
- patience: 3
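The sketch below shows how such a multi-label fine-tuning setup could be wired up in `transformers`; the classification head, decision threshold, and example input are assumptions for illustration, not the exact code used for the benchmark.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

NUM_LABELS = 4271  # total label set reported above

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained(
    'roberta-base',                              # placeholder for the LEGAL-ROBERTA checkpoint
    num_labels=NUM_LABELS,
    problem_type='multi_label_classification',   # uses BCEWithLogitsLoss internally
    hidden_dropout_prob=0.1,
)

texts = ['Council Regulation on the identification and registration of bovine animals ...']
batch = tokenizer(texts, truncation=True, max_length=512,
                  padding=True, return_tensors='pt')

with torch.no_grad():
    logits = model(**batch).logits
# Labels whose sigmoid score exceeds 0.5 are predicted as present.
predicted = (torch.sigmoid(logits) > 0.5).nonzero(as_tuple=False)
print(predicted)
```

Results on the LMTC task: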
model | Precision | Recall | F1 | R@10 | P@10 | RP@10 | NDCG@10 |
---|---|---|---|---|---|---|---|
LegalBert | 0.866 | 0.439 | 0.582 | 0.749 | 0.368 | 0.749 | 0.753 |
LegalRoberta | 0.859 | 0.457 | 0.596 | 0.750 | 0.369 | 0.750 | 0.754 |
Roberta | 0.858 | 0.440 | 0.582 | 0.743 | 0.365 | 0.743 | 0.746 |
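For reference, the @10 ranking metrics can be computed along the following lines; the exact averaging used in the benchmark is not specified here, so this sketch uses example-based averaging and `sklearn`'s NDCG implementation.

```python
import numpy as np
from sklearn.metrics import ndcg_score

def precision_recall_at_k(y_true, y_score, k=10):
    """Example-averaged P@k and R@k for multi-label predictions.
    y_true: binary matrix (n_samples, n_labels); y_score: real-valued scores."""
    topk = np.argsort(-y_score, axis=1)[:, :k]          # indices of the k highest-scored labels
    hits = np.take_along_axis(y_true, topk, axis=1).sum(axis=1)
    precision = (hits / k).mean()
    recall = (hits / np.maximum(y_true.sum(axis=1), 1)).mean()
    return precision, recall

# Toy example: 2 documents, 5 labels, k=3 for brevity
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.8, 0.1, 0.3],
                    [0.1, 0.7, 0.2, 0.4, 0.6]])
print(precision_recall_at_k(y_true, y_score, k=3))
print(ndcg_score(y_true, y_score, k=3))
```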
Training time per epoch (including validation):
model (exp_name) | time |
---|---|
Bert | 1h40min |
Roberta | 2h20min |
## Limitations
In the masked-language-model demo widget, the predicted tokens carry a Ġ prefix. This looks odd, but I haven't yet been able to remove it. With a byte-level BPE tokenizer (ROBERTA's tokenizer), the symbol Ġ encodes the space preceding a token, i.e., it marks the start of a new word, and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ.
For example:

```python
import transformers

# RoBERTa uses a byte-level BPE tokenizer; tokens that start a new word
# are stored with a leading Ġ (the encoded space).
tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.tokenize('I love salad'))
```

Outputs:

```
['I', 'Ġlove', 'Ġsalad']
```
So I think this is not fundamentally linked to the model itself.
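Decoding the tokens confirms that Ġ is only a display artifact of the byte-level BPE vocabulary and disappears once the tokens are converted back to a string:

```python
import transformers

tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')
tokens = tokenizer.tokenize('I love salad')
print(tokens)                                      # ['I', 'Ġlove', 'Ġsalad']
print(tokenizer.convert_tokens_to_string(tokens))  # 'I love salad'
```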