---
language:
- en
tags:
- legal
license: apache-2.0
metrics:
- precision
- recall
---
# LEGAL-ROBERTA
We introduce LEGAL-ROBERTA, a domain-specific language representation model fine-tuned on large-scale legal corpora (4.6 GB).
## Demo
'This <mask> Agreement is between General Motors and John Murray .'
Model | top1 | top2 | top3 | top4 | top5 |
---|---|---|---|---|---|
Bert | new | current | proposed | marketing | joint |
legalBert | settlement | letter | dealer | master | supplemental |
legalRoberta | License | Settlement | Contract | license | Trust |
LegalRoberta captures letter case (note 'License' and 'license' in the predictions above).
'The applicant submitted that her husband was subjected to treatment amounting to <mask> whilst in the custody of Adana Security Directorate'
Model | top1 | top2 | top3 | top4 | top5 |
---|---|---|---|---|---|
Bert | torture | rape | abuse | death | violence |
legalBert | torture | detention | arrest | rape | death |
legalRoberta | torture | abuse | insanity | cruelty | confinement |
'Establishing a system for the identification and registration of <mask> animals and regarding the labelling of beef and beef products .'
Model | top1 | top2 | top3 | top4 | top5 |
---|---|---|---|---|---|
Bert | farm | livestock | draft | domestic | wild |
legalBert | live | beef | farm | pet | dairy |
legalRoberta | domestic | all | beef | wild | registered |
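These predictions can be reproduced with the `fill-mask` pipeline from `transformers`. A minimal sketch follows; the model identifier is an assumption (substitute the id under which LEGAL-ROBERTA is actually published on the Hugging Face Hub).

```python
from transformers import pipeline

# Model id is a placeholder: replace with the published LEGAL-ROBERTA checkpoint.
fill_mask = pipeline('fill-mask', model='saibo/legal-roberta-base')

predictions = fill_mask(
    'This <mask> Agreement is between General Motors and John Murray .'
)
# Each prediction carries the filled-in token and its score.
for p in predictions:
    print(p['token_str'], round(p['score'], 4))
```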
## Training data
The training data comes from three sources:
Patent Litigations (https://www.kaggle.com/uspto/patent-litigations): This dataset covers over 74k cases across 52 years and over 5 million relevant documents. Five different files detail the litigating parties, their attorneys, results, locations, and dates.
- raw 1.57 GB
- abbrev: PL
- clean 1.1 GB
Caselaw Access Project (CAP) (https://case.law/): Covering 360 years of United States caselaw, the Caselaw Access Project (CAP) API and bulk data services include 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
- raw 5.6 GB
- abbrev: CAP
- clean 2.8 GB
Google Patents Public Data (https://www.kaggle.com/bigquery/patents): The Google Patents Public Data contains a collection of publicly accessible, connected database tables for empirical analysis of the international patent system.
- accessed via BigQuery (https://www.kaggle.com/sohier/beyond-queries-exploring-the-bigquery-api); a query sketch follows this list
- abbrev: GPPD (1.1 GB, patents-public-data.uspto_oce_litigation.documents)
- clean 1 GB
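For the GPPD portion, the documents table can be queried directly from BigQuery. The snippet below is a sketch using the `google-cloud-bigquery` Python client; it assumes a Google Cloud project with access to the public BigQuery datasets is already configured.

```python
from google.cloud import bigquery

# Sketch only: requires Google Cloud credentials with BigQuery access.
client = bigquery.Client()

query = """
    SELECT *
    FROM `patents-public-data.uspto_oce_litigation.documents`
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))
```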
## Training procedure
We start from a pretrained ROBERTA-BASE model and fine-tune it on the legal corpora described above; a minimal code sketch of this setup follows the configuration summary.
Fine-tuning configuration:
- lr = 5e-5 (with lr decay, ending at 4.95e-8)
- num_epoch = 3
- Total steps = 446500
- Total_flos = 2.7365e18
The training loss starts at 1.850 and ends at 0.880. The perplexity after fine-tuning on the legal corpus is 2.2735.
Device: 2 x GeForce GTX TITAN X (compute capability 5.2)
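Below is a minimal sketch of this masked-LM fine-tuning setup with the Hugging Face `Trainer`. The corpus path and batch size are placeholders (not stated in this card); only the learning rate, its linear decay, and the number of epochs are taken from the configuration above.

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

# Placeholder: one text file holding the cleaned legal corpus described above.
dataset = load_dataset('text', data_files={'train': 'legal_corpus.txt'})

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset['train'].map(tokenize, batched=True, remove_columns=['text'])

# Dynamic masking for the MLM objective (standard 15% masking rate).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='legal-roberta-base',
    learning_rate=5e-5,              # decays linearly towards 0, as reported above
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumption: batch size is not given in the card
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized, data_collator=collator)
trainer.train()
```

On a held-out split, perplexity can then be read off the evaluation cross-entropy loss as `math.exp(eval_loss)`.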
## Eval results
We benchmarked the model on two downstream tasks: Multi-Label Classification for Legal Text and Catchphrase Retrieval with Legal Case Description.
### 1. LMTC: Legal Multi-Label Text Classification
Dataset:
- total labels: 4271
- frequent labels: 739
- few labels: 3369
- zero labels: 163
Hyperparameters (a fine-tuning sketch follows this list):
- lr: 1e-05
- batch_size: 4
- max_sequence_size: 512
- max_label_size: 15
- few_threshold: 50
- epochs: 10
- dropout: 0.1
- early stop: yes
- patience: 3
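The sketch below shows how such a multi-label fine-tuning setup could be wired up in `transformers`; the classification head, decision threshold, and example input are assumptions for illustration, not the exact code used for the benchmark.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

NUM_LABELS = 4271  # total label set reported above

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained(
    'roberta-base',                              # placeholder for the LEGAL-ROBERTA checkpoint
    num_labels=NUM_LABELS,
    problem_type='multi_label_classification',   # uses BCEWithLogitsLoss internally
    hidden_dropout_prob=0.1,
)

texts = ['Council Regulation on the identification and registration of bovine animals ...']
batch = tokenizer(texts, truncation=True, max_length=512,
                  padding=True, return_tensors='pt')

with torch.no_grad():
    logits = model(**batch).logits
# Labels whose sigmoid score exceeds 0.5 are predicted as present.
predicted = (torch.sigmoid(logits) > 0.5).nonzero(as_tuple=False)
print(predicted)
```

Results on the LMTC task: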
model | Precision | Recall | F1 | R@10 | P@10 | RP@10 | NDCG@10 |
---|---|---|---|---|---|---|---|
LegalBert | 0.866 | 0.439 | 0.582 | 0.749 | 0.368 | 0.749 | 0.753 |
LegalRoberta | 0.859 | 0.457 | 0.596 | 0.750 | 0.369 | 0.750 | 0.754 |
Roberta | 0.858 | 0.440 | 0.582 | 0.743 | 0.365 | 0.743 | 0.746 |
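For reference, the @10 ranking metrics can be computed along the following lines; the exact averaging used in the benchmark is not specified here, so this sketch uses example-based averaging and `sklearn`'s NDCG implementation.

```python
import numpy as np
from sklearn.metrics import ndcg_score

def precision_recall_at_k(y_true, y_score, k=10):
    """Example-averaged P@k and R@k for multi-label predictions.
    y_true: binary matrix (n_samples, n_labels); y_score: real-valued scores."""
    topk = np.argsort(-y_score, axis=1)[:, :k]          # indices of the k highest-scored labels
    hits = np.take_along_axis(y_true, topk, axis=1).sum(axis=1)
    precision = (hits / k).mean()
    recall = (hits / np.maximum(y_true.sum(axis=1), 1)).mean()
    return precision, recall

# Toy example: 2 documents, 5 labels, k=3 for brevity
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.8, 0.1, 0.3],
                    [0.1, 0.7, 0.2, 0.4, 0.6]])
print(precision_recall_at_k(y_true, y_score, k=3))
print(ndcg_score(y_true, y_score, k=3))
```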
Training time per epoch (including validation):
model (exp_name) | time |
---|---|
Bert | 1h40min |
Roberta | 2h20min |
## Limitations
In the masked-language-model demo widget, the predicted tokens carry a Ġ prefix. This looks odd, but I haven't yet been able to remove it. With a byte-level BPE tokenizer (ROBERTA's tokenizer), the symbol Ġ encodes the space preceding a token, i.e., it marks the start of a new word, and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ.
For example:

```python
import transformers

# RoBERTa uses a byte-level BPE tokenizer; tokens that start a new word
# are stored with a leading Ġ (the encoded space).
tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.tokenize('I love salad'))
```

Outputs:

```
['I', 'Ġlove', 'Ġsalad']
```
So I think this is not fundamentally linked to the model itself.
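Decoding the tokens confirms that Ġ is only a display artifact of the byte-level BPE vocabulary and disappears once the tokens are converted back to a string:

```python
import transformers

tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')
tokens = tokenizer.tokenize('I love salad')
print(tokens)                                      # ['I', 'Ġlove', 'Ġsalad']
print(tokenizer.convert_tokens_to_string(tokens))  # 'I love salad'
```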