Saibo-creator committed on
Commit
a22d181
1 Parent(s): 5ebd237

create model card

Files changed (1)
  1. README.md +153 -0
README.md ADDED
@@ -0,0 +1,153 @@
---
language:
- en
tags:
- legal
license: apache-2.0
metrics:
- precision
- recall
---


# LEGAL-ROBERTA
We introduce LEGAL-ROBERTA, a domain-specific language representation model fine-tuned on large-scale legal corpora (4.6 GB).

## Demo

Top-5 predictions for the masked token in three example sentences:

'This <mask> Agreement is between General Motors and John Murray .'

| Model        | top1       | top2       | top3     | top4      | top5         |
| ------------ | ---------- | ---------- | -------- | --------- | ------------ |
| Bert         | new        | current    | proposed | marketing | joint        |
| legalBert    | settlement | letter     | dealer   | master    | supplemental |
| legalRoberta | License    | Settlement | Contract | license   | Trust        |

> LegalRoberta captures the case (note the capitalized predictions, matching the casing of contract titles).

'The applicant submitted that her husband was subjected to treatment amounting to <mask> whilst in the custody of Adana Security Directorate'

| Model        | top1    | top2      | top3     | top4    | top5        |
| ------------ | ------- | --------- | -------- | ------- | ----------- |
| Bert         | torture | rape      | abuse    | death   | violence    |
| legalBert    | torture | detention | arrest   | rape    | death       |
| legalRoberta | torture | abuse     | insanity | cruelty | confinement |

'Establishing a system for the identification and registration of <mask> animals and regarding the labelling of beef and beef products .'

| Model        | top1     | top2      | top3  | top4     | top5       |
| ------------ | -------- | --------- | ----- | -------- | ---------- |
| Bert         | farm     | livestock | draft | domestic | wild       |
| legalBert    | live     | beef      | farm  | pet      | dairy      |
| legalRoberta | domestic | all       | beef  | wild     | registered |
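
These predictions can be reproduced with the `fill-mask` pipeline from `transformers`. A minimal sketch; the model id below is a placeholder for this repository's Hub id or a local checkpoint path:

```
from transformers import pipeline

# Placeholder id: replace with this repository's Hub id or a local path.
fill_mask = pipeline("fill-mask", model="saibo/legal-roberta-base")

# The pipeline returns the top-5 candidates for the masked token by default.
for pred in fill_mask("This <mask> Agreement is between General Motors and John Murray ."):
    print(pred["token_str"], round(pred["score"], 3))
```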

## Training data

The training data comes from three sources:

1. Patent Litigations (https://www.kaggle.com/uspto/patent-litigations): This dataset covers over 74k cases across 52 years and over 5 million relevant documents. Five different files detail the litigating parties, their attorneys, results, locations, and dates.
   - raw: 1.57 GB
   - abbreviation: PL
   - cleaned: 1.1 GB

2. Caselaw Access Project (CAP) (https://case.law/): Covering 360 years of United States caselaw, the Caselaw Access Project (CAP) API and bulk data services include 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
   - raw: 5.6 GB
   - abbreviation: CAP
   - cleaned: 2.8 GB

3. Google Patents Public Data (https://www.kaggle.com/bigquery/patents): The Google Patents Public Data contains a collection of publicly accessible, connected database tables for empirical analysis of the international patent system, accessed via BigQuery (https://www.kaggle.com/sohier/beyond-queries-exploring-the-bigquery-api); see the query sketch after this list.
   - abbreviation: GPPD (1.1 GB, patents-public-data.uspto_oce_litigation.documents)
   - cleaned: 1 GB
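
The GPPD source can be queried directly with the `google-cloud-bigquery` client. A minimal sketch; only the table name comes from this card, while credentials and the choice of columns are up to the user:

```
from google.cloud import bigquery

client = bigquery.Client()  # requires Google Cloud credentials

# Table name as listed above; pull a few rows to inspect the schema.
query = """
SELECT *
FROM `patents-public-data.uspto_oce_litigation.documents`
LIMIT 5
"""
for row in client.query(query).result():
    print(dict(row))
```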

## Training procedure

We start from a pretrained ROBERTA-BASE model and fine-tune it on the legal corpus.

Fine-tuning configuration:
- lr = 5e-5 (with lr decay, ending at 4.95e-8)
- num_epoch = 3
- Total steps = 446500
- Total_flos = 2.7365e18

The training loss starts at 1.850 and ends at 0.880.
The perplexity after fine-tuning on the legal corpus is 2.2735.

Device:
2 x GeForce GTX TITAN X (compute capability 5.2)
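
This fine-tuning stage is standard masked-language-model training. Below is a minimal sketch with the `transformers` Trainer, assuming the cleaned corpora are available as plain-text files; the file names and the batch size are illustrative, not the exact setup used here:

```
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Illustrative file names for the three cleaned corpora (PL, CAP, GPPD).
raw = load_dataset("text", data_files={"train": ["pl.txt", "cap.txt", "gppd.txt"]})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking with the usual 15% masking probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legal-roberta",
    learning_rate=5e-5,              # with linear decay, as in the configuration above
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumption; the card does not state the MLM batch size
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```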

## Eval results

We benchmarked the model on two downstream tasks: Multi-Label Classification for Legal Text and Catchphrase Retrieval with Legal Case Description.

1. LMTC, Legal Multi-Label Text Classification

Dataset:

- Labels shape: 4271
- Frequent labels: 739
- Few labels: 3369
- Zero labels: 163

Hyperparameters:
- lr: 1e-05
- batch_size: 4
- max_sequence_size: 512
- max_label_size: 15
- few_threshold: 50
- epochs: 10
- dropout: 0.1
- early stop: yes
- patience: 3
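
For this task, a multi-label head is placed on top of the encoder. A minimal sketch with a recent `transformers` version; the Hub id is a placeholder and this is not the exact training script behind the table below:

```
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

# Placeholder id: replace with this repository's Hub id or a local path.
model = RobertaForSequenceClassification.from_pretrained(
    "saibo/legal-roberta-base",
    num_labels=4271,                            # label set size reported above
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss over all labels
)
tokenizer = RobertaTokenizerFast.from_pretrained("saibo/legal-roberta-base")

enc = tokenizer("The applicant alleges a breach of contract ...",
                truncation=True, max_length=512, return_tensors="pt")
logits = model(**enc).logits   # shape (1, 4271); apply a sigmoid for per-label probabilities
```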

| model        | Precision | Recall    | F1        | R@10      | P@10      | RP@10     | NDCG@10   |
| ------------ | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| LegalBert    | **0.866** | 0.439     | 0.582     | 0.749     | 0.368     | 0.749     | 0.753     |
| LegalRoberta | 0.859     | **0.457** | **0.596** | **0.750** | **0.369** | **0.750** | **0.754** |
| Roberta      | 0.858     | 0.440     | 0.582     | 0.743     | 0.365     | 0.743     | 0.746     |
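
The ranked metrics (P@10, R@10) are computed from the per-label scores. A minimal numpy sketch of precision@k and recall@k, illustrative rather than the exact evaluation script used here:

```
import numpy as np

def precision_recall_at_k(y_true, y_score, k=10):
    """Mean precision@k and recall@k over samples.

    y_true: (n_samples, n_labels) binary relevance matrix.
    y_score: (n_samples, n_labels) real-valued label scores.
    """
    topk = np.argsort(-y_score, axis=1)[:, :k]       # indices of the k highest-scored labels
    hits = np.take_along_axis(y_true, topk, axis=1)  # 1 where a top-k label is actually relevant
    precision = hits.sum(axis=1) / k
    recall = hits.sum(axis=1) / np.maximum(y_true.sum(axis=1), 1)
    return precision.mean(), recall.mean()

# Toy example: 2 samples, 6 labels.
y_true = np.array([[1, 0, 1, 0, 0, 0],
                   [0, 1, 0, 0, 1, 1]])
y_score = np.array([[0.9, 0.1, 0.8, 0.2, 0.3, 0.1],
                    [0.2, 0.7, 0.1, 0.4, 0.6, 0.5]])
print(precision_recall_at_k(y_true, y_score, k=3))   # (0.833..., 1.0)
```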

Training time per epoch (including validation):

| model (exp_name) | time    |
| ---------------- | ------- |
| Bert             | 1h40min |
| Roberta          | 2h20min |

## Limitations:

In the masked-language-model demo, the predicted tokens carry a **Ġ** prefix. This looks weird, but I haven't yet been able to fix it.
With RoBERTa's byte-level BPE tokenizer, the symbol Ġ marks the beginning of a new word (it encodes the preceding space), and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ.

For example:
```
import transformers

# RoBERTa's byte-level BPE prefixes space-preceded tokens with Ġ.
tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')
print(tokenizer.tokenize('I love salad'))
```
Outputs:

```
['I', 'Ġlove', 'Ġsalad']
```

So I think this is not fundamentally linked to the model itself.
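
The prefix disappears once tokens are converted back to a string; a small check with the same tokenizer:

```
# Ġ is dropped when tokens are joined back into text.
print(tokenizer.convert_tokens_to_string(['I', 'Ġlove', 'Ġsalad']))  # 'I love salad'
```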

## BibTeX entry and citation info