nlpaueb committed
Commit c6b4b0d
1 Parent(s): 07215e2

Update README.md

Files changed (1):
  1. README.md +28 -5
README.md CHANGED
@@ -18,7 +18,7 @@ LEGAL-BERT is a family of BERT models for the legal domain, intended to assist l
 
 ---
 
-I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. "LEGAL-BERT: The Muppets straight out of Law School". In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) (Short Papers), to be held online, 2020. (https://arxiv.org/abs/2010.02559)
+I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. "LEGAL-BERT: The Muppets straight out of Law School". In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) (Short Papers), to be held online, 2020. (https://aclanthology.org/2020.findings-emnlp.261)
 
 ---
 
@@ -30,7 +30,7 @@ The pre-training corpora of LEGAL-BERT include:
 
 * 61,826 documents of UK legislation, publicly available from the UK legislation portal (http://www.legislation.gov.uk).
 
-* 19,867 cases from European Court of Justice (ECJ), also available from EURLEX.
+* 19,867 cases from the European Court of Justice (ECJ), also available from EURLEX.
 
 * 12,554 cases from HUDOC, the repository of the European Court of Human Rights (ECHR) (http://hudoc.echr.coe.int/eng).
 
@@ -45,6 +45,7 @@ The pre-training corpora of LEGAL-BERT include:
 * We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 with an initial learning rate 1e-4.
 * We were able to use a single Google Cloud TPU v3-8 provided for free from [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc), while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!
 * Part of LEGAL-BERT is a light-weight model pre-trained from scratch on legal data, which achieves comparable performance to larger models, while being much more efficient (approximately 4 times faster) with a smaller environmental footprint.
+
 ## Models list
 
 | Model name | Model Path | Training corpora |
@@ -52,9 +53,13 @@ The pre-training corpora of LEGAL-BERT include:
 | CONTRACTS-BERT-BASE | `nlpaueb/bert-base-uncased-contracts` | US contracts |
 | EURLEX-BERT-BASE | `nlpaueb/bert-base-uncased-eurlex` | EU legislation |
 | ECHR-BERT-BASE | `nlpaueb/bert-base-uncased-echr` | ECHR cases |
-| LEGAL-BERT-BASE | `nlpaueb/legal-bert-base-uncased` | All |
+| LEGAL-BERT-BASE * | `nlpaueb/legal-bert-base-uncased` | All |
 | LEGAL-BERT-SMALL | `nlpaueb/legal-bert-small-uncased` | All |
 
+\* LEGAL-BERT-BASE is the model referred to as LEGAL-BERT-SC in Chalkidis et al. (2020): a model trained from scratch on the legal corpora listed above, using a newly created vocabulary from a sentence-piece tokenizer trained on those same corpora.
+
+\*\* Since many of you expressed interest in the LEGAL-BERT-FP models (those relying on the original BERT-BASE checkpoint), they have been released on Archive.org (https://archive.org/details/legal_bert_fp); these models are secondary and are probably only of interest to those who aim to dig deeper into the open questions of Chalkidis et al. (2020).
+
 ## Load Pretrained Model
 
 ```python
@@ -97,9 +102,27 @@ model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
 
 ## Evaluation on downstream tasks
 
-Consider the experiments in the article "LEGAL-BERT: The Muppets straight out of Law School". Chalkidis et al., 2018, (https://arxiv.org/abs/2010.02559)
+Consider the experiments in the article "LEGAL-BERT: The Muppets straight out of Law School", Chalkidis et al., 2020 (https://aclanthology.org/2020.findings-emnlp.261).
 
-## Author
+## Author - Publication
+
+```
+@inproceedings{chalkidis-etal-2020-legal,
+    title = "{LEGAL}-{BERT}: The Muppets straight out of Law School",
+    author = "Chalkidis, Ilias and
+      Fergadiotis, Manos and
+      Malakasiotis, Prodromos and
+      Aletras, Nikolaos and
+      Androutsopoulos, Ion",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
+    month = nov,
+    year = "2020",
+    address = "Online",
+    publisher = "Association for Computational Linguistics",
+    doi = "10.18653/v1/2020.findings-emnlp.261",
+    pages = "2898--2904"
+}
+```
 
 Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
 
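For readers skimming this commit: the README's "Load Pretrained Model" section is only hinted at in the diff context above (its hunk header shows `model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")`), and it amounts to loading any checkpoint from the models list through the Hugging Face `transformers` auto classes. Below is a minimal sketch, assuming a recent `transformers` installation; the fill-mask probe and its example sentence are purely illustrative and are not part of the README itself.

```python
from transformers import AutoTokenizer, AutoModel, pipeline

# Load the tokenizer and encoder weights; any path from the "Models list"
# table works here (e.g. nlpaueb/legal-bert-small-uncased for the small model).
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Illustrative sanity check (not from the README): query the masked-language-modelling head.
fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")
print(fill_mask("This agreement is governed by the laws of the [MASK] of California."))
```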