nlpaueb committed
Commit c6b4b0d
1 Parent(s): 07215e2

Update README.md

Files changed (1):
  1. README.md +28 -5
README.md CHANGED
@@ -18,7 +18,7 @@ LEGAL-BERT is a family of BERT models for the legal domain, intended to assist l
 
 ---
 
-I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. "LEGAL-BERT: The Muppets straight out of Law School". In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) (Short Papers), to be held online, 2020. (https://arxiv.org/abs/2010.02559)
+I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. "LEGAL-BERT: The Muppets straight out of Law School". In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) (Short Papers), to be held online, 2020. (https://aclanthology.org/2020.findings-emnlp.261)
 
 ---
 
@@ -30,7 +30,7 @@ The pre-training corpora of LEGAL-BERT include:
 
 * 61,826 documents of UK legislation, publicly available from the UK legislation portal (http://www.legislation.gov.uk).
 
-* 19,867 cases from European Court of Justice (ECJ), also available from EURLEX.
+* 19,867 cases from the European Court of Justice (ECJ), also available from EURLEX.
 
 * 12,554 cases from HUDOC, the repository of the European Court of Human Rights (ECHR) (http://hudoc.echr.coe.int/eng).
 
@@ -45,6 +45,7 @@ The pre-training corpora of LEGAL-BERT include:
 * We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 with an initial learning rate 1e-4.
 * We were able to use a single Google Cloud TPU v3-8 provided for free from [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc), while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!
 * Part of LEGAL-BERT is a light-weight model pre-trained from scratch on legal data, which achieves comparable performance to larger models, while being much more efficient (approximately 4 times faster) with a smaller environmental footprint.
+
 ## Models list
 
 | Model name | Model Path | Training corpora |
@@ -52,9 +53,13 @@ The pre-training corpora of LEGAL-BERT include:
 | CONTRACTS-BERT-BASE | `nlpaueb/bert-base-uncased-contracts` | US contracts |
 | EURLEX-BERT-BASE | `nlpaueb/bert-base-uncased-eurlex` | EU legislation |
 | ECHR-BERT-BASE | `nlpaueb/bert-base-uncased-echr` | ECHR cases |
-| LEGAL-BERT-BASE | `nlpaueb/legal-bert-base-uncased` | All |
+| LEGAL-BERT-BASE * | `nlpaueb/legal-bert-base-uncased` | All |
 | LEGAL-BERT-SMALL | `nlpaueb/legal-bert-small-uncased` | All |
 
+\* LEGAL-BERT-BASE is the model referred to as LEGAL-BERT-SC in Chalkidis et al. (2020): a model trained from scratch on the legal corpora listed above, using a newly created vocabulary from a sentence-piece tokenizer trained on those same corpora.
+
+\*\* Since many of you expressed interest in the LEGAL-BERT-FP models (those relying on the original BERT-BASE checkpoint), they have been released on Archive.org (https://archive.org/details/legal_bert_fp); these models are secondary and are probably only of interest to those who aim to dig deeper into the open questions of Chalkidis et al. (2020).
+
 ## Load Pretrained Model
 
 ```python
@@ -97,9 +102,27 @@ model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
 
 ## Evaluation on downstream tasks
 
-Consider the experiments in the article "LEGAL-BERT: The Muppets straight out of Law School". Chalkidis et al., 2018, (https://arxiv.org/abs/2010.02559)
+Consider the experiments in the article "LEGAL-BERT: The Muppets straight out of Law School", Chalkidis et al., 2020 (https://aclanthology.org/2020.findings-emnlp.261).
 
-## Author
+## Author - Publication
+
+```
+@inproceedings{chalkidis-etal-2020-legal,
+    title = "{LEGAL}-{BERT}: The Muppets straight out of Law School",
+    author = "Chalkidis, Ilias and
+      Fergadiotis, Manos and
+      Malakasiotis, Prodromos and
+      Aletras, Nikolaos and
+      Androutsopoulos, Ion",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
+    month = nov,
+    year = "2020",
+    address = "Online",
+    publisher = "Association for Computational Linguistics",
+    doi = "10.18653/v1/2020.findings-emnlp.261",
+    pages = "2898--2904"
+}
+```
 
 Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
 
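For readers skimming this commit: the README's "Load Pretrained Model" section is only hinted at in the diff context above (its hunk header shows `model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")`), and it amounts to loading any checkpoint from the models list through the Hugging Face `transformers` auto classes. Below is a minimal sketch, assuming a recent `transformers` installation; the fill-mask probe and its example sentence are purely illustrative and are not part of the README itself.

```python
from transformers import AutoTokenizer, AutoModel, pipeline

# Load the tokenizer and encoder weights; any path from the "Models list"
# table works here (e.g. nlpaueb/legal-bert-small-uncased for the small model).
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Illustrative sanity check (not from the README): query the masked-language-modelling head.
fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")
print(fill_mask("This agreement is governed by the laws of the [MASK] of California."))
```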