kiddothe2b committed
Commit 61b3676
1 Parent(s): c2c73de

Update README.md

Files changed (1)
  1. README.md +50 -1
README.md CHANGED
@@ -1,3 +1,52 @@
  ---
- license: cc-by-nc-sa-4.0
+ language: en
+ pipeline_tag: fill-mask
+ license: cc-by-sa-4.0
+ tags:
+ - legal
+ - long-documents
+ model-index:
+ - name: lexlms/legal-longformer-large
+   results: []
+ widget:
+ - text: "The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of police."
+ datasets:
+ - lexlms/lex_files
  ---
+
+ # Legal Longformer (large)
+
+ This is a derivative model based on the [LexLM (large)](https://huggingface.co/lexlms/legal-roberta-large) RoBERTa model.
+ All model parameters were cloned from the original model, while the positional embeddings were extended by cloning the original embeddings multiple times, following [Beltagy et al. (2020)](https://arxiv.org/abs/2004.05150), using a Python script similar to [this one](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb).
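For reference, the position-embedding extension described above can be sketched roughly as follows. This is a minimal illustration rather than the exact conversion script: the 4096-token target length, the output directory, and the buffer refreshes are assumptions, and the linked notebook additionally replaces full self-attention with Longformer's sliding-window attention, which is omitted here.

```python
# Rough sketch of Longformer-style position-embedding tiling (Beltagy et al., 2020).
# Source checkpoint is LexLM (large) from this card; target length and paths are assumed.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

SRC = "lexlms/legal-roberta-large"  # LexLM (large)
TARGET_LEN = 4096                   # assumed target context length

model = AutoModelForMaskedLM.from_pretrained(SRC)
tokenizer = AutoTokenizer.from_pretrained(SRC, model_max_length=TARGET_LEN)

emb = model.roberta.embeddings
old_pos = emb.position_embeddings.weight.data  # (514, hidden): 512 positions + 2 offset rows
hidden = old_pos.size(1)
new_num = TARGET_LEN + 2                       # RoBERTa reserves the first two position ids

new_pos = torch.nn.Embedding(new_num, hidden, padding_idx=emb.position_embeddings.padding_idx)
new_pos.weight.data[:2] = old_pos[:2]          # keep the special offset rows
block = old_pos[2:]                            # the 512 learned positions
i = 2
while i < new_num:                             # clone the learned positions until the table is full
    n = min(block.size(0), new_num - i)
    new_pos.weight.data[i:i + n] = block[:n]
    i += n

emb.position_embeddings = new_pos
# Refresh the registered buffers so long inputs do not trip over the old 514-position shapes.
emb.register_buffer("position_ids", torch.arange(new_num).unsqueeze(0), persistent=False)
emb.register_buffer("token_type_ids", torch.zeros((1, new_num), dtype=torch.long), persistent=False)
model.config.max_position_embeddings = new_num

model.save_pretrained("legal-longformer-large-init")   # hypothetical output directory
tokenizer.save_pretrained("legal-longformer-large-init")
```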
+
+ ## Model description
+
+ LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:
+ * We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
+ * We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021).
+ * We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples, and a 20%/30% masking rate (Wettig et al., 2022) for the base/large models, respectively.
+ * We use a sentence sampler with exponential smoothing of the sub-corpora sampling rates, following Conneau et al. (2019), since there is a disparate proportion of tokens across sub-corpora and we aim to preserve per-corpus capacity (avoid overfitting); see the sketch after this list.
+ * We consider mixed-cased models, similar to all recently developed large PLMs.
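As a rough illustration of that sampling scheme (not the actual training code), the exponentially smoothed sampling rate of Conneau et al. (2019) can be computed as below. The sub-corpus names, token counts, and the smoothing exponent `alpha` are placeholders, not the real LeXFiles statistics or the hyper-parameters used for LexLM.

```python
# Exponentially smoothed sub-corpus sampling rates (Conneau et al., 2019):
#   q_i = p_i**alpha / sum_j p_j**alpha, where p_i is the token share of sub-corpus i.
# Smaller alpha flattens the distribution: large sub-corpora are down-sampled and
# small ones are up-sampled, which helps preserve per-corpus capacity.

def smoothed_rates(token_counts: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    total = sum(token_counts.values())
    shares = {name: count / total for name, count in token_counts.items()}
    weights = {name: share ** alpha for name, share in shares.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}

# Placeholder token counts for three hypothetical sub-corpora.
example = {
    "us-case-law": 10_000_000_000,
    "eu-legislation": 2_000_000_000,
    "uk-legislation": 500_000_000,
}
print(smoothed_rates(example, alpha=0.5))  # the largest corpus gets a smaller share than its raw 80%
```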
+
+
+ ### Citation
+
+ [*Ilias Chalkidis\*, Nicolas Garneau\*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard.*
+ *LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development.*
+ *2023. In the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.*](https://arxiv.org/abs/2305.07507)
+ ```
+ @inproceedings{chalkidis-garneau-etal-2023-lexlms,
+     title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
+     author = "Chalkidis*, Ilias and
+               Garneau*, Nicolas and
+               Goanta, Catalina and
+               Katz, Daniel Martin and
+               Søgaard, Anders",
+     booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
+     month = jun,
+     year = "2023",
+     address = "Toronto, Canada",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/2305.07507",
+ }
+ ```