kiddothe2b committed on
Commit e9fc905
1 Parent(s): 6e3e9a9

Update README.md

Files changed (1)
1. README.md +18 -8
README.md CHANGED
@@ -1,23 +1,33 @@
---
tags:
- - generated_from_trainer
model-index:
- - name: roberta-base-cased
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

- # roberta-base-cased

- This model was trained from scratch on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.7516

## Model description

- More information needed

## Intended uses & limitations

@@ -25,7 +35,7 @@ More information needed

## Training and evaluation data

- More information needed

## Training procedure

 
---
+ language: en
+ pipeline_tag: fill-mask
+ license: cc-by-sa-4.0
tags:
+ - legal
model-index:
+ - name: lexlms/roberta-base
  results: []
+ widget:
+ - text: "The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of police."
+ datasets:
+ - lexlms/lexfiles
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

+ # LexLM base

+ This model was derived from RoBERTa base (https://huggingface.co/roberta-base) by continued pre-training on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles).
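
Since the card tags the model as `fill-mask` and ships the widget example above, a minimal usage sketch may help. It is an illustration only: the model id `lexlms/roberta-base` is taken from the `model-index` entry above and may need to be adjusted if the repository is published under a different name.

```python
from transformers import pipeline

# Minimal fill-mask sketch; the model id follows the card's model-index entry
# and may differ from the actual repository name.
fill = pipeline("fill-mask", model="lexlms/roberta-base")

text = (
    "The applicant submitted that her husband was subjected to treatment "
    f"amounting to {fill.tokenizer.mask_token} whilst in the custody of police."
)
for pred in fill(text, top_k=5):
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```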
 
 
## Model description

+ LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:
+ * We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
+ * We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021); a sketch of this embedding reuse follows the list.
+ * We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples, and a 20%/30% masking rate for the base/large models, respectively (Wettig et al., 2022).
+ * We use a sentence sampler with exponential smoothing of the sub-corpora sampling rates, following Conneau et al. (2019), since the proportion of tokens across sub-corpora is highly uneven and we aim to preserve per-corpus capacity (i.e., to avoid overfitting); see the second sketch after this list.
+ * We consider mixed-case models, similar to all recently developed large PLMs.
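
The embedding reuse in the second bullet can be illustrated with a short sketch in the spirit of Pfeiffer et al. (2021). This is not the actual training code: `"new-legal-tokenizer"` is a hypothetical placeholder for the newly trained 50k-BPE tokenizer, and only `roberta-base` is a real checkpoint.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Sketch of warm-starting embeddings for lexically overlapping tokens.
# "new-legal-tokenizer" is a placeholder for the new 50k-BPE tokenizer,
# not a real repository id.
old_tok = AutoTokenizer.from_pretrained("roberta-base")
new_tok = AutoTokenizer.from_pretrained("new-legal-tokenizer")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

old_emb = model.get_input_embeddings().weight.data
hidden = old_emb.size(1)
# Randomly initialize the new embedding matrix, then copy rows for tokens
# that also exist in the original RoBERTa vocabulary.
new_emb = torch.normal(mean=0.0, std=0.02, size=(len(new_tok), hidden))

old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    if token in old_vocab:  # lexically overlapping token
        new_emb[new_id] = old_emb[old_vocab[token]]

model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```
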
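The exponentially smoothed sub-corpus sampling rates mentioned in the fourth bullet can be sketched as follows; the token counts and the smoothing exponent are illustrative values, not the statistics or hyperparameters used for training.

```python
import numpy as np

# Illustrative sub-corpus token counts (not the real LeXFiles statistics).
token_counts = {"eu-legislation": 1.2e9, "us-case-law": 9.8e9, "uk-legislation": 0.4e9}
alpha = 0.5  # smoothing exponent (assumed; Conneau et al. (2019) use values < 1)

p = np.array(list(token_counts.values()))
p = p / p.sum()   # raw token proportions
q = p ** alpha
q = q / q.sum()   # exponentially smoothed sampling rates (up-weights smaller corpora)

for name, raw, smooth in zip(token_counts, p, q):
    print(f"{name:16s} raw={raw:.3f} sampled={smooth:.3f}")
```
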
 
## Intended uses & limitations

## Training and evaluation data

+ The model was trained on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles). For evaluation results, please refer to our work "LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" (Chalkidis* et al., 2023).

## Training procedure