metadata
language: en
pipeline_tag: fill-mask
license: cc-by-sa-4.0
tags:
  - legal
model-index:
  - name: lexlms/legal-roberta-large
    results: []
widget:
  - text: >-
      The applicant submitted that her husband was subjected to treatment
      amounting to <mask> whilst in the custody of police.
  - text: This <mask> Agreement is between General Motors and John Murray.
  - text: >-
      Establishing a system for the identification and registration of <mask>
      animals and regarding the labelling of beef and beef products.
  - text: >-
      Because the Court granted <mask> before judgment, the Court effectively
      stands in the shoes of the Court of Appeals and reviews the defendants’
      appeals.
datasets:
  - lexlms/lex_files

LexLM large

This model is the result of continued pre-training of RoBERTa large (https://huggingface.co/roberta-large) on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lex_files).
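As a minimal sketch of how the checkpoint might be queried for masked-token prediction, the snippet below uses the `fill-mask` pipeline from the Transformers library with the model id listed in this card's metadata. The helper name `top_predictions` and the default `k=5` are illustrative choices, not part of the released code.

```python
MODEL_ID = "lexlms/legal-roberta-large"  # model id as listed in this card's metadata

# Prompts copied from the widget examples of this card; each contains one <mask>.
PROMPTS = [
    "This <mask> Agreement is between General Motors and John Murray.",
    "The applicant submitted that her husband was subjected to treatment "
    "amounting to <mask> whilst in the custody of police.",
]


def top_predictions(prompt, k=5):
    """Return (token, score) pairs for the k most likely fillers of <mask>.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    # Downloads the checkpoint from the Hugging Face Hub on first use.
    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    predictions = fill_mask(prompt, top_k=k)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]
```

Calling `top_predictions(PROMPTS[0])` requires network access and enough memory for a RoBERTa-large sized model; nothing is downloaded until the function is called.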

Model description

LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:

  • We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
  • We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021).
  • We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples and a masking rate of 20% for the base model and 30% for the large model (Wettig et al., 2022).
  • We use a sentence sampler with exponential smoothing of the sub-corpora sampling rate, following Conneau et al. (2019), because the proportion of tokens differs widely across sub-corpora and we aim to preserve per-corpus capacity (i.e., avoid overfitting).
  • We consider mixed-cased models, similar to all recently developed large PLMs.
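Two of the steps above can be sketched in a few lines of plain Python: reusing embeddings for lexically overlapping tokens, and exponentially smoothing the sub-corpus sampling rates. This is an illustration under stated assumptions, not the authors' training code: the function names, the smoothing exponent `alpha=0.5`, the initialisation standard deviation of 0.02, and the toy corpus sizes are all hypothetical.

```python
import random


def smoothed_sampling_rates(token_counts, alpha=0.5):
    """Exponentially smooth sampling rates: q_i is proportional to p_i ** alpha.

    alpha=0.5 is an assumed value; alpha=1.0 recovers the raw proportions.
    """
    total = sum(token_counts.values())
    powered = {name: (n / total) ** alpha for name, n in token_counts.items()}
    norm = sum(powered.values())
    return {name: w / norm for name, w in powered.items()}


def warm_start_embeddings(old_vocab, old_emb, new_vocab, dim, seed=0):
    """Build the embedding matrix for a new vocabulary.

    Tokens present in both vocabularies copy their old row; all other
    tokens get a small random vector (std 0.02 is an assumption).
    old_vocab maps token -> row index; new_vocab lists tokens in id order.
    """
    rng = random.Random(seed)
    new_emb = []
    for token in new_vocab:
        if token in old_vocab:
            new_emb.append(list(old_emb[old_vocab[token]]))
        else:
            new_emb.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return new_emb


# Toy sub-corpus sizes in tokens (hypothetical numbers).
rates = smoothed_sampling_rates({"corpus_a": 900, "corpus_b": 90, "corpus_c": 10})

# Toy vocabularies: "the" and "court" overlap, "plaintiff" is new.
embeddings = warm_start_embeddings(
    old_vocab={"court": 0, "the": 1},
    old_emb=[[1.0, 2.0], [3.0, 4.0]],
    new_vocab=["the", "plaintiff", "court"],
    dim=2,
)
```

With smoothing, the smallest sub-corpus is sampled more often than its raw 1% share and the largest less often than its raw 90% share, while the ordering of the rates is preserved.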

Intended uses & limitations

More information needed

Training and evaluation data

The model was trained on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lex_files). For evaluation results, please see our work "LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" (Chalkidis* et al., 2023).

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: tpu
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 256
  • total_eval_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.05
  • training_steps: 1000000
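The derived values in this list follow from the per-device settings, and the schedule can be sketched from the listed type and warmup ratio. The snippet below is a rough illustration: it assumes the common linear-warmup-then-cosine-decay shape that decays to zero at the final step, which the card does not spell out, and the function name `learning_rate` is hypothetical.

```python
import math

# Values copied from the hyperparameter list above.
TRAIN_BATCH_SIZE = 8
EVAL_BATCH_SIZE = 8
NUM_DEVICES = 8
GRAD_ACCUM_STEPS = 4
PEAK_LR = 1e-4
WARMUP_RATIO = 0.05
TRAINING_STEPS = 1_000_000

# Effective batch sizes: per-device size x devices (x accumulation for training).
total_train_batch_size = TRAIN_BATCH_SIZE * NUM_DEVICES * GRAD_ACCUM_STEPS
total_eval_batch_size = EVAL_BATCH_SIZE * NUM_DEVICES
warmup_steps = round(WARMUP_RATIO * TRAINING_STEPS)


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TRAINING_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the learning rate peaks at the end of warmup (step 50,000) and reaches half of its peak value midway through the decay phase.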

Training results

| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.1322 | 0.05 | 50000 | 0.8690 |
| 1.0137 | 0.1 | 100000 | 0.8053 |
| 1.0225 | 0.15 | 150000 | 0.7951 |
| 0.9912 | 0.2 | 200000 | 0.7786 |
| 0.976 | 0.25 | 250000 | 0.7648 |
| 0.9594 | 0.3 | 300000 | 0.7550 |
| 0.9525 | 0.35 | 350000 | 0.7482 |
| 0.9152 | 0.4 | 400000 | 0.7343 |
| 0.8944 | 0.45 | 450000 | 0.7245 |
| 0.893 | 0.5 | 500000 | 0.7216 |
| 0.8997 | 1.02 | 550000 | 0.6843 |
| 0.8517 | 1.07 | 600000 | 0.6687 |
| 0.8544 | 1.12 | 650000 | 0.6624 |
| 0.8535 | 1.17 | 700000 | 0.6565 |
| 0.8064 | 1.22 | 750000 | 0.6523 |
| 0.7953 | 1.27 | 800000 | 0.6462 |
| 0.8051 | 1.32 | 850000 | 0.6386 |
| 0.8148 | 1.37 | 900000 | 0.6383 |
| 0.8004 | 1.42 | 950000 | 0.6408 |
| 0.8031 | 1.47 | 1000000 | 0.6314 |
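To put the validation losses in perspective, the short calculation below compares the first and last logged checkpoints. It assumes the reported loss is a mean cross-entropy over masked tokens measured in nats, so that its exponential can be read as a perplexity over masked positions; the card itself does not state the unit.

```python
import math

# (step, validation loss) copied from the first and last rows of the results above.
first_step, first_loss = 50_000, 0.8690
last_step, last_loss = 1_000_000, 0.6314

# Relative reduction in validation loss between the two checkpoints.
relative_drop = (first_loss - last_loss) / first_loss

# Perplexity over masked tokens, assuming the loss is cross-entropy in nats.
first_ppl = math.exp(first_loss)
last_ppl = math.exp(last_loss)

print(f"validation loss drop: {relative_drop:.1%}")
print(f"masked-token perplexity: {first_ppl:.2f} -> {last_ppl:.2f}")
```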

Framework versions

  • Transformers 4.20.0
  • PyTorch 1.12.0+cu102
  • Datasets 2.7.0
  • Tokenizers 0.12.0

Citation

Ilias Chalkidis*, Nicolas Garneau*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. 2023. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.

@inproceedings{chalkidis-garneau-etal-2023-lexlms,
    title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
    author = "Chalkidis*, Ilias and 
              Garneau*, Nicolas and
              Goanta, Catalina and 
              Katz, Daniel Martin and 
              Søgaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2305.07507",
}