
LegalLexRoBERTa

RoBERTa base (https://huggingface.co/roberta-base) with continued pretraining on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lex_files).
We omitted some sub-corpora because they do not occur in our downstream tasks.
Datasets used: eu-court-cases, ecthr_cases, eu-legislation, indian_courts_cases, us_contracts, us-court-cases, us_legislation
Datasets not used: uk_courts_cases, uk_legislation, canadian_legislation, canadian_court_cases
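
The model can be used directly for masked-token prediction. A minimal usage sketch with the transformers fill-mask pipeline, assuming the repository id shown on this page; the example sentence is purely illustrative:

```python
from transformers import pipeline

# Load the model from the Hub (repo id as it appears on this model page).
fill_mask = pipeline("fill-mask", model="danielsbest/LegalLexRoBERTa")

# RoBERTa-style models use <mask> as the mask token.
predictions = fill_mask("The court dismissed the <mask> for lack of jurisdiction.")
for p in predictions:
    print(p["token_str"], p["score"])
```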

Training details

Weights and tokenizer initialized from RoBERTa base.
Trained for 128,463 steps (1 epoch) on a sample of the LeXFiles corpus.

Hyperparameters

  • learning rate: 5e-5
  • batch size per device: 80
  • total batch size: 80 × 4 = 320
  • seed: 0
  • optimizer: adamw(b1=0.9, b2=0.98, eps=1e-6, weight_decay=0.01)
  • lr_scheduler_type: linear
  • warmup steps: 6% of training steps
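
For reference, a minimal Optax sketch of this optimizer and schedule configuration; it illustrates the values listed above and is not the exact training code:

```python
import optax

total_steps = 128_463
warmup_steps = int(0.06 * total_steps)  # 6% of training steps
peak_lr = 5e-5

# Linear warmup to the peak learning rate, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.98,
    eps=1e-6,
    weight_decay=0.01,
)
```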

Masking

Standard RoBERTa masking with 15% MLM probability, of which:

  • 80% are replaced with the <mask> token
  • 10% are replaced with a random token from the RoBERTa tokenizer
  • 10% are left unchanged
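
This is the standard dynamic masking scheme from the original RoBERTa recipe; in transformers it corresponds to DataCollatorForLanguageModeling with mlm_probability=0.15. A minimal sketch (the example sentence is illustrative; the actual Flax training loop may have used a different but equivalent collator):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# 15% of tokens are selected for prediction; of those, 80% are replaced
# by <mask>, 10% by a random token, and 10% are left unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

batch = collator([tokenizer("The defendant appealed the ruling.")])
print(batch["input_ids"])
print(batch["labels"])  # -100 everywhere except the selected tokens
```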

Sampling strategy

We used alpha-smoothed sampling ratios as computed in https://github.com/coastalcph/lexlms/blob/main/compute_sampling_ratio.py.
The total number of tokens in the sampled dataset was set as our target. We then computed the rounded percentage of each individual dataset that needed to be sampled. Sampling was done by taking documents from the beginning of each dataset without randomization; if the required percentage exceeded 100%, the dataset was reused (repeated) to meet the sampling goal.
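
A sketch of the exponent-smoothed (alpha) sampling idea behind the linked script; the sub-corpus token counts and the alpha value below are illustrative placeholders, not the actual statistics used:

```python
# Illustrative token counts per sub-corpus (placeholders, not the real statistics).
token_counts = {
    "eu_court_cases": 1_000_000_000,
    "us_court_cases": 5_000_000_000,
    "us_contracts": 500_000_000,
}
alpha = 0.5  # smoothing exponent; the exact value follows the linked script

total = sum(token_counts.values())

# Raw proportions q_i and smoothed sampling probabilities p_i ∝ q_i^alpha.
q = {name: n / total for name, n in token_counts.items()}
norm = sum(q_i ** alpha for q_i in q.values())
p = {name: (q_i ** alpha) / norm for name, q_i in q.items()}

# Keep the overall token budget fixed; small corpora get a share above 100%
# of their own size and are therefore reused (repeated) during sampling.
for name, n in token_counts.items():
    target_tokens = p[name] * total
    print(f"{name}: sample {100 * target_tokens / n:.1f}% of its tokens")
```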

Training Results

After 1 epoch (128,463 training steps):
Loss: 0.6201, Accuracy: 0.8531

Framework versions

  • Transformers 4.41.1
  • Flax 0.8.3
  • Optax 0.2.2
  • Jax 0.4.28
  • Datasets 2.14.6
  • fsspec 2023.10.0

Attribution to LeXFiles

Original paper with data corpus:
"LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" by Chalkidis*, Ilias and Garneau*, Nicolas and Goanta, Catalina and Katz, Daniel Martin and Søgaard, Anders, Source (https://arxiv.org/abs/2305.07507), used under CC BY-NC-SA 4.0 (https://spdx.org/licenses/CC-BY-NC-SA-4.0).

Training stats on TPU v4-8

Available JAX devices: 4
Average training batch time: 0.89 s
Average eval batch time: 0.57 s
Total time: < 35 h (128,463 steps × ~0.89 s/step ≈ 32 h)
