
LegalLexRoBERTa

RoBERTa base (https://huggingface.co/roberta-base) with continued pretraining on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lex_files).
We omitted some sub-corpora because they do not occur in our downstream tasks.
Datasets used: eu-court-cases, ecthr_cases, eu-legislation, indian_courts_cases, us_contracts, us-court-cases, us_legislation
Datasets not used: uk_courts_cases, uk_legislation, canadian_legislation, canadian_court_cases
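
The model can be used directly for masked-token prediction. A minimal usage sketch with the transformers fill-mask pipeline, assuming the repository id shown on this page; the example sentence is purely illustrative:

```python
from transformers import pipeline

# Load the model from the Hub (repo id as it appears on this model page).
fill_mask = pipeline("fill-mask", model="danielsbest/LegalLexRoBERTa")

# RoBERTa-style models use <mask> as the mask token.
predictions = fill_mask("The court dismissed the <mask> for lack of jurisdiction.")
for p in predictions:
    print(p["token_str"], p["score"])
```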

Training details

Weights and tokenizer initialized from RoBERTa base.
Trained for 128,463 steps (1 epoch) on a sample of the LeXFiles corpus.

Hyperparameters

  • learning rate: 5e-5
  • batch size per device: 80
  • total batch size: 80 × 4 = 320
  • seed: 0
  • optimizer: adamw(b1=0.9, b2=0.98, eps=1e-6, weight_decay=0.01)
  • lr_scheduler_type: linear
  • warmup steps: 6% of training steps
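
For reference, a minimal Optax sketch of this optimizer and schedule configuration; it illustrates the values listed above and is not the exact training code:

```python
import optax

total_steps = 128_463
warmup_steps = int(0.06 * total_steps)  # 6% of training steps
peak_lr = 5e-5

# Linear warmup to the peak learning rate, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.98,
    eps=1e-6,
    weight_decay=0.01,
)
```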

Masking

Standard RoBERTa masking with 15% MLM probability, of which:

  • 80% are replaced with the <mask> token
  • 10% are replaced with a random token from the RoBERTa tokenizer
  • 10% are left unchanged
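
This is the standard dynamic masking scheme from the original RoBERTa recipe; in transformers it corresponds to DataCollatorForLanguageModeling with mlm_probability=0.15. A minimal sketch (the example sentence is illustrative; the actual Flax training loop may have used a different but equivalent collator):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# 15% of tokens are selected for prediction; of those, 80% are replaced
# by <mask>, 10% by a random token, and 10% are left unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

batch = collator([tokenizer("The defendant appealed the ruling.")])
print(batch["input_ids"])
print(batch["labels"])  # -100 everywhere except the selected tokens
```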

Sampling strategy

We used alpha-smoothed sampling ratios as computed in https://github.com/coastalcph/lexlms/blob/main/compute_sampling_ratio.py.
The total number of tokens in the sampled dataset was set as our target. We then computed the rounded percentage of each individual dataset that needed to be sampled. Sampling was done by taking documents from the beginning of each dataset without randomization; if the required percentage exceeded 100%, the dataset was reused (repeated) to meet the sampling goal.
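
A sketch of the exponent-smoothed (alpha) sampling idea behind the linked script; the sub-corpus token counts and the alpha value below are illustrative placeholders, not the actual statistics used:

```python
# Illustrative token counts per sub-corpus (placeholders, not the real statistics).
token_counts = {
    "eu_court_cases": 1_000_000_000,
    "us_court_cases": 5_000_000_000,
    "us_contracts": 500_000_000,
}
alpha = 0.5  # smoothing exponent; the exact value follows the linked script

total = sum(token_counts.values())

# Raw proportions q_i and smoothed sampling probabilities p_i ∝ q_i^alpha.
q = {name: n / total for name, n in token_counts.items()}
norm = sum(q_i ** alpha for q_i in q.values())
p = {name: (q_i ** alpha) / norm for name, q_i in q.items()}

# Keep the overall token budget fixed; small corpora get a share above 100%
# of their own size and are therefore reused (repeated) during sampling.
for name, n in token_counts.items():
    target_tokens = p[name] * total
    print(f"{name}: sample {100 * target_tokens / n:.1f}% of its tokens")
```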

Training Results

After 1 epoch (128,463 training steps):
Loss: 0.6201, Accuracy: 0.8531

Framework versions

  • Transformers 4.41.1
  • Flax 0.8.3
  • Optax 0.2.2
  • Jax 0.4.28
  • Datasets 2.14.6
  • fsspec 2023.10.0

Attribution to LeXFiles

Original paper with data corpus:
"LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" by Chalkidis*, Ilias and Garneau*, Nicolas and Goanta, Catalina and Katz, Daniel Martin and Søgaard, Anders, Source (https://arxiv.org/abs/2305.07507), used under CC BY-NC-SA 4.0 (https://spdx.org/licenses/CC-BY-NC-SA-4.0).

Training stats on TPU v4-8

Available JAX devices: 4
Average training batch time: 0.89 s
Average eval batch time: 0.57 s
Total time: < 35 h (128,463 steps × ~0.89 s/step ≈ 32 h)
