---
license: cc
language:
- multilingual
- de
- fr
- it
tags:
- multilingual
datasets:
- MultiLegalPile
- LEXTREME
- LEXGLUE
---
# Model Card for joelito/legal-swiss-longformer-base
This is a multilingual language model pretrained on legal data. It is based on XLM-R ([base](https://huggingface.co/xlm-roberta-base) and [large](https://huggingface.co/xlm-roberta-large)). For pretraining we used [Multi Legal Pile](https://huggingface.co/datasets/joelito/Multi_Legal_Pile) ([Niklaus et al. 2023](https://arxiv.org/abs/2306.02069?utm_source=tldrai)), a multilingual dataset from various legal sources covering 24 languages.
## Model Details
### Model Description
- **Developed by:** Joel Niklaus: [huggingface](https://huggingface.co/joelito); [email](mailto:joel.niklaus.2@bfh.ch)
- **Model type:** Transformer-based language model (Longformer)
- **Language(s) (NLP):** de, fr, it
- **License:** CC BY-SA
## Uses
### Direct Use and Downstream Use
You can use the raw model for masked language modeling; it was not trained with a next-sentence-prediction objective. Its main purpose, however, is to be fine-tuned for downstream tasks.
The model is primarily intended for fine-tuning on tasks that use the entire sentence, potentially with masked elements, to make decisions. Examples include sequence classification, token classification, and question answering. For text generation tasks, models like GPT-2 are more suitable.
Additionally, the model is specifically trained on legal data, aiming to deliver strong performance in that domain. Its performance may vary when applied to non-legal data.
### Out-of-Scope Use
For tasks such as text generation, you should look at models like GPT-2.
The model should not be used to intentionally create hostile or alienating environments for people. It was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.
## Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
## How to Get Started with the Model
See [huggingface tutorials](https://huggingface.co/learn/nlp-course/chapter7/1?fw=pt). For masked word prediction see [this tutorial](https://huggingface.co/tasks/fill-mask).
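As a starting point for masked word prediction, the following is a minimal sketch using the `fill-mask` pipeline from `transformers`. It assumes the checkpoint ships with a masked-language-modeling head and that the tokenizer exposes a mask token; the helper names `mask_word` and `predict_masked_word` are illustrative and not part of any library.

```python
from typing import Dict, List

MODEL_NAME = "joelito/legal-swiss-longformer-base"


def mask_word(sentence: str, word: str, mask_token: str) -> str:
    """Replace the first occurrence of `word` with the given mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def predict_masked_word(sentence: str, word: str, top_k: int = 5) -> List[Dict]:
    """Mask `word` in `sentence` and return the model's top_k candidate fillers."""
    # Imported lazily so that mask_word works without transformers installed.
    from transformers import pipeline

    # Downloads the weights on first use.
    unmasker = pipeline("fill-mask", model=MODEL_NAME)
    # Read the mask token from the tokenizer instead of hard-coding it.
    masked = mask_word(sentence, word, unmasker.tokenizer.mask_token)
    return unmasker(masked, top_k=top_k)
```

Calling `predict_masked_word("Das Bundesgericht weist die Beschwerde ab.", "Beschwerde")` should return a list of dictionaries, each with a candidate `token_str` and its `score`.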
## Training Details
This model was pretrained on [Multi Legal Pile](https://huggingface.co/datasets/joelito/Multi_Legal_Pile) ([Niklaus et al. 2023](https://arxiv.org/abs/2306.02069?utm_source=tldrai)).
Our pretraining procedure includes the following key steps:
(a) Warm-starting: We initialize our models from the original XLM-R checkpoints ([base](https://huggingface.co/xlm-roberta-base) and [large](https://huggingface.co/xlm-roberta-large)) of [Conneau et al. (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf) to benefit from a well-trained base.
(b) Tokenization: We train a new tokenizer of 128K BPEs to cover legal language better. However, we reuse the original XLM-R embeddings for lexically overlapping tokens and use random embeddings for the rest.
(c) Pretraining: We continue pretraining on Multi Legal Pile with batches of 512 samples for an additional 1M/500K steps for the base/large model. We use warm-up steps, a linearly increasing learning rate, and cosine decay scheduling. During the warm-up phase, only the embeddings are updated, and a higher masking rate and percentage of predictions based on masked tokens are used compared to [Devlin et al. (2019)](https://aclanthology.org/N19-1423).
(d) Sentence Sampling: We employ a sentence sampler with exponential smoothing to handle disparate token proportions across cantons and languages, preserving per-canton and language capacity.
(e) Mixed Cased Models: Our models cover both upper- and lowercase letters, similar to recently developed large PLMs.
(f) Long Context Training: To account for long contexts in legal documents, we train the base-size multilingual model on long contexts with windowed attention. This variant, named Legal-Swiss-LF-base, uses a 15% masking probability, increased learning rate, and similar settings to small-context models.
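Steps (b) and (d) can be illustrated with a small, dependency-free sketch. It is not the training code: the Gaussian initialization with `std=0.02` and the smoothing exponent `alpha=0.3` are assumed values chosen for illustration, and both function names are hypothetical.

```python
import random
from typing import Dict, List


def warm_start_embeddings(
    old_vocab: Dict[str, int],
    old_emb: List[List[float]],
    new_vocab: Dict[str, int],
    dim: int,
    std: float = 0.02,  # assumed init scale, not taken from the model card
    seed: int = 0,
) -> List[List[float]]:
    """Copy embeddings for tokens shared with the old vocabulary; randomly initialize the rest."""
    rng = random.Random(seed)
    new_emb: List[List[float]] = [[] for _ in range(len(new_vocab))]
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Lexically overlapping token: reuse the pretrained vector.
            new_emb[new_id] = list(old_emb[old_id])
        else:
            # New token: draw a random vector.
            new_emb[new_id] = [rng.gauss(0.0, std) for _ in range(dim)]
    return new_emb


def smoothed_sampling_probs(
    token_counts: Dict[str, int],
    alpha: float = 0.3,  # assumed exponent; alpha < 1 upweights small groups
) -> Dict[str, float]:
    """Turn raw token counts per group into exponentially smoothed sampling probabilities."""
    total = sum(token_counts.values())
    weights = {k: (c / total) ** alpha for k, c in token_counts.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}
```

With `alpha` below 1, a group holding 10% of the tokens is sampled more often than 10% of the time, which is how smoothing preserves capacity for smaller cantons and languages.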
### Training Data
This model was pretrained on [Multi Legal Pile](https://huggingface.co/datasets/joelito/Multi_Legal_Pile) ([Niklaus et al. 2023](https://arxiv.org/abs/2306.02069?utm_source=tldrai)).
#### Preprocessing
For further details see [Niklaus et al. 2023](https://arxiv.org/abs/2306.02069?utm_source=tldrai)
#### Training Hyperparameters
- Batch size: 512 samples
- Number of steps: 1M/500K for the base/large model
- Warm-up steps: the first 5% of the total training steps
- Learning rate: (linearly increasing up to) 1e-4
- Word masking: increased 20/30% masking rate for base/large models respectively
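The learning-rate schedule implied by these settings can be sketched as follows: a linear warm-up over the first 5% of steps up to 1e-4, followed by cosine decay. The decay floor of zero and the function name `learning_rate` are assumptions; the model card does not state the final learning rate.

```python
import math


def learning_rate(
    step: int,
    total_steps: int = 1_000_000,   # base model; the large model uses 500K
    peak_lr: float = 1e-4,
    warmup_fraction: float = 0.05,
) -> float:
    """Linear warm-up to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear increase from 0 to peak_lr during warm-up.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For the base configuration, warm-up ends at step 50,000, where the rate reaches its peak of 1e-4.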
## Evaluation
For performance on downstream tasks, such as [LEXTREME](https://huggingface.co/datasets/joelito/lextreme) ([Niklaus et al. 2023](https://arxiv.org/abs/2301.13126)) or [LEXGLUE](https://huggingface.co/datasets/lex_glue) ([Chalkidis et al. 2021](https://arxiv.org/abs/2110.00976)), we refer to the results presented in Niklaus et al. (2023) [1](https://arxiv.org/abs/2306.02069), [2](https://arxiv.org/abs/2306.09237).
### Model Architecture and Objective
It is a Longformer model built on a RoBERTa-style (XLM-R) encoder. Run the following code to view the architecture:
```
from transformers import AutoModel

# Load the encoder (downloads the weights on first use) and print its module tree.
model = AutoModel.from_pretrained('joelito/legal-swiss-longformer-base')
print(model)

# Output:
# LongformerModel(
#   (embeddings): LongformerEmbeddings(
#     (word_embeddings): Embedding(128000, 768, padding_idx=0)
#     (position_embeddings): Embedding(4098, 768, padding_idx=0)
#     (token_type_embeddings): Embedding(1, 768)
#     (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
#     (dropout): Dropout(p=0.1, inplace=False)
#   )
#   (encoder): LongformerEncoder(
#     (layer): ModuleList(
#       (0-11): 12 x LongformerLayer(
#         (attention): LongformerAttention(
#           (self): LongformerSelfAttention(
#             (query): Linear(in_features=768, out_features=768, bias=True)
#             (key): Linear(in_features=768, out_features=768, bias=True)
#             (value): Linear(in_features=768, out_features=768, bias=True)
#             (query_global): Linear(in_features=768, out_features=768, bias=True)
#             (key_global): Linear(in_features=768, out_features=768, bias=True)
#             (value_global): Linear(in_features=768, out_features=768, bias=True)
#           )
#           (output): LongformerSelfOutput(
#             (dense): Linear(in_features=768, out_features=768, bias=True)
#             (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
#             (dropout): Dropout(p=0.1, inplace=False)
#           )
#         )
#         (intermediate): LongformerIntermediate(
#           (dense): Linear(in_features=768, out_features=3072, bias=True)
#           (intermediate_act_fn): GELUActivation()
#         )
#         (output): LongformerOutput(
#           (dense): Linear(in_features=3072, out_features=768, bias=True)
#           (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
#           (dropout): Dropout(p=0.1, inplace=False)
#         )
#       )
#     )
#   )
#   (pooler): LongformerPooler(
#     (dense): Linear(in_features=768, out_features=768, bias=True)
#     (activation): Tanh()
#   )
# )
```
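The printed module tree is enough to estimate the size of the encoder. The sketch below counts parameters by hand from the shapes shown above. It assumes every `Linear` layer has a bias and every `LayerNorm` has a weight and a bias, as the printout indicates, and it covers only the modules listed there, so a task-specific head is not included.

```python
def linear(in_features: int, out_features: int) -> int:
    """Parameters of a Linear layer with bias: weight matrix plus bias vector."""
    return in_features * out_features + out_features


def layer_norm(dim: int) -> int:
    """Parameters of an affine LayerNorm: one weight and one bias per feature."""
    return 2 * dim


# Shapes read off the printed architecture.
hidden, ffn, vocab, positions, layers = 768, 3072, 128_000, 4098, 12

# Word, position, and token-type embeddings plus the embedding LayerNorm.
embeddings = vocab * hidden + positions * hidden + 1 * hidden + layer_norm(hidden)

# Six projections (local and global query/key/value), the output projection, and a LayerNorm.
attention = 6 * linear(hidden, hidden) + linear(hidden, hidden) + layer_norm(hidden)

# Feed-forward block: expand to 3072, project back to 768, then a LayerNorm.
feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm(hidden)

per_layer = attention + feed_forward
pooler = linear(hidden, hidden)
total = embeddings + layers * per_layer + pooler

print(f"embeddings: {embeddings:,}")
print(f"per layer:  {per_layer:,}")
print(f"total:      {total:,}")
```

Under these assumptions, the word embeddings account for close to half of the encoder's parameters, a direct consequence of the 128K-token vocabulary.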
### Compute Infrastructure
Google TPU.
#### Hardware
Google TPU v3-8
#### Software
PyTorch, Transformers.
## Citation
```
@misc{rasiah2023scale,
title={SCALE: Scaling up the Complexity for Advanced Language Model Evaluation},
author={Vishvaksenan Rasiah and Ronja Stern and Veton Matoshi and Matthias Stürmer and Ilias Chalkidis and Daniel E. Ho and Joel Niklaus},
year={2023},
eprint={2306.09237},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{Niklaus2023MultiLegalPileA6,
title={MultiLegalPile: A 689GB Multilingual Legal Corpus},
author={Joel Niklaus and Veton Matoshi and Matthias Stürmer and Ilias Chalkidis and Daniel E. Ho},
journal={ArXiv},
year={2023},
volume={abs/2306.02069}
}
```
## Model Card Authors
Joel Niklaus: [huggingface](https://huggingface.co/joelito); [email](mailto:joel.niklaus.2@bfh.ch)
Veton Matoshi: [huggingface](https://huggingface.co/kapllan); [email](mailto:msv3@bfh.ch)
## Model Card Contact
Joel Niklaus: [huggingface](https://huggingface.co/joelito); [email](mailto:joel.niklaus.2@bfh.ch)
Veton Matoshi: [huggingface](https://huggingface.co/kapllan); [email](mailto:msv3@bfh.ch)