---
language:
- es
license: mit
widget:
- text: "La Constitución española de 1978 es la <mask> suprema del ordenamiento jurídico español."
tags:
- Long documents
- longformer
- robertalex
- spanish
- legal

---

# Legal ⚖️ longformer-base-4096-spanish

## [Longformer](https://arxiv.org/abs/2004.05150) is a Transformer model for long documents. 

`legal-longformer-base-4096` is a BERT-like model initialized from a RoBERTa checkpoint (**[RoBERTalex](https://huggingface.co/PlanTL-GOB-ES/RoBERTalex)** in this case) and pre-trained for *MLM* on long documents from the [Spanish Legal Domain Corpora](https://zenodo.org/record/5495529/#.Y205lpHMKV5). It supports sequences of up to **4,096** tokens!
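
Below is a minimal usage sketch for masked-token prediction with the `transformers` fill-mask pipeline, using the widget sentence from this card. It assumes the checkpoint loads through the standard pipeline API.

```python
from transformers import pipeline

# Minimal fill-mask sketch; assumes the checkpoint works with the standard pipeline API.
fill_mask = pipeline(
    "fill-mask",
    model="Narrativa/legal-longformer-base-4096-spanish",
)

text = (
    "La Constitución española de 1978 es la <mask> suprema "
    "del ordenamiento jurídico español."
)

# Print the top predictions for the masked token.
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 4))
```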

**Longformer** uses a combination of sliding-window (*local*) attention and *global* attention. Global attention is user-configured based on the task, allowing the model to learn task-specific representations.
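
As a rough illustration of how global attention can be configured (assuming the checkpoint is registered as a Longformer architecture and exposes the standard `global_attention_mask` argument from `transformers`; the choice of global positions below is illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "Narrativa/legal-longformer-base-4096-spanish"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = (
    "La Constitución española de 1978 es la <mask> suprema "
    "del ordenamiento jurídico español."
)
inputs = tokenizer(text, return_tensors="pt")

# All tokens use sliding-window (local) attention by default;
# setting a position to 1 gives it global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # e.g. let the <s> token attend globally

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```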


This model follows the research of [Iz Beltagy, Matthew E. Peters and Arman Cohan](https://arxiv.org/abs/2004.05150).

## Model (base checkpoint)
[RoBERTalex](https://huggingface.co/PlanTL-GOB-ES/RoBERTalex)

Few models have been trained for the Spanish language, and some of them were trained on low-resource, unclean corpora. The models derived from the Spanish National Plan for Language Technologies are proficient at several tasks and were trained on large-scale, clean corpora. However, Spanish legal language can be regarded as a domain-specific language in its own right. We therefore created a Spanish legal model from scratch, trained exclusively on legal corpora.

## Dataset
[Spanish Legal Domain Corpora](https://zenodo.org/record/5495529)

A collection of corpora from the Spanish legal domain.

More legal domain resources: https://github.com/PlanTL-GOB-ES/lm-legal-es

## Citation
If you want to cite this model, you can use the following:

```bibtex
@misc{narrativa2022legal-longformer-base-4096-spanish,
  title={Legal Spanish LongFormer by Narrativa},
  author={Romero, Manuel},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/Narrativa/legal-longformer-base-4096-spanish}},
  year={2022}
}
```