Building on the recent work of Chalkidis, Garneau, et al., "LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development", we release legal NLP resources to broaden legal NLP research and to help practitioners who aim to build assistive legal NLP technologies.
As of May 2023, we have released:
- LeXFiles (https://huggingface.co/datasets/lexlms/lex_files), a new diverse English legal corpus including 11 sub-corpora that cover legislation and case law from 6 primarily English-speaking legal systems (EU, CoE, Canada, US, UK, India). The corpus comprises approx. 6 million documents which sum up to approx. 19 billion tokens.
- LegalLAMA (https://huggingface.co/datasets/lexlms/legal_lama), a diverse probing benchmark suite comprising 8 sub-tasks that aims to assess how much legal knowledge PLMs acquire during pre-training.
- 2 new legal-oriented PLMs, dubbed LexLMs (https://huggingface.co/models?search=lexlms/legal-roberta), warm-started from the RoBERTa models and further pre-trained on LeXFiles for 1M additional steps.
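
The resources above can be accessed with the Hugging Face `datasets` and `transformers` libraries. The following is a minimal sketch, not an official loader. The repository names are taken from the URLs above, but the `-base` / `-large` model suffixes, the helper names (`model_id`, `peek_corpus`, `probe`), and the `config_name` and `split` arguments are assumptions made for illustration; check the dataset and model cards for the actual sub-corpus and model names.

```python
"""Sketch: stream a few LeXFiles documents and probe a LexLM with a masked token."""
from itertools import islice

LEXFILES_REPO = "lexlms/lex_files"     # from the dataset URL above
LEGALLAMA_REPO = "lexlms/legal_lama"   # from the dataset URL above
MODEL_PREFIX = "lexlms/legal-roberta"  # from the model search URL above


def model_id(size: str = "base") -> str:
    """Build a model repo ID. The '-base' / '-large' suffixes are an assumption."""
    if size not in ("base", "large"):
        raise ValueError(f"unknown size: {size!r}")
    return f"{MODEL_PREFIX}-{size}"


def peek_corpus(config_name: str, split: str = "train", n: int = 3) -> list:
    """Stream the first n records of one sub-corpus without a full download.

    config_name is a placeholder: pass one of the sub-corpus names
    listed on the dataset card.
    """
    from datasets import load_dataset  # lazy import; requires `pip install datasets`

    stream = load_dataset(LEXFILES_REPO, config_name, split=split, streaming=True)
    return list(islice(stream, n))


def probe(template: str, size: str = "base", top_k: int = 5) -> list:
    """Fill a single [MASK] slot in template and return (token, score) pairs."""
    from transformers import pipeline  # lazy import; requires `pip install transformers`

    fill = pipeline("fill-mask", model=model_id(size))
    # Swap the generic [MASK] marker for the tokenizer's own mask token.
    text = template.replace("[MASK]", fill.tokenizer.mask_token)
    return [(p["token_str"].strip(), p["score"]) for p in fill(text, top_k=top_k)]
```

For example, `probe("The defendant was found [MASK] of the charges.")` returns the model's top candidate tokens with their scores, which is the kind of cloze-style query a probing benchmark such as LegalLAMA is built around. Depending on the installed `datasets` version, loading the corpus may require additional arguments.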