---
language: en
pipeline_tag: fill-mask
tags:
- legal
license: mit
---
### InLegalBERT
Model and tokenizer files for the InLegalBERT model.
### Training Data
To build the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India.
The court cases in our dataset range from 1950 to 2019 and span a wide variety of legal domains, such as Civil, Criminal, and Constitutional law.
In total, our dataset contains around 5.4 million Indian legal documents (all in the English language).
The raw text corpus size is around 27 GB.
### Training Setup
This model is initialized with the [LEGAL-BERT-SC model](https://huggingface.co/nlpaueb/legal-bert-base-uncased) from the paper [LEGAL-BERT: The Muppets straight out of Law School](https://aclanthology.org/2020.findings-emnlp.261/). In our work, we refer to this model as LegalBERT, and our re-trained model as InLegalBERT.
We further train this model on our data for 300K steps on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
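Since the model is trained with an MLM objective, it can be queried directly through the `fill-mask` pipeline. The snippet below is a minimal sketch; the legal sentence used here is only an illustrative placeholder, not taken from the training data.

```python
from transformers import pipeline

# Fill-mask pipeline built on the model's MLM head
fill_mask = pipeline("fill-mask", model="law-ai/InLegalBERT")

# Illustrative sentence; [MASK] is BERT's mask token
predictions = fill_mask("The accused was sentenced to rigorous [MASK] for two years.")
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```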
### Model Overview
This model has the same configuration as the [bert-base-uncased model](https://huggingface.co/bert-base-uncased):
12 hidden layers, a hidden size of 768, 12 attention heads, and ~110M parameters.
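These hyper-parameters can be verified directly from the model's configuration; a minimal sketch using the standard `transformers` config API:

```python
from transformers import AutoConfig

# Inspect the architecture hyper-parameters listed above
config = AutoConfig.from_pretrained("law-ai/InLegalBERT")
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
```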
### Usage
Using the tokenizer (same as [LegalBERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased)):
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
```
Using the model to obtain embeddings/representations for a sentence:
```python
from transformers import AutoModel
model = AutoModel.from_pretrained("law-ai/InLegalBERT")
```
### Fine-tuning Results
### Citation