law-ai committed
Commit 86d70c1
1 Parent(s): f70b87f

Update README.md

Files changed (1): README.md (+27 -9)

README.md CHANGED
@@ -10,15 +10,17 @@ Model and tokenizer files for the InCaseLawBERT model.
 
 ### Training Data
 For building the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India.
- These documents were collected from diverse publicly available sources on the Web, such as official websites of these courts (e.g., [the website of the Indian Supreme Court](https://main.sci.gov.in/)), the erstwhile website of the Legal Information Institute of India,
- the popular legal repository [IndianKanoon](https://www.indiankanoon.org), and so on.
 The court cases in our dataset range from 1950 to 2019, and belong to all legal domains, such as Civil, Criminal, Constitutional, and so on.
- Additionally, we collected 1,113 Central Government Acts, which are the documents codifying the laws of the country. Each Act is a collection of related laws, called Sections. These 1,113 Acts contain a total of 32,021 Sections.
 In total, our dataset contains around 5.4 million Indian legal documents (all in the English language).
 The raw text corpus size is around 27 GB.
 
 ### Training Objective
 This model is initialized with the [Legal-BERT model](https://huggingface.co/zlucia/legalbert) from the paper [When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings](https://dl.acm.org/doi/abs/10.1145/3462757.3466088). In our work, we refer to this model as CaseLawBERT, and our re-trained model as InCaseLawBERT.
+ We further train this model on our data for 300K steps on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
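+ As a minimal sketch (our illustration, not part of the released training code), the checkpoint can be loaded with both pre-training heads attached; `BertForPreTraining` combines the MLM and NSP objectives:
+ ```python
+ from transformers import AutoTokenizer, BertForPreTraining
+ 
+ tokenizer = AutoTokenizer.from_pretrained("law-ai/InCaseLawBERT")
+ model = BertForPreTraining.from_pretrained("law-ai/InCaseLawBERT")
+ 
+ # Toy sentence pair for illustration; real pre-training masks tokens at
+ # random and samples negative "next sentences" for the NSP objective.
+ inputs = tokenizer("The appellant filed a suit.", "The suit was dismissed.", return_tensors="pt")
+ outputs = model(**inputs)
+ print(outputs.prediction_logits.shape)        # MLM head: (1, seq_len, vocab_size)
+ print(outputs.seq_relationship_logits.shape)  # NSP head: (1, 2)
+ ```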
+ 
+ ### Model Overview
+ This model has the same configuration as the [bert-base-uncased model](https://huggingface.co/bert-base-uncased):
+ 12 hidden layers, 768 hidden dimensionality, 12 attention heads, ~110M parameters.
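+ As a quick check (a sketch we add for illustration, assuming the standard `transformers` Auto classes), the configuration and parameter count can be inspected directly:
+ ```python
+ from transformers import AutoConfig, AutoModel
+ 
+ config = AutoConfig.from_pretrained("law-ai/InCaseLawBERT")
+ print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 12 768 12
+ 
+ model = AutoModel.from_pretrained("law-ai/InCaseLawBERT")
+ print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
+ ```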
 
 ### Usage
 Using the tokenizer (same as [CaseLawBERT](https://huggingface.co/zlucia/legalbert))
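+ (The tokenizer-loading block itself is unchanged by this commit and elided from the diff; a sketch of the usual call, assuming the standard Auto API:)
+ ```python
+ from transformers import AutoTokenizer
+ 
+ tokenizer = AutoTokenizer.from_pretrained("law-ai/InCaseLawBERT")
+ encoded = tokenizer("The petition is dismissed.")  # input_ids, token_type_ids, attention_mask
+ ```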
@@ -31,10 +33,26 @@ Using the model to get embeddings/representations for a sentence
 from transformers import AutoModel
 model = AutoModel.from_pretrained("law-ai/InCaseLawBERT")
 ```
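+ A fuller usage sketch (our addition): combining the tokenizer and the model to embed a sentence, taking the final-layer [CLS] vector as the sentence representation (one common choice; mean pooling over tokens is another):
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+ 
+ tokenizer = AutoTokenizer.from_pretrained("law-ai/InCaseLawBERT")
+ model = AutoModel.from_pretrained("law-ai/InCaseLawBERT")
+ 
+ sentence = "The appeal is allowed and the order of the High Court is set aside."
+ inputs = tokenizer(sentence, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+ # [CLS] token of the last hidden layer: one 768-dim vector for the sentence
+ cls_embedding = outputs.last_hidden_state[:, 0, :]
+ print(cls_embedding.shape)  # torch.Size([1, 768])
+ ```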
- Using the model for further pre-training with MLM and NSP
- ```python
- from transformers import BertForPreTraining
- model_with_pretraining_heads = BertForPreTraining.from_pretrained("law-ai/InCaseLawBERT")
- ```
 
- ### Citation
+ ### Fine-tuning Results
+ 
+ ### Citation
+ ```
+ @inproceedings{paul-2022-ptinlegal,
+   title = "Pre-training Transformers on Indian Legal Text",
+   author = "Paul, Shounak and
+     Mandal, Arpan and
+     Goyal, Pawan and
+     Ghosh, Saptarshi",
+   eprinttype = {arXiv}
+ }
+ ```
+ ### About Us
+ We are a group of researchers from the Department of Computer Science and Technology, Indian Institute of Technology Kharagpur.
+ Our research interests are primarily ML and NLP applications for the legal domain, with a special focus on the challenges and opportunities for the Indian legal scenario.
+ We have worked on, and are currently working on, several legal tasks, such as:
+ * named entity recognition and summarization of legal documents
+ * semantic segmentation of legal documents
+ * legal statute identification from facts and court judgment prediction
+ * legal document matching
+ You can find our publicly available code and datasets [here](https://github.com/Law-AI).